Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Turn nrefs into atomic_int #3370

Merged
merged 4 commits into from
Oct 15, 2024
Merged

Conversation

ffesti
Copy link
Contributor

@ffesti ffesti commented Oct 11, 2024

Throughout the code base reference counting is done in a thread-unsave way. A few places use the content lock (aka mutex) to also protect the reference counter. This is wrong and complicates freeing interconnected data structures. This lock should only protect the content of the instance.

Turn all nrefs members into atomic_int so they are thread-save themselves.

Make sure to only use ++ and -- operators and avoid non atomic combinations of reading and writing. Drop all the static *Unlink functions that are involved in this.

Make sure the data lock is not aquired while updating the nrefs members.

Technically the *Free() functions should *Link() the instance after the ref count has reached 0 and the decision to free it has been made. Otherwise the cleanup code linking and freeing to will lead to a recusions. Right now only rpmts needs that. SO it is omitted everywhere else.

FD_t is special in that fdFree return the instance after free iff it still has not reached a ref count of 0. Unfortunately code in rpmShowProgress() relies on that.

This does not make the data structures thread save on its own. But it gives a foundation on which data locking can be implemented.

@ffesti ffesti requested a review from a team as a code owner October 11, 2024 06:38
@ffesti ffesti requested review from dmnks and removed request for a team October 11, 2024 06:38
@pmatilai
Copy link
Member

pmatilai commented Oct 11, 2024

Whoa. Unfortunately it's not quite that simple, the increment is atomic but the comparison is not.

It's a step in the right direction though and we can take that step, but slice this up a bit: the places where the change is just a straightforward and obvious Unlink -> --nrefs change can go in one bigger commit, but do each of the more special cases and involved changes as a separate commits, each with their own explanations. Makes it easier to bisect it if something goes wrong.

Mind, this is not a review, just took a quick overview of it.

@pmatilai
Copy link
Member

And to make it clear: let's NOT get into the atomic compare-and-exchange stuff just here and now, this is a good intermediate step to that direction.

Other than that, I'll leave the review to @dmnks as per GH request, just commenting as I happened to notice this on the PR list 😅

@ffesti
Copy link
Contributor Author

ffesti commented Oct 11, 2024

To me https://en.cppreference.com/w/cpp/atomic/atomic/operator_arith reads like the value returned from the ++ and -- operators are indeed obtained atomically:

    operator++() performs atomic pre-increment. Equivalent to return fetch_add(1) + 1;.
    operator++(int) performs atomic post-increment. Equivalent to return fetch_add(1);.
    operator--() performs atomic pre-decrement. Equivalent to return fetch_sub(1) - 1;.
    operator--(int) performs atomic post-decrement. Equivalent to return fetch_sub(1);. 

But then there is all this talk about fences and memory_order that makes my head spin. So what do I know?

@pmatilai
Copy link
Member

But then there is all this talk about fences and memory_order that makes my head spin. So what do I know?

I thought we basically agreed to learn the ropes with the keyring+keys first, before changing anything else. I seem to recall that even being your suggestion 😅

@pmatilai
Copy link
Member

To me https://en.cppreference.com/w/cpp/atomic/atomic/operator_arith reads like the value returned from the ++ and -- operators are indeed obtained atomically:

Yes, that much is atomic. But the number of actual references can change between the returned information and the comparison, it's a separate instruction already. It's not like I eat this stuff for breakfast either, but AIUI handling this requires the atomic compare-exchange thing. Of the memory models I've no idea either.

@ffesti
Copy link
Contributor Author

ffesti commented Oct 14, 2024

Yes, this was an issue in the previous code as it did the comparison first and then decided what to do. This here is different. If nrefs reaches 0 it is no longer allowed to change as no one should have a reference. So if the -- returns zero there is no longer a race going on. So we can do the comparison later and then free the instance in peace.

If the -- does not return 0 it will for some other call to *Free. So if we decide not to free here it is still ensured someone else will. So even if the value of nrefs changes to 0 between the change and the comparison everything works out in the end.

@pmatilai
Copy link
Member

Oh... right 💡 Reaching zero is indeed special for our purposes here, that part I missed. Would be worth explicitly explaining in the commit message - it kinda touches on it in the "decision to free has been made" but it's not clear.

@pmatilai
Copy link
Member

This stuff really puts your brain through a twister 😄

@pmatilai
Copy link
Member

pmatilai commented Oct 14, 2024

From the commit message

Technically the *Free() functions should *Link() the instance after the
ref count has reached 0 and the decision to free it has been made.
Otherwise the cleanup code linking and freeing to will lead to a
recusions. This is omitted here as it is not needed.

This is actually rather confusing, because a destructor should not be mucking around with reference counts. If the reference count reached zero and we decided that it's time to kill it, then we are the last thing that can be accessing it, no?

If the rpmts stuff really needs these gimmicks then that's a special case needing the explanation indeed, but drop it from elsewhere.

@ffesti
Copy link
Contributor Author

ffesti commented Oct 14, 2024

Yes, this is confusing, still it is not wrong. If the destructor code itself e.g. uses an iterator to free up sub components that iterator may add and then free a reference. This leads to a loop that tries to free the item over and over again unless the destructor code holds a proper reference.

From an conceptual POV it is also correct that the destructor basically takes the last reference from the caller and holds it until it is finished with the clean up.

@pmatilai
Copy link
Member

pmatilai commented Oct 14, 2024

IFF the destructor does such things - it shouldn't really. A destructor certainly should not be allocating anything new.
That rpm may do so in someplaces doesn't make it right. I do remember rpmts cleanup being super funky though, so no surprises in that requiring special care - been there.

@ffesti
Copy link
Contributor Author

ffesti commented Oct 14, 2024

I'd rather say the destrutors of most of our data types are super simple so they don't run into this issue. If you do something (serious) with an instance you need a valid reference - even if you are the destructor.

@pmatilai
Copy link
Member

No, the point is that a destructor is not allowed to fail, so basically any destructor that needs to do allocations or references is broken. Like the rpmtsFree() path.

All I'm asking here is to drop the claim of linking being the right thing to do in any general case because it's not.

@ffesti
Copy link
Contributor Author

ffesti commented Oct 14, 2024

Added a "reaching zero" sentence to the commit messages.

@dmnks
Copy link
Contributor

dmnks commented Oct 14, 2024

If nrefs reaches 0 it is no longer allowed to change as no one should have a reference.

In other words, thread synchronization for the an object's lifecycle is still supposed to be handled explicitly by the programmer. This PR only addresses the nrefs counter so that it can't be racy. Is that correct?

@dmnks
Copy link
Contributor

dmnks commented Oct 14, 2024

If that's the case, how does the change from the unlink functions to the inline --nrefs > 0 statements relate to making nrefs atomic? The cleanup functions themselves are still racy.

@ffesti
Copy link
Contributor Author

ffesti commented Oct 14, 2024

Yes, this is just about nrefs and does not add thread safety for the content of the data types.

--nrefs decreases the nrefs and returns the new value atomically. As reaching 0 means no one has a reference it "can not" increase again. So checking the value later is fine. Note that we get a copy of the value returned from --nrefs so other changes to the nrefs attibute won't change the return value of the -- operator.

So during the -- operator the decision if made whether we need to free the instance. And we get a copy of that decision. So the *Free() function is not racy.

@dmnks
Copy link
Contributor

dmnks commented Oct 15, 2024

My point is, if two threads are running e.g. headerFree() while nrefs is 1, then both threads will fail the condition --nrefs > 0, even if the nrefs decrement itself is atomic, and thus proceed to the cleanup.

I guess your point is that you can't have two threads accessing a header object while nrefs is 1 because that's a programming error. That's of course true, but then, isn't that the point of reference counting in the first place that it prevents such programming errors?

It's likely that I just misunderstand the purpose of nrefs altogether here, so take that with a grain of salt...

@ffesti
Copy link
Contributor Author

ffesti commented Oct 15, 2024

Yes, this all only works if everyone is using *Link and *Free as pairs.

@dmnks
Copy link
Contributor

dmnks commented Oct 15, 2024

Yes, this all only works if everyone is using *Link and *Free as pairs.

OK, paired with some background on "why reference counting" (in rpm) from @pmatilai now, somehow this makes a lot more sense.

@dmnks
Copy link
Contributor

dmnks commented Oct 15, 2024

With that part clarified, the patches LGTM. I've made a tiny code style fixup so I guess I'll squash it and then merge.

@dmnks
Copy link
Contributor

dmnks commented Oct 15, 2024

Oh, and @pmatilai and @ffesti thanks for the patience (in explaining the refcount stuff to me)! 😄

Throughout the code base reference counting is done in a thread-unsave
way.

Turn nrefs members into atomic_int so they are thread-save
themselves.

Make sure to only use ++ and -- operators and avoid non atomic
combinations of reading and writing. As the nref can't change again
after reaching zero this is thread-save even as we look at the result
an instruction later. Drop all the static *Unlink functions.

Technically the *Free() functions should *Link() the instance after the
ref count has reached 0 and the decision to free it has been made.
Otherwise the cleanup code linking and freeing to will lead to a
recusions. This is omitted here as it is not needed.

This does not make the data structures thread save on its own. But it
gives a foundation on which further thread safety work can be based on.
Throughout the code base reference counting is done in a thread-unsave
way.

Turn nrefs into atomic_int so it is thread-save.

Make sure to only use ++ and -- operators and avoid non atomic
combinations of reading and writing.  As the nref can't change again
after reaching zero this is thread-save even as we look at the result
an instruction later. Drop all the static *Unlink functions.

Technically all the *Free() functions should *Link() the instance after the
ref count has reached 0 and the decision to free it has been made.
Otherwise the cleanup code linking and freeing to will lead to a
recusions. Right now only rpmts needs that. So we only do that here.

This does not make the data structures thread save on its own. But it
gives a good foundation.
Throughout the code base reference counting is done in a thread-unsave
way.

Turn nrefs into atomic_int so they are thread-save themselves. As the
nref can't change again after reaching zero this is thread-save even
as we look at the result an instruction later.

FD_t is special in that fdFree return the instance after free iff it
still has not reached a ref count of 0. Unfortunately code in
rpmShowProgress() relies on that.

This does not make the data structures thread save on its own. But it
gives a foundation on which data locking can be implemented.
These data structures use the content lock (aka mutex) to also protect the
reference counter. This is wrong and complicates freeing interconnected
data structures. This lock should only protect the content of the
instance.

Turn nrefs members into atomic_int so they are thread-save themselves.

Make sure to only use ++ and -- operators and avoid non atomic
combinations of reading and writing.  As the nref can't change again
after reaching zero this is thread-save even as we look at the result
an instruction later.

Make sure the data lock is not aquired for updating the nrefs members.

Technically the *Free() functions should *Link() the instance after the
ref count has reached 0 and the decision to free it has been made.
Otherwise the cleanup code linking and freeing to will lead to a
recusions. This is not neccessary here so it is omitted.
@dmnks dmnks merged commit a695776 into rpm-software-management:master Oct 15, 2024
1 check passed
@pmatilai
Copy link
Member

pmatilai commented Oct 15, 2024

And so, rpm enters the atomic age :atom:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants