Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AddressSanitizer: heap-use-after-free in object destructor #492

Closed
veprbl opened this issue Sep 27, 2023 · 9 comments · Fixed by #514
Closed

AddressSanitizer: heap-use-after-free in object destructor #492

veprbl opened this issue Sep 27, 2023 · 9 comments · Fixed by #514
Assignees

Comments

@veprbl
Copy link
Contributor

veprbl commented Sep 27, 2023

  • OS version: Linux or macOS
  • Compiler version: gcc version 11.3.0
  • PODIO version: main or older releases
  • Reproduced by:
CXXFLAGS=-fsanitize=address LDFLAGS=-fsanitize=address cmake -DCMAKE_BUILD_TYPE=Debug -S . -B build
cmake --build build
build/tests/unittests/unittest_podio
  • Input: none
  • Output:
Randomness seeded to: 3728155408
=================================================================
==3061454==ERROR: AddressSanitizer: heap-use-after-free on address 0x606000020068 at pc 0x7feda4d2a6ec bp 0x7ffcfea450a0 sp 0x7ffcfea45098
READ of size 4 at 0x606000020068 thread T0
   #0 0x7feda4d2a6eb in podio::ObjBase::release() ./include/podio/ObjBase.h:27
   #1 0x7feda4d2a6eb in ExampleCluster::~ExampleCluster() ./tests/src/ExampleCluster.cc:48
   #2 0x7feda4d8ee13 in ExampleWithOneRelationObj::~ExampleWithOneRelationObj() ./tests/src/ExampleWithOneRelationObj.cc:23
   #3 0x7feda4d8ee48 in ExampleWithOneRelationObj::~ExampleWithOneRelationObj() ./tests/src/ExampleWithOneRelationObj.cc:24
   #4 0x7feda4d8fab9 in ExampleWithOneRelationCollectionData::clear(bool) ./tests/src/ExampleWithOneRelationCollectionData.cc:50
   #5 0x7feda4d81a2c in ExampleWithOneRelationCollection::~ExampleWithOneRelationCollection() ./tests/src/ExampleWithOneRelationCollection.cc:40
   #6 0x7feda4d81a88 in ExampleWithOneRelationCollection::~ExampleWithOneRelationCollection() ./tests/src/ExampleWithOneRelationCollection.cc:41
   #7 0x7feda5a56722 in podio::EventStore::~EventStore() ./src/EventStore.cc:14
   #8 0x4b6384 in CATCH2_INTERNAL_TEST_6 ./tests/unittests/unittest.cpp:105
   #9 0x557269 in Catch::RunContext::invokeActiveTestCase() (./build/tests/unittests/unittest_podio+0x557269)
   #10 0x5576bc in Catch::RunContext::runCurrentTest(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&) (./build/tests/unittests/unittest_podio+0x5576bc)
   #11 0x557b11 in Catch::RunContext::runTest(Catch::TestCaseHandle const&) (./build/tests/unittests/unittest_podio+0x557b11)
   #12 0x560bd4 in Catch::Session::runInternal() (./build/tests/unittests/unittest_podio+0x560bd4)
   #13 0x560edb in Catch::Session::run() (./build/tests/unittests/unittest_podio+0x560edb)
   #14 0x43d3eb in main (./build/tests/unittests/unittest_podio+0x43d3eb)
   #15 0x7feda342924d in __libc_start_call_main (/nix/store/bzd91shky9j9d43girrrj6vmqlw7x9m8-glibc-2.35-163/lib/libc.so.6+0x2924d)
   #16 0x7feda3429308 in __libc_start_main_impl (/nix/store/bzd91shky9j9d43girrrj6vmqlw7x9m8-glibc-2.35-163/lib/libc.so.6+0x29308)
   #17 0x43d484 in _start (./build/tests/unittests/unittest_podio+0x43d484)

0x606000020068 is located 8 bytes inside of 64-byte region [0x606000020060,0x6060000200a0)
freed by thread T0 here:
   #0 0x7feda50b3db7 in operator delete(void*, unsigned long) (/nix/store/4v2bk6almk03mfnz4122dfz8vcxynvs3-gcc-11.3.0-lib/lib/libasan.so.6+0xb3db7)
   #1 0x7feda4d3bf39 in ExampleClusterCollectionData::clear(bool) ./tests/src/ExampleClusterCollectionData.cc:74
   #2 0x61500000287f  (<unknown module>)

previously allocated by thread T0 here:
   #0 0x7feda50b2f37 in operator new(unsigned long) (/nix/store/4v2bk6almk03mfnz4122dfz8vcxynvs3-gcc-11.3.0-lib/lib/libasan.so.6+0xb2f37)
   #1 0x7feda4d25c79 in ExampleClusterCollection::create() ./tests/src/ExampleClusterCollection.cc:82
   #2 0x7ffcfea4581f  ([stack]+0x1d81f)
   #3 0x7ffcfea461bf  ([stack]+0x1e1bf)

SUMMARY: AddressSanitizer: heap-use-after-free ./include/podio/ObjBase.h:27 in podio::ObjBase::release()
Shadow bytes around the buggy address:
 0x0c0c7fffbfb0: fd fd fd fd fa fa fa fa fd fd fd fd fd fd fd fd
 0x0c0c7fffbfc0: fa fa fa fa fd fd fd fd fd fd fd fd fa fa fa fa
 0x0c0c7fffbfd0: fd fd fd fd fd fd fd fd fa fa fa fa fd fd fd fd
 0x0c0c7fffbfe0: fd fd fd fd fa fa fa fa fd fd fd fd fd fd fd fd
 0x0c0c7fffbff0: fa fa fa fa fd fd fd fd fd fd fd fd fa fa fa fa
=>0x0c0c7fffc000: fd fd fd fd fd fd fd fd fa fa fa fa fd[fd]fd fd
 0x0c0c7fffc010: fd fd fd fd fa fa fa fa fd fd fd fd fd fd fd fd
 0x0c0c7fffc020: fa fa fa fa fd fd fd fd fd fd fd fd fa fa fa fa
 0x0c0c7fffc030: fd fd fd fd fd fd fd fd fa fa fa fa fd fd fd fd
 0x0c0c7fffc040: fd fd fd fd fa fa fa fa fd fd fd fd fd fd fd fd
 0x0c0c7fffc050: fa fa fa fa fd fd fd fd fd fd fd fd fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
 Addressable:           00
 Partially addressable: 01 02 03 04 05 06 07 
 Heap left redzone:       fa
 Freed heap region:       fd
 Stack left redzone:      f1
 Stack mid redzone:       f2
 Stack right redzone:     f3
 Stack after return:      f5
 Stack use after scope:   f8
 Global redzone:          f9
 Global init order:       f6
 Poisoned by user:        f7
 Container overflow:      fc
 Array cookie:            ac
 Intra object redzone:    bb
 ASan internal:           fe
 Left alloca redzone:     ca
 Right alloca redzone:    cb
 Shadow gap:              cc
==PID==ABORTING

It looks like there are some issues with lifetimes (R in Rust stands for Rewrite) when Object tries to update ref counter in the destructor:

{{ full_type }}::~{{ full_type }}() {
if (m_obj) {
m_obj->release();
}
}

cc @wdconinc @nathanwbrei

@tmadlener
Copy link
Collaborator

This looks very similar to #174 (with a more thorough discussion as well).

There are a few tests that we explicitly exclude when we run with AddressSanitizer (marked with ASAN_FAIL), e.g.

TEST_CASE("Clearing", "[UBSAN-FAIL][ASAN-FAIL][THREAD-FAIL][basics][memory-management]") {

We haven't fixed this yet because we were (apparently wrongly) convinced that in normal usage it should not occur. The main reason this occurs is, if an object is kept around longer than the collection it belongs to. One of the usual cases where I encounter this (semi-regularly) is if objects are used in maps or other containers outside of their collections. If these containers are cleared only at the beginning of the next event things might break and containers would need to be cleared at the end of the current event.

We have a draft solution in #220 (which would need reviving) that solves some but not all of the issues.

@tmadlener
Copy link
Collaborator

This will actually also happen in normal usage, I think. Since the order in which elements are destroyed matters in this case, and we cannot guarantee any well defined order (nor is that even possible for types that can form cyclic relations). Also it makes running anything with an Address sanitizer a PITA, so I will try to have a look at this.

Referencing #250 here just to have this also show up there.

@veprbl
Copy link
Contributor Author

veprbl commented Oct 8, 2023

I think we were able to mitigate the issue by using unlink(), similar to what is done for collections. That relies on a few behaviors of PODIO, but at least gets us unblocked for now.

Thanks to your pointers, I appreciate the problem a bit. It makes me wonder if an appropriate stopgap measure would be to do refcounting unconditionally. Then one could check refs in collection deallocator and throw some helpful exceptions? The issue is that use-after-free does not necessarily lead to a segfault, which makes such bugs a bit hard to find and fix. Otherwise, we can close this issue as it has served its purpose.

@veprbl
Copy link
Contributor Author

veprbl commented Oct 25, 2023

Actually, it didn't get resolved, the issue has moved into the collection destructor itself:

         `- edm4eic::ReconstructedParticle::~ReconstructedParticle() (0x2ba6198da0b8)
          `- edm4eic::MCRecoParticleAssociationObj::~MCRecoParticleAssociationObj() (0x2ba619976968)
           `- edm4eic::MCRecoParticleAssociationObj::~MCRecoParticleAssociationObj() (0x2ba6199769b9)
            `- edm4eic::MCRecoParticleAssociationCollectionData::clear(bool) (0x2ba619979af3)
             `- edm4eic::MCRecoParticleAssociationCollection::~MCRecoParticleAssociationCollection() (0x2ba619976db9)
              `- edm4eic::MCRecoParticleAssociationCollection::~MCRecoParticleAssociationCollection() (0x2ba619976ff9)

Moreover, I'm seeing occasional failures when running DD4hep ctest during build on macOS:

### CAUGHT SIGNAL: 11 ### address: 0x7f88e35c9f18,  signal =  SIGSEGV, value =   11, description = segmentation violation. Address not mapped to object.
Backtrace:
[PID=90690, TID=-1][ 0/41]> 0   ???                                 0x0000000115a1df87 0x0 + 4657897351
[PID=90690, TID=-1][ 1/41]> 1   libedm4hep.dylib                    0x000000012aaab3b5 _ZN7edm4hep22CaloHitContributionObjD0Ev + 37
[PID=90690, TID=-1][ 2/41]> 2   libedm4hep.dylib                    0x000000012aaabc23 _ZN7edm4hep33CaloHitContributionCollectionData5clearEb + 643
[PID=90690, TID=-1][ 3/41]> 3   libedm4hep.dylib                    0x000000012aaa5e37 _ZN7edm4hep29CaloHitContributionCollectionD2Ev + 39
[PID=90690, TID=-1][ 4/41]> 4   libedm4hep.dylib                    0x000000012aaa5ece _ZN7edm4hep29CaloHitContributionCollectionD0Ev + 14
[PID=90690, TID=-1][ 5/41]> 5   libDDG4EDM4HEP.1.26.dylib           0x000000012a8acf1a _ZN5podio5Frame10FrameModelINS_6detail14EmptyFrameDataEED2Ev + 282
[PID=90690, TID=-1][ 6/41]> 6   libDDG4EDM4HEP.1.26.dylib           0x000000012a8ac5de _ZN5podio5Frame10FrameModelINS_6detail14EmptyFrameDataEED0Ev + 14
[PID=90690, TID=-1][ 7/41]> 7   libDDG4EDM4HEP.1.26.dylib           0x000000012a8a8335 _ZN6dd4hep3sim20Geant4Output2EDM4hep6commitERNS0_18Geant4OutputAction13OutputContextI7G4EventEE + 773
[PID=90690, TID=-1][ 8/41]> 8   libDDG4.1.26.dylib                  0x0000000121a546ee _ZN6dd4hep3sim18Geant4OutputAction3endEPK7G4Event + 430
[PID=90690, TID=-1][ 9/41]> 9   libDDG4.1.26.dylib                  0x0000000121a1e2cc _ZN6dd4hep3sim25Geant4EventActionSequence3endEPK7G4Event + 172
[PID=90690, TID=-1][10/41]> 10  libDDG4.1.26.dylib                  0x0000000121a1ea2d _ZN6dd4hep3sim21Geant4UserEventAction16EndOfEventActionEPK7G4Event + 45

@tmadlener tmadlener self-assigned this Oct 30, 2023
@tmadlener
Copy link
Collaborator

I am not sure I understand the DD4hep failure fully yet. From having a look into the code and the stacktrace it seems to me this could be a different issue with the same symptoms. Effectively what happens in Geant4OutputEDM4hep::commit is moving all collections that are held internally in various maps into a Frame, clearing the maps and then resetting the Frame. None of this should fail and as far as I can see, there is nothing in the Geant4OutputEDM4hep that holds on to EDM4hep objects at this point. This almost looks like a moved-from collection destroys more than it should, which then makes the actual destructor trip. I will check this hypothesis with a smaller reproducer.

For the original issue, is it possible to provide some more context in which this happens? E.g. the full stacktrace or some highlevel description and some potential us of maps that have objects or collections as keys or values? I am re-surrecting #220, and I am trying to get to a more complete collection of actual use cases that I have to consider for this.

@veprbl
Copy link
Contributor Author

veprbl commented Oct 30, 2023

The original issue was, to my understanding, triggered like so: eic/EICrecon#949 (comment)

@tmadlener
Copy link
Collaborator

Yes that looks more or less the same as what I have described in #174, i.e. an object that points into a collection, outlives the collection and the frame in which the collection lives. I suppose this object would only be need to used as long as the Frame lives, i.e. it is only necessary to properly destruct it without invoking use-after-free, after the frame is gone?

@veprbl
Copy link
Contributor Author

veprbl commented Oct 30, 2023

Yes. Hence see the unlink solution, that did work, although another, more rare, crash appeared on our radar.

@tmadlener
Copy link
Collaborator

Thanks for the clarifications. I think these crashes are in the end all caused by the same underlying issue and we should be able to fix that. I am not yet sure about the best solution, but I have put this more or less on the top of our todo list as I would like to fix this before v1.0.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants