Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merge EventStore #331

Closed
faustus123 opened this issue Sep 8, 2022 · 5 comments
Closed

Merge EventStore #331

faustus123 opened this issue Sep 8, 2022 · 5 comments

Comments

@faustus123
Copy link

!!!IMPORTANT!!!: To facilitate faster and easier response to your issue please provide in addition to the description of the issue also the following information

  • OS version: Linux (all)
  • Compiler version: gcc 11.3
  • PODIO version: 00-15
  • Reproduced by: Compile attached program, see instructions at top of podio_merge.cc
  • Input:
  • Output:
  • Goal: I would like a simple utility that can take events from multiple podio files and merge the collections into a single output file. The application would be for simulated data. I would like to add 1 signal event and N background events together. It would be a quick and dirty way to include backgrounds in the simulation prior to running reconstruction.

The attached code sort of works. It will produce the output file with merged sets of objects from all collections. However, it also blows up the number of references (branches with "#" in their names). Whatever it is doing there, is causing the program to run VERY slow (~1Hz) and produce a file that is about x10 larger than expected.

I realize handling the references properly is probably not trivial. It would be great if this could be done. If not, another alternative might be to have a way to drop all the references altogether.

Or, if you have a quicker way to achieve this ...

tmp.tgz

@faustus123
Copy link
Author

So this turns out to expose a very significant issue. At this point, we cannot write events to a podio file faster than about 0.1Hz. Thus, even our basic reconstruction code is completely hobbled from using podio as the output at the moment. We have burned a lot of time on trying to get this to work so really need so help figuring out what we are doing wrong. The code attached to the original post should illustrate the issue well.

@tmadlener
Copy link
Collaborator

tmadlener commented Sep 20, 2022

If I understand correctly your goal is to essentially merge several files (with the "same" contents). In that case the simplest solution should be

ROOTReader reader;
reader.openFiles(<filenames>); // The ROOTReader can handle several files
EventStore store;
store.setReader(&reader);
ROOTWriter writer("output.root", &store);
// register all collections for writing here via their name like you do currently with the collection ID table

for (int i = 0; i < reader.getEntries(); ++i) {
  writer.writeEvent();
  store.clear();
  reader.endOfEvent();
}
writer.finish();
reader.closeFile();

Did you try this already and run into issues? If yes, what were the issues you ran into?

Alternatively, the Frame (#287) has been merged in a first version now and is (sort of) a generalized thread-safe EventStore, that should make it much easier to have different Events on different threads for example.

@faustus123
Copy link
Author

faustus123 commented Sep 20, 2022

Thanks for this. I'm not quite sure if I understand how the reader gets connected with the EventStore used for writing above. Are you missing a store.setReader(reader) call? We also wanted to be able to add more than one event from the background file to a single event from the signal file. For example, we may want to add one event from the first file in <filenames> to 3 events from the second file.

This may be getting a little bit off track though since the primary issue is actually the same one brought up during our e-mail exchange that started back on July 24th. We never followed through on a meeting since I had to move on to other pressing issues with setting up EIC reconstruction and decided to defer this until later.

Here is recap of the issue:

In our reconstruction framework, we deal with individual objects. Objects are managed internally via std::vector containers of pointers (i.e. we do not use the Collection classes provided by podio as the containers). We also have our own class for managing these vectors that serves the purpose of the EventStore. The problem is how to write to a file objects we have in the form of std::vector<edm4hep::Cluster*>?

It seems the only supported mechanism is to clone them and add the clones to a edm4hep::ClusterCollection that is in an EventStore like this:

auto& collection = store->get<PodioCollectionT>(collection_name);
for (auto t : tobjs) {
   collection.push_back( t->clone() );
}

The process of cloning seems to be where the real problem lies. This process is incredibly slow and the number of references that end up in the output file balloons compared to what was in the input file. Even if we just do a straight copy using the above clone() mechanism. (File size blows up by order of magnitude).

Note that some of the edm4hep objects we want to write come from an input file so do already belong to a Collection which itself belongs to the input EventStore. At the point where we want to write objects to a file, we fill an output EventStore with both objects that were created by the reconstruction algorithms and copies of objects that came from the original podio input file. Thus, there is a symmetry in that all objects we want to write to the output file come to our Writer class as `std::vector<const edm4hep::X *>

If you want to look at the latest iteration we have of the code you can see it at this link. There is a lot of noise that may be a bit confusing, but Line 27 is where the real issue lies. (n.b. tobjs will be something like std::vector<const edm4hep::Cluster*>)

https://github.com/eic/EICrecon/blob/ddae3ffefc01394641436434e24eb22f52a4b9b9/src/services/io/podio/JEventProcessorPODIO.cc#L27

@tmadlener
Copy link
Collaborator

Are you missing a store.setReader(reader) call?

I most certainly am. Sorry about that. I will edit the above snippet to also include that.

We also wanted to be able to add more than one event from the background file to a single event from the signal file. For example, we may want to add one event from the first file in <filenames> to 3 events from the second file.

IIUC here, you do not want to merge these 4 events into one event, but rather have them stil as separate events, but with the sequence sig0, bkg0, bkg1, bkg2, sig1, bkg3, ...? That is something that should be more easily doable with the Frame as there you actually have the complete event contents contained in one place. If you want to merge the events into one event, that can become non-trivial depending on the event contents. E.g. merging collections of raw hits is usually not just a simple merging of the different collections, but you will also have to take into account effects from the readout electronics (e.g. saturation).

This may be getting a little bit off track though since the primary issue is actually the same one brought up during our e-mail exchange that started back on July 24th. We never followed through on a meeting since I had to move on to other pressing issues with setting up EIC reconstruction and decided to defer this until later.

I have put this onto the list of items to discuss at our joint meeting later today.

Brief introduction to Frame

Since we have this thread going already, let me try to give a very brief introduction to the Frame (brief documentation) here.

In essence the Frame is just a container to hold XYZCollections (and some "meta data"). The main point is that either the Frame owns the data or you own the data, and if you put things into the Frame you give up ownership and hand it to the Frame. In principle this would allow the following workflow in a FW: The FW controls the ownership of all the data, right up to the point where you want to write it, there you "simply convert" your vector<XYZ> into XYZCollection (we would have to introduce some functionality to make this easier than it is currently), give that collection to the Frame and then write that.

However, IIUC, that still would not solve your issue, right? Because you would like to hold on to the ownership in Jana2 and not give that over to the Frame, because that would essentially mean that you cannot write "intermediate" files, but could only start to write output files once the complete processing is done. On the other hand the Frame supports being written by multiple different writers, which can be configured to write different parts of the Frame, so essentially this would make it possible to simply shift the "intermediate" writing to the end of the processing and in the end have the same results.

@tmadlener
Copy link
Collaborator

Closing this as it seems to be no longer relevant with the introduction of the Frame and the removal of the EventStore. If this is still an issue feel free to re-open this one or create a new one.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants