Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prototype version of Frame I/O #287

Merged
merged 54 commits into from
Sep 16, 2022
Merged

Prototype version of Frame I/O #287

merged 54 commits into from
Sep 16, 2022

Conversation

tmadlener
Copy link
Collaborator

@tmadlener tmadlener commented May 23, 2022

BEGINRELEASENOTES

  • Introduce the podio::Frame as a generalized, thread-safe (event) data container.
    • This first version offers all necessary functionality and an almost finalized interface, i.e. we plan to keep this as stable as possible, but we might still change things if it turns out that there are better ways to do some things
    • For details about the basic interface and the underlying design considerations please consult the corresponding documentation
  • This will be the only way to work with podio data starting from version 1.0
    • For now the current I/O implementations remain in place unchanged, but they will be deprecated (and removed) in the not too distant future

ENDRELEASENOTES

This is a prototype version of Frame based I/O with the necessary changes to the Frame and the collections that are necessary to enable this. In its current form all existing I/O functionality remains intact. This current version supports the following:

  • Writing and reading via ROOT. The resulting files are different than the ones written by the ROOTWriter.
    • The reading side should already be able to handle different Frame categories , while the writing part mainly has all the preparations for, but not actual support yet.
  • Writing and reading via SIO. The resulting files are different than the ones written by the SIOWriter.
    • For SIO the written Frames are self contained in the sense that they store all the necessary information for each frame which makes the writer/reader almost stateless.

From a usage perspective the Frame behaves very similar to the current EventStore for reading purposes. This can be seen by the few changes that have been necessary to read_test.h to accommodate for the simultaneous usage with both. The major differences between the two at this point are:

  • Interface to get the parameters (event meta data in the EventStore).
  • No direct support yet for collection meta data in the Frame.

For writing there are some conceptual differences that make a common implementation of the test code rather hard:

  • The EventStore hands out a reference to the collection and makes sure to clear it for every event
  • The Frame starts empty in every event and the collections are newly created and added to the Frame

Some more technicalities

Quite a few changes had to be introduced to make everything work, the most important ones are:

  • Make it possible to instantiate collections from buffers and to create collections "from thin air". In the current implementation this is done by introducing CollectionReadBuffers that have a createCollection method that is dynamically assigned during runtime (and the necessary code is generated by code generation). This is necessary to separate the reading of the buffers (and their potential schema evolution) from the actual creation of the collections later.
    • This also includes the possibility to create "empty" CollectionBuffers that can then be filled by I/O backends when reading.
  • The new root file format will contain a TTree for every Frame category. Additionally, there is one meta data tree that holds all the information that is necessary for reading everything.
    • For ROOT the contents of any given category is fixed by the contents of the first element of a category that is written, as that is used to initialize the TTree and all the branches.
  • There are quite a few "quick and dirty" solutions to get a working prototype ready:
    • Collections still have to be registered for write in the ROOTFrameReader. It is not yet entirely clear to me how the best interface looks for this. In the latest iteration the collections to write have become an argument to writeFrame (with a default to write all available collections).
    • The current implementation of the buffer and collection creation could probably be implemented slightly cleaner
    • The Frame has quite some additional (public) API surface at the moment for the writer to get access to the necessary data.

TODOs

@tmadlener tmadlener force-pushed the frame-io branch 2 times, most recently from acb2e41 to 45ccc2f Compare May 30, 2022 15:59
@tmadlener tmadlener changed the title Prototype version of Frame I/O with root Prototype version of Frame I/O Jun 8, 2022
include/podio/SIORawData.h Outdated Show resolved Hide resolved
include/podio/ROOTFrameReader.h Outdated Show resolved Hide resolved
include/podio/ROOTReader.h Outdated Show resolved Hide resolved
Copy link
Collaborator

@hegner hegner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks very good. I only have minor comments so far.
What we should make sure is to have a decent internal documentation about the concepts there.

Could you work on the TODOs listed in the PR? Namely some docs and release notes? The unpacking we can put into another ticket. Thanks!

void openFiles(const std::vector<std::string>& filenames);

/// Read all collections requested
std::unique_ptr<podio::ROOTRawData> readNextFrame(const std::string& category);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what behaviour do you want to have in case of end of iteration? The doc string should reflect that

std::unique_ptr<podio::ROOTRawData> readNextFrame(const std::string& category);

/// Returns number of entries for the given category
unsigned getEntries(const std::string& category) const;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the term category is a bit jargon. You should explain a bit better what that is/

include/podio/ROOTFrameReader.h Outdated Show resolved Hide resolved
* Helper struct to group together all the necessary state to read / process a
* given category.
*/
struct CategoryInfo {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

again - explain "category"

include/podio/ROOTFrameReader.h Outdated Show resolved Hide resolved
include/podio/SIOFrameWriter.h Show resolved Hide resolved
include/podio/SIORawData.h Outdated Show resolved Hide resolved
python/templates/SIOBlock.h.jinja2 Outdated Show resolved Hide resolved
src/ROOTFrameReader.cc Outdated Show resolved Hide resolved
* The name of the meta data tree in podio ROOT files. This tree mainly stores
* meta data that is necessary for ROOT based I/O.
*/
constexpr static auto metaTreeName = "podio_metadata";
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure such information is ROOT specific. It is part of the file schema and could maybe go elsewhere

podio::version::Version currentFileVersion() const {
return m_fileVersion;
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// Returns stored collections in this category
std::map<std::string, detail::CollectionInfo> storedClasses(const std::string& catName);


return brNames;
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
std::map<std::string, detail::CollectionInfo> ROOTFrameReader::storedClasses(const std::string& catName) {
std::map<std::string, detail::CollectionInfo> colls_info;
if (auto& catInfo = getCategoryInfo(catName); catInfo.table) {
for (const auto& [col, col_info] : catInfo.storedClasses)
colls_info[col] = col_info;
}
return colls_info;
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tmadlener Very nice work around these frames! May I suggest this addition to ease their integration into consuming workflows? This allows the access to the TClass objects as retrieved from the TChain on readout, and thus allows retrieving a human- (and RDataSource-) readable version of each collection type.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @forthommel,

The detail::CollectionInfo are currently really an implementation detail of the ROOTFrameReader and I am not entirely sure if they should be exposed in an interface in their current form. Definitely not while they are still in the detail namespace. What is the information that is actually necessary for RDataSource to work nicely here? I would prefer a solution that only exposes what is really necessary instead of spilling everything here.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @tmadlener, and thanks for the feedback!
What my scheme would need is the collection/category type with name catName, as stored in the TChain. Unless it is already exposed elsewhere in the API and I missed it?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is nothing in the API yet that exposes any of those details, as the currently foreseen EDM API does not need anything like this. Additionally, it gives podio the freedom to treat I/O implementations as a detail instead of making e.g. the Branch layout in the root files part of the API.

IIUC, then you would need the XYZData (resp. vectors of them), right?

I would suggest to move that discussion/implementation to another PR entirely, because this one is (too) large already and we would actually like to keep this one in its current form to make it possible to implement Schema evolution on top of this before iterating further.

- Make sure that frames can be read in different order than the on they
have been written in
- Make all Frame constructors go through that
- Makes sure that the EmptyRawData actually needs to complete RawData interface
Would need to be a public type of the ROOTFrameReader otherwise and
would thus essentially leak an implementation detail.
- readNextFrame implies it returns a Frame which it does not
Clarify what it refers to in the implementation
Otherwise might pick up podio from LCG which seems to deadlock
@tmadlener
Copy link
Collaborator Author

@hegner as discussed briefly today:

  • Added documentation and some release notes (the link in the release notes will become valid once the PR is actually merged)
  • Rebased this onto the latest master and fix some minor issues with CMake to make sure everything compiles again (these were introduced in Add podio-dump tool to dump contents of podio files #323)

As far as a first version goes this is complete and can be reviewed, resp. considered stable for schema evolution purposes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants