Support for reading multiple files with SIO #647

Open
tmadlener opened this issue Jul 24, 2024 · 4 comments

Comments

@tmadlener
Collaborator

Currently the SIOReader does not support automatic file switching.

@SaxenaAnushka102
Contributor

SaxenaAnushka102 commented Sep 29, 2024

Hello @tmadlener! I'd love to work on this. Could you please provide more details on the expected implementation? From my understanding, the new functionality will involve defining methods in SIOReader.h and implementing them in SIOReader.cc for automatic file switching. Does this approach align with what's expected?

@tmadlener
Collaborator Author

Hi @SaxenaAnushka102, this will involve quite a bit of C++ and, after having had another look into it, it might not be as straightforward as I had originally anticipated. Nevertheless, let me lay out the basic pieces of work that would need to be done and then you can still decide if you want to embark on this :)

The main work will indeed have to happen in the SIOReader. Then there are a few minor things in other places to remove some "artificial" restrictions. We will also fix a small issue along the way.


The main building blocks of the SIOReader are:

  • a sio::ifstream, which is just a typedef for std::ifstream: the file stream from which we read the data.
  • a SIOFileTOCRecord that basically just contains a bunch of file positions. We have different Frame categories in podio and each category can have arbitrarily many entries in the file. The SIOFileTOCRecord simply maps each category name to the file positions of all entries of that category.
  • a map that maps the name of the category to the entry number that has last been read for this category.

There are a few more members, but we don't care too much about them here.
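
To make the roles of the TOC record and the per-category counter a bit more concrete, here is a minimal sketch. `TocSketch` and `ReaderStateSketch` are hypothetical stand-ins written for this comment, not the actual podio classes, and the position type is an assumption.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for SIOFileTOCRecord: category name -> file positions
class TocSketch {
public:
  using Position = std::uint64_t; // assumption: the real type is a stream position

  void addRecord(const std::string& category, Position pos) {
    m_positions[category].push_back(pos);
  }

  // Number of entries stored for a category (0 if the category is unknown)
  std::size_t getNEntries(const std::string& category) const {
    const auto it = m_positions.find(category);
    return it == m_positions.end() ? 0 : it->second.size();
  }

  // Throws std::out_of_range if the category or the entry does not exist
  Position getPosition(const std::string& category, std::size_t iEntry) const {
    return m_positions.at(category).at(iEntry);
  }

private:
  std::map<std::string, std::vector<Position>> m_positions;
};

// Hypothetical reader state: the TOC plus the next entry to read per category
struct ReaderStateSketch {
  TocSketch toc;
  std::map<std::string, std::size_t> nextEntry;

  // Position of the next unread entry of a category; advances the counter
  TocSketch::Position nextPosition(const std::string& category) {
    const auto pos = toc.getPosition(category, nextEntry[category]);
    ++nextEntry[category];
    return pos;
  }
};
```

Each category advances independently here, which is the property that makes switching between files non-trivial later on.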

The small issue

The SIOReader has a bool readFileTOCRecord() that reads a table of contents (TOC) record, which is then used throughout to jump to different positions inside the file. However, openFile never checks whether this call actually succeeds.

podio/src/SIOReader.cc

Lines 23 to 24 in 8651fdd

// NOTE: reading TOC record first because that jumps back to the start of the file!
readFileTOCRecord();

There really should be a check there and if it returns false, then we need to throw a std::runtime_error telling us that we cannot use this file.


The actual work

The main part of the work will be to make sure that we switch files whenever necessary. The main challenge in this case will be that when reading these files, there can be different categories and it is possible that we read some categories much faster than we do others. Hence, it's possible that we run out of entries for one category in a file, while we still have entries for other categories in this file. Obviously, we still need to be able to go back to those entries. This means that we will have to keep track of which entries of each category are in which file and then potentially open another file when the current entry is not in the file that is currently open. We could in principle also keep multiple files open at the same time, but then we have to keep track of all of them and we will potentially have very many open file handles, which is something we would like to avoid.

The steps that we will have to take are:

  • Add some structure to keep track of which entry is in which file. It should be usable in a similar way to the current SIOFileTOCRecord, but instead of just returning the file position, it should return the correct ifstream and the position in that file. It will be populated once at the beginning, so that it can simply be queried afterwards.
  • Add an openFiles method to the SIOReader that takes a const std::vector<std::string>& (aka the filenames) as inputs.
    • Implement the openFile method in terms of this one to avoid code duplication (this is also done, e.g. for the ROOTReader)
    • In openFiles we would simply open each file once, read its SIOFileTOCRecord and use that to populate the structure described in the first step. This would require that the readFileTOCRecord method returns a newly constructed SIOFileTOCRecord instead of populating an internal one.
  • For simplicity, we assume that the podio header and EDM definitions are the same in all files, read them only from the first one, and ignore them in the subsequent files.

I think most of the functionality can be put into the track keeping structure, such that the SIOReader can be left almost untouched.
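
The steps above could be sketched roughly as follows. Everything here is hypothetical: `MultiFileReaderSketch` is not the real SIOReader, `TocSketch` stands in for SIOFileTOCRecord, and the TOC reading is injected as a callable (returning an empty TOC on failure) purely so that the sketch runs without real SIO files.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for SIOFileTOCRecord
using TocSketch = std::map<std::string, std::vector<unsigned long>>;

class MultiFileReaderSketch {
public:
  // ASSUMPTION: reads the TOC of one file and returns an empty TOC on failure
  using TocLoader = std::function<TocSketch(const std::string&)>;

  explicit MultiFileReaderSketch(TocLoader loader) : m_loadToc(std::move(loader)) {}

  // Single-file case implemented in terms of the multi-file one
  void openFile(const std::string& filename) { openFiles({filename}); }

  void openFiles(const std::vector<std::string>& filenames) {
    if (filenames.empty()) {
      throw std::runtime_error("No input files given");
    }
    std::vector<TocSketch> tocs;
    for (const auto& name : filenames) {
      auto toc = m_loadToc(name); // returns a new TOC instead of filling a member
      if (toc.empty()) {
        throw std::runtime_error("Cannot read the TOC record of " + name);
      }
      tocs.push_back(std::move(toc));
    }
    // Only commit once all files have been read successfully
    m_filenames = filenames;
    m_tocs = std::move(tocs);
  }

  // Total number of entries of a category over all files
  std::size_t getEntries(const std::string& category) const {
    std::size_t n = 0;
    for (const auto& toc : m_tocs) {
      const auto it = toc.find(category);
      if (it != toc.end()) {
        n += it->second.size();
      }
    }
    return n;
  }

private:
  TocLoader m_loadToc;
  std::vector<std::string> m_filenames;
  std::vector<TocSketch> m_tocs;
};
```

Collecting the TOC records in a local vector first means that a faulty file leaves the reader in its previous state instead of a half-initialized one.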

The structure that keeps track of which entry is in which file would look something like this (from an interface point of view)

struct EntryFileMap {
  EntryFileMap(const std::vector<std::string>& filenames, const std::vector<podio::SIOFileTOCRecord>& tocRecords);
  std::tuple<sio::ifstream&, unsigned> getFileAndPosition(const std::string& category, unsigned iEntry) const;
};

For the implementation one would probably go with members along these lines:

  std::vector<std::string> m_filenames; ///< All the file names
  sio::ifstream m_currentStream;  ///< The currently open stream / file
  unsigned m_currentStreamIdx;  ///< The index in the filenames vector of the currently open stream
  std::vector<SIOFileTOCRecord> m_fileTOCRecords; ///< The toc records of each file

Then getFileAndPosition should be implementable in a fairly straightforward manner:

  • Figure out in which file the passed entry will be
  • Check if that file is the current file
    • If not: close the current file and open the appropriate one
  • Get the position in that file from the corresponding toc record
  • Return the filestream and the position
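
A rough sketch of that lookup logic, under a few assumptions: entries of a category are numbered globally across files in the order of the filenames, `TocSketch` stands in for the real TOC record, and instead of an actual stream the sketch returns the index of the file and counts how often it would have to switch. All names are hypothetical.

```cpp
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

using Position = unsigned long; // assumption: stands in for a stream position
using TocSketch = std::map<std::string, std::vector<Position>>;

class EntryFileMapSketch {
public:
  EntryFileMapSketch(std::vector<std::string> filenames, std::vector<TocSketch> tocs)
      : m_filenames(std::move(filenames)), m_tocs(std::move(tocs)) {
    if (m_filenames.size() != m_tocs.size()) {
      throw std::invalid_argument("Need exactly one TOC record per file");
    }
  }

  // Returns (index of the file holding the entry, position inside that file)
  std::pair<std::size_t, Position> getFileAndPosition(const std::string& category,
                                                      std::size_t iEntry) {
    // Walk through the files, turning the global entry into a file-local one
    for (std::size_t iFile = 0; iFile < m_tocs.size(); ++iFile) {
      const auto it = m_tocs[iFile].find(category);
      const std::size_t nInFile = (it == m_tocs[iFile].end()) ? 0 : it->second.size();
      if (iEntry < nInFile) {
        if (iFile != m_currentIdx) {
          // A real implementation would close the current stream here and
          // open m_filenames[iFile] instead
          m_currentIdx = iFile;
          ++m_nSwitches;
        }
        return {iFile, it->second[iEntry]};
      }
      iEntry -= nInFile;
    }
    throw std::out_of_range("No such entry for category '" + category + "'");
  }

  std::size_t nSwitches() const { return m_nSwitches; }

private:
  std::vector<std::string> m_filenames;
  std::vector<TocSketch> m_tocs;
  std::size_t m_currentIdx{0}; // the first file is considered open initially
  std::size_t m_nSwitches{0};
};
```

The method is non-const in this sketch because it changes which file is current; keeping a const interface would require the stream-related members to be mutable. The linear scan over files could be replaced by precomputed cumulative entry counts per category if it ever shows up in profiles.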

@SaxenaAnushka102
Contributor

Hi @tmadlener,
Thanks so much for the detailed explanation! While it sounds like there's a bit more complexity than I initially anticipated, I'm still very interested in working on this and getting some hands-on experience with the SIOReader code.
To get started, I had a couple of questions for clarification:
In the new structure EntryFileMap, how will it handle situations where an entry might span multiple files?
For the openFiles method, is there a specific way we want to handle potential errors during the TOC record reading process for each file?
I'm happy to jump in and start working on the structure that keeps track of entries and files (potentially EntryFileMap). Do you have any specific guidance on the implementation approach for this part? Also, are there any project community calls that I can be a part of to know more & discuss this further?
Thanks,
Anushka.

@tmadlener
Collaborator Author

Glad to hear it :) And obviously very happy to help along the way. To answer your specific questions:

In the new structure EntryFileMap, how will it handle situations where an entry might span multiple files?

Entries will always be fully within one file, unless there is a problem with the file. However, in that case there is not too much we can do in any case.

For the openFiles method, is there a specific way we want to handle potential errors during the TOC record reading process for each file?

For the first version we simply go the easiest way: if we find an issue in any of the files during TOC reading, we just throw an exception, as there is again not too much we can do with faulty input files.

Do you have any specific guidance on the implementation approach for this part?

I would start with something along the lines of my original comment. I am fairly certain it will need some refinement that we will discover during development.

Also, are there any project community calls that I can be a part of to know more & discuss this further?

No, not really at this point.


2 participants