Prototype version of Frame I/O #287

Merged
merged 54 commits into from
Sep 16, 2022
Commits
b242087
Add the basic structure for the Frame class
tmadlener Mar 29, 2022
3e974ff
Add basic support for GenericParameters through Frame
tmadlener Mar 30, 2022
e27eca3
Make Frame a move-only type
tmadlener Mar 30, 2022
98dfddb
Restrict get/put interfaces to only accept collections
tmadlener Mar 30, 2022
dec9f30
Add first version of thread safety and tests
tmadlener Apr 1, 2022
2b60dcf
Fix non-movable mutex
tmadlener Apr 4, 2022
504433c
Fix clang-tidy warning
tmadlener Apr 6, 2022
e13ca81
Add test case for multithreaded parameter insertion
tmadlener Apr 7, 2022
b86a31d
Add mutexes guarding the internal maps for GenericParameters
tmadlener Apr 7, 2022
37c7151
Reorganize tests slightly to avoid Catch2 thread problem
tmadlener Apr 8, 2022
20c0ff0
Make earlier tests also use the name helper function
tmadlener Apr 8, 2022
d89c9dc
Capture thread index instead of passing it as an argument
tmadlener Apr 8, 2022
b6a9af9
Make it possible to put untyped CollectionBase into Frame
tmadlener May 30, 2022
152eda1
Update documentation
tmadlener Jun 15, 2022
f156655
Work around Catch2 assertion thread safety issues
tmadlener Jun 21, 2022
f689a11
Add constructors from buffers / CollectionData
tmadlener Apr 11, 2022
9cf719c
Make it possible to create empty CollectionBuffers
tmadlener Apr 11, 2022
e753a65
Make collections constructible from buffers
tmadlener Apr 15, 2022
380acfc
Add RawDataT constructor to Frame
tmadlener Apr 15, 2022
a139876
Add ROOTFrameReader that can read "old" files
tmadlener Apr 11, 2022
ab92315
Make read tests work with Frame as well
tmadlener Apr 21, 2022
6e9b2bc
Add reading of GenericParameters (aka EventMetaData)
tmadlener Apr 21, 2022
089a615
Split buffers into read / write buffers
tmadlener Apr 29, 2022
c2788b6
Make SIOFrameReader that can read "old" files
tmadlener May 5, 2022
4674da8
clang-tidy: Move constructor arguments into members
tmadlener May 9, 2022
b7c1d17
Avoid unnecessary copy
tmadlener May 9, 2022
07c3751
Add ROOTFrameWriter to write frames via ROOT
tmadlener May 12, 2022
771829e
Make sure to check the size of the user data vectors
tmadlener May 23, 2022
3012fdf
Make it possible to write different categories with ROOT
tmadlener May 30, 2022
a873165
Add writeFrame that writes the complete frame by default
tmadlener May 30, 2022
7af94b2
Add namespace read/write tests to Frame I/O
tmadlener May 30, 2022
4336d08
Move test code into header and template it
tmadlener May 31, 2022
c7ca452
Implement SIOFrameWriter and tests
tmadlener May 31, 2022
98aad36
Adapt the SIOFrameReader to read files written by the SIOFrameWriter
tmadlener Jun 7, 2022
bbd5392
Rename interface function to reflect Frame
tmadlener Jun 2, 2022
d71c221
Template read_frame function and fix ROOTFrameReader bug
tmadlener Jun 2, 2022
c2a6614
Fix test dependency
tmadlener Jun 8, 2022
e799504
Fix typo (although it is a curious one)
tmadlener Jun 9, 2022
56dd0ec
Remove no longer necessary const_cast
tmadlener Jun 9, 2022
fe56176
Remove empty FrameModel constructor
tmadlener Jun 10, 2022
edb1ab0
Mark SIOFrameReader as non-copyable
tmadlener Jun 10, 2022
553274d
Make reading frames well behaved for non-existing data
tmadlener Jun 23, 2022
9f5e10a
Move CollectionInfo type alias into detail namespace
tmadlener Jun 23, 2022
ac2969a
Rename function to make it less confusing and document it
tmadlener Jun 23, 2022
78c2909
Rename template parameter to be less confusing
tmadlener Jun 23, 2022
1b3d08f
Remove "category" parameter from public interfaces
tmadlener Jun 23, 2022
db51457
Rename RawData classes to FrameData classes for less confusion
tmadlener Jun 23, 2022
1493d7b
Reintroduce const specifier
tmadlener Jun 23, 2022
33649b8
Fix typos and remove commented code
tmadlener Jun 23, 2022
177cef5
Fix clang-tidy
tmadlener Jun 23, 2022
45fd22b
Remove no longer available collection
tmadlener Aug 2, 2022
0d41abc
Fix test environment for Ubuntu CI
tmadlener Aug 2, 2022
3d69484
Add Frame design and philosophy documentation
tmadlener Jun 24, 2022
e2b89fb
Make sure all sources are present in targets
tmadlener Sep 16, 2022
157 changes: 157 additions & 0 deletions doc/frame.md
@@ -0,0 +1,157 @@
# The `Frame` concept
The `podio::Frame` is a general data container for collection data of podio generated EDMs.
Additionally, it offers the functionality to store some (limited) data outside of an EDM.
The basic idea of the `Frame` is to give users of podio the possibility to organize EDM data into logical units and to potentially build a hierarchy of different `Frame`s.
Common examples would be the organisation of data into *Events* and *Runs*.
However, it is important to note that podio really does not impose any meaning on any `Frame`; each `Frame` is essentially defined by its contents.

## Basic functionality of a `Frame`
The main functionality of a `Frame` is to store and aggregate EDM collection data. It also offers the possibility to store some generic data alongside.
To ensure thread safety and const-correctness, a `Frame` takes ownership of any data that is put into it and only gives read access to immutable data.
This is mandated by the interface for collection data (simplified here for better readability):
```cpp
struct Frame {
template<typename CollT>
const CollT& put(CollT&& coll, const std::string& name);

void put(std::unique_ptr<podio::CollectionBase> coll, const std::string& name);

template<typename CollT>
const CollT& get(const std::string& name) const;

template<typename T>
void putParameter(const std::string& name, T value);

template<typename T>
const T& getParameter(const std::string& name) const;
};
```
In this case there are two ways to get collection data into the `Frame`:
1. By passing a concrete collection (of type `CollT`) into the `Frame` as an [`rvalue`](https://en.cppreference.com/w/cpp/language/value_category). There are two ways to achieve this, either by passing the return value of a function directly into `Frame::put` or by explicitly moving it in the call via `std::move` if you are using a named variable.
2. By passing a `std::unique_ptr` to a collection. Similar to the first case, this can either be the return value of a function call, or has to be done via `std::move` (as mandated by the `std::unique_ptr` interface).

In both cases, if a named variable is passed in, the caller is left with a *moved-from object*, which is in a *valid but unspecified* state and should not be used afterwards.
Some compilers and static code analysis tools are able to detect the accidental usage of *moved-from* objects.

For parameters the basic principle is very similar. The major difference is that `getParameter` returns *trivial* types by value.
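
One way to picture the by-value versus by-reference distinction is a small type trait that selects the return type. This is only a sketch: `ParamReturnT` and `ToyParams` are hypothetical names for illustration and not part of podio.

```cpp
#include <map>
#include <string>
#include <type_traits>
#include <utility>

// Hypothetical helper: trivial types are returned by value, everything else
// by const reference
template <typename T>
using ParamReturnT = std::conditional_t<std::is_trivial_v<T>, T, const T&>;

// Toy parameter store for one value type (not the podio implementation)
template <typename T>
class ToyParams {
public:
  void put(const std::string& name, T value) {
    m_values[name] = std::move(value);
  }

  // NOTE: assumes that the key exists, std::map::at throws otherwise
  ParamReturnT<T> get(const std::string& name) const {
    return m_values.at(name);
  }

private:
  std::map<std::string, T> m_values{};
};
```

With this trait `ToyParams<int>::get` returns an `int`, while `ToyParams<std::string>::get` returns a `const std::string&`.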

For all use cases there is some `enable_if` machinery in place to ensure that only valid collections and valid parameter types can actually be used.
These checks also make sure that it is impossible to put in collections without handing over ownership to the `Frame`.
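
The following is a minimal sketch of how such a check could look, assuming toy stand-ins (`ToyCollectionBase`, `ToyCollection`, `ToyFrame`) instead of the real podio types. The actual conditions used in podio may differ. The key point is that `CollT` is deduced as a reference type for named variables, so rejecting lvalue references forces callers to hand over ownership.

```cpp
#include <memory>
#include <string>
#include <type_traits>
#include <unordered_map>
#include <utility>

// Toy stand-ins for podio::CollectionBase and a generated collection
struct ToyCollectionBase {
  virtual ~ToyCollectionBase() = default;
};

struct ToyCollection : ToyCollectionBase {
  int size{0};
};

// Only enabled for types deriving from ToyCollectionBase that are passed as
// rvalues (for lvalues CollT is deduced as a reference type)
template <typename CollT>
using EnableIfRValueCollection =
    std::enable_if_t<!std::is_lvalue_reference_v<CollT> && std::is_base_of_v<ToyCollectionBase, CollT>>;

class ToyFrame {
public:
  template <typename CollT, typename = EnableIfRValueCollection<CollT>>
  const CollT& put(CollT&& coll, const std::string& name) {
    auto owned = std::make_unique<CollT>(std::move(coll));
    const auto* raw = owned.get();
    m_collections[name] = std::move(owned); // the frame now owns the data
    return *raw;
  }

private:
  std::unordered_map<std::string, std::unique_ptr<ToyCollectionBase>> m_collections{};
};

// Detection helper to check at compile time whether put accepts an argument
template <typename Arg, typename = void>
struct CanPut : std::false_type {};

template <typename Arg>
struct CanPut<Arg, std::void_t<decltype(std::declval<ToyFrame&>().put(std::declval<Arg>(), std::string{}))>>
    : std::true_type {};
```

Here `CanPut<ToyCollection>::value` is `true`, whereas `CanPut<ToyCollection&>::value` (a named variable that is not moved) and `CanPut<int>::value` are `false`.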

### Usage examples for collection data
These are a few very basic usage examples that highlight the main functionality (and potential pitfalls).

#### Putting collection data into the `Frame`
In all of the following examples, the following basic setup is assumed:
```cpp
#include "podio/Frame.h"

#include "edm4hep/MCParticleCollection.h" // just to have a concrete example

// create an empty Frame
auto frame = podio::Frame();
```

Assuming there is a function that creates an `MCParticleCollection`, putting the return value into the `Frame` is very simple:
```cpp
edm4hep::MCParticleCollection createMCParticles(); // implemented somewhere else

// put the return value of a function into the Frame
frame.put(createMCParticles(), "particles");

// put the return value into the Frame and keep the returned const reference
auto& particles = frame.put(createMCParticles(), "moreParticles");
```

When working with named variables, it is necessary to use `std::move` to put collections into the `Frame`.
The call to `put` will fail to compile if a named variable is not moved.
Assuming the same `createMCParticles` function as above, this looks like the following:

```cpp
auto coll = createMCParticles();
// potentially still modify the collection

// Need to use std::move now that the collection has a name
frame.put(std::move(coll), "particles");

// Keeping a const reference is also possible
// NOTE: We are explicitly using a new variable name here
auto coll2 = createMCParticles();
auto& particles = frame.put(std::move(coll2), "MCParticles");
```
At this point only `particles` is in a valid and **defined** state.

#### Getting collection (references) from the `Frame`
Obtaining immutable (`const`) references to collections stored in the `Frame` is trivial.
Here we are assuming that the collections are actually present in the `Frame`.
```cpp
auto& particles = frame.get<edm4hep::MCParticleCollection>("particles");
```

### Usage for Parameters
Parameters use the `podio::GenericParameters` class behind the scenes.
Hence, the types that can be used are `int`, `float`, and `std::string`, as well as `std::vector`s of those.
For better usability, some overloads of `putParameter` allow for an *in-place* construction, for example:
```cpp
// Passing in a const char* for a std::string
frame.putParameter("aString", "a string value");

// Creating a vector of ints on the fly
frame.putParameter("ints", {1, 2, 3, 4});
```
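
A braced list like `{1, 2, 3, 4}` cannot be deduced as a plain template parameter `T`, which is why dedicated overloads are useful. The following sketch shows one possible way to provide them. `ToyParameterStore` and its `std::variant` based storage are assumptions made for this example and do not reflect how `podio::GenericParameters` stores its data.

```cpp
#include <initializer_list>
#include <map>
#include <string>
#include <utility>
#include <variant>
#include <vector>

// Toy parameter store (hypothetical, for illustration only)
class ToyParameterStore {
public:
  using Value =
      std::variant<int, float, std::string, std::vector<int>, std::vector<float>, std::vector<std::string>>;

  // Generic version taking the value by value
  template <typename T>
  void putParameter(const std::string& name, T value) {
    m_params.insert_or_assign(name, Value(std::move(value)));
  }

  // Overload for string literals: store them as std::string
  void putParameter(const std::string& name, const char* value) {
    putParameter<std::string>(name, value);
  }

  // Overload for braced lists: store them as std::vector<T>
  template <typename T>
  void putParameter(const std::string& name, std::initializer_list<T> values) {
    putParameter<std::vector<T>>(name, values);
  }

  // NOTE: throws if the name is unknown or the stored type is not T
  template <typename T>
  const T& getParameter(const std::string& name) const {
    return std::get<T>(m_params.at(name));
  }

private:
  std::map<std::string, Value> m_params{};
};
```

With these overloads the string literal ends up as a `std::string` and the braced list as a `std::vector<int>`.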

## I/O basics and philosophy
podio offers all the necessary functionality to read and write `Frame`s.
However, it is not in the scope of podio to organize them into a hierarchy, nor
to maintain such a hierarchy. When writing data to file, `Frame`s are written in
the order they are passed to the writer. For reading them back, podio offers
random access to stored `Frame`s, which should make it possible to restore any
hierarchy again. The writers and readers of podio are meant to be run on and
accessed by a single thread only.

### Writing a `Frame`
To write a `Frame`, a writer asks it for the `CollectionWriteBuffers` of each collection that should be written.
In these buffers the underlying data is still owned by the collection, and by extension the `Frame`.
This makes it possible to write the same collection with several different writers.
Different writers can access the same `Frame` from different threads, even though each individual writer is assumed to run on only one thread.
For writing the `GenericParameters` that are stored in the `Frame` and for other necessary data, similar access functionality is offered by the `Frame`.
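
The non-owning nature of the write buffers can be sketched as follows. All names (`ToyWriteBuffers`, `ToyCollection`, `ToyWriter`) are hypothetical and strongly simplified; the real `CollectionWriteBuffers` hold type-erased pointers instead of a typed one.

```cpp
#include <utility>
#include <vector>

// Non-owning view of the collection data, loosely modeled after the idea of
// write buffers
struct ToyWriteBuffers {
  const std::vector<int>* data{nullptr};
};

// Toy collection that keeps ownership of its data
class ToyCollection {
public:
  explicit ToyCollection(std::vector<int> data) : m_data(std::move(data)) {}

  ToyWriteBuffers getBuffers() const {
    return ToyWriteBuffers{&m_data};
  }

private:
  std::vector<int> m_data;
};

// Toy writer that copies the buffer contents into its own "file"
struct ToyWriter {
  std::vector<int> file{};

  void write(const ToyWriteBuffers& buffers) {
    file.insert(file.end(), buffers.data->begin(), buffers.data->end());
  }
};
```

Because `getBuffers` only hands out a pointer, two `ToyWriter` instances can write the same `ToyCollection` one after the other and the collection remains intact.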

### Reading a `Frame`
When reading a `Frame`, readers do not have to return a complete `Frame`.
Instead, they return a more or less arbitrary type of `FrameData` that simply has to provide the following public interface:
```cpp
struct FrameData {
/// Get a copy of the internal collection id table
podio::CollectionIDTable getIDTable() const;

/// Get the buffers to construct the collection with the given name
std::optional<podio::CollectionReadBuffers> getCollectionBuffers(const std::string& name);

/// Get the still available, i.e. not yet unpacked, collections from the raw data
std::vector<std::string> getAvailableCollections() const;

/// Get the parameters that are stored in the raw data
std::unique_ptr<podio::GenericParameters> getParameters();
};
```
A `Frame` is constructed from a `std::unique_ptr` to such a `FrameData` and handles everything from there.
Note that the `FrameData` type of an I/O backend is a free-standing type that does not need to inherit from a common base, because the `Frame` constructor is templated on it.
Separating reading data from file from constructing a `Frame` later has several advantages:
- Since podio assumes that reading is done single threaded, the amount of time that is actually spent in a reader is minimized, as only the file operations need to be done on a single thread. All further processing (potential decompression, unpacking, etc.) can be done on a different thread where the `Frame` is actually constructed.
- It gives different backends the necessary freedom to exploit different optimization strategies and does not force them to conform to an implementation that is potentially detrimental to performance.
- It also makes it possible to pass around data from which a `Frame` can be constructed without having to actually construct one.
- Readers do not have to know how to construct collections from the buffers, as they are only required to provide the buffers themselves.
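
To illustrate the split between reading and unpacking, the sketch below uses a toy frame data type without any base class and a function template that unpacks it later. `ToyReadBuffers`, `ToyFrameData` and `unpackAll` are hypothetical names, and only a subset of the interface above is modeled.

```cpp
#include <map>
#include <memory>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Toy stand-in for podio::CollectionReadBuffers
struct ToyReadBuffers {
  std::vector<int> data{};
};

// Toy backend specific frame data without any base class
class ToyFrameData {
public:
  explicit ToyFrameData(std::map<std::string, std::vector<int>> raw) : m_raw(std::move(raw)) {}

  // Hand out the buffers of a collection (at most once)
  std::optional<ToyReadBuffers> getCollectionBuffers(const std::string& name) {
    const auto it = m_raw.find(name);
    if (it == m_raw.end()) {
      return std::nullopt;
    }
    ToyReadBuffers buffers{std::move(it->second)};
    m_raw.erase(it);
    return buffers;
  }

  // Names of the collections that have not been unpacked yet
  std::vector<std::string> getAvailableCollections() const {
    std::vector<std::string> names;
    for (const auto& entry : m_raw) {
      names.push_back(entry.first);
    }
    return names;
  }

private:
  std::map<std::string, std::vector<int>> m_raw;
};

// Works with any type providing the two functions used here. This step does
// not need the reader and could run on a different thread.
template <typename FrameDataT>
std::map<std::string, std::vector<int>> unpackAll(std::unique_ptr<FrameDataT> frameData) {
  std::map<std::string, std::vector<int>> collections;
  for (const auto& name : frameData->getAvailableCollections()) {
    if (auto buffers = frameData->getCollectionBuffers(name)) {
      collections.emplace(name, std::move(buffers->data));
    }
  }
  return collections;
}
```

A toy reader would only have to produce a `std::unique_ptr<ToyFrameData>`; everything after that is independent of the I/O backend.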

### Schema evolution
Schema evolution happens on the `CollectionReadBuffers` when they are requested from the `FrameData` inside the `Frame`.
It is possible for the I/O backend to handle schema evolution before the `Frame` sees the buffers for the first time.
In that case, podio schema evolution reduces to a simple check.

# Frame implementation and design
One of the main concerns of the `Frame` is to offer one common, non-templated interface while still supporting different I/O backends and potentially different *policies*.
The "classic" approach would be to have an abstract `IFrame` interface with several implementations that offer the desired functionality (and their small differences).
One problem with that approach is that a purely abstract interface cannot have templated member functions. Hence, the desired type-safe behavior of `get` and `put` would be very hard to implement.
Additionally, policies ideally affect orthogonal aspects of the `Frame` behavior.
Implementing all possible combinations of behaviors through implementations of an abstract interface would lead to quite a bit of code duplication and cannot take advantage of the factorization of the problem.
To solve these problems, we chose to implement the `Frame` via [*Type Erasure*](https://en.wikibooks.org/wiki/More_C%2B%2B_Idioms/Type_Erasure).
This has the additional advantage that the `Frame` has *value semantics*, in line with the design of podio.
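
The general shape of the type erasure idiom can be sketched as follows. This is not the podio implementation: `ToyFrame`, its nested `Concept` and `Model` types, the two toy backends and the use of `std::any` are assumptions chosen to keep the example short.

```cpp
#include <any>
#include <map>
#include <memory>
#include <string>
#include <utility>

class ToyFrame {
  // Non-templated internal interface
  struct Concept {
    virtual ~Concept() = default;
    virtual const std::any* find(const std::string& name) const = 0;
  };

  // Templated implementation wrapping an arbitrary frame data type
  template <typename FrameDataT>
  struct Model final : Concept {
    explicit Model(std::unique_ptr<FrameDataT> data) : m_data(std::move(data)) {}

    const std::any* find(const std::string& name) const override {
      return m_data->find(name);
    }

    std::unique_ptr<FrameDataT> m_data;
  };

  std::unique_ptr<Concept> m_self;

public:
  // Only the constructor is templated, the ToyFrame type itself is not
  template <typename FrameDataT>
  explicit ToyFrame(std::unique_ptr<FrameDataT> data) :
      m_self(std::make_unique<Model<FrameDataT>>(std::move(data))) {}

  // Templated accessor on the outer class; returns nullptr if the name is
  // unknown or the stored type is not T
  template <typename T>
  const T* get(const std::string& name) const {
    return std::any_cast<T>(m_self->find(name));
  }
};

// Two unrelated toy backends without a common base class
struct ToyMapData {
  std::map<std::string, std::any> values{};

  const std::any* find(const std::string& name) const {
    const auto it = values.find(name);
    return it != values.end() ? &it->second : nullptr;
  }
};

struct ToySingleValueData {
  std::string name{};
  std::any value{};

  const std::any* find(const std::string& key) const {
    return key == name ? &value : nullptr;
  }
};
```

Both backends end up behind the same `ToyFrame` type, which can be moved around like any other value, while the templated `get` lives on the outer class and never has to be virtual.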
5 changes: 4 additions & 1 deletion include/podio/CollectionBase.h
@@ -44,7 +44,10 @@ class CollectionBase {
virtual unsigned getID() const = 0;

/// Get the collection buffers for this collection
virtual podio::CollectionBuffers getBuffers() = 0;
virtual podio::CollectionWriteBuffers getBuffers() = 0;

/// Create (empty) collection buffers from which a collection can be constructed
virtual podio::CollectionReadBuffers createBuffers() /*const*/ = 0;

/// check for validity of the container after read
virtual bool isValid() const = 0;
72 changes: 70 additions & 2 deletions include/podio/CollectionBuffers.h
@@ -3,13 +3,16 @@

#include "podio/ObjectID.h"

#include <functional>
#include <memory>
#include <string>
#include <utility>
#include <vector>

namespace podio {

class CollectionBase;

template <typename T>
using UVecPtr = std::unique_ptr<std::vector<T>>;

@@ -20,16 +23,81 @@ using VectorMembersInfo = std::vector<std::pair<std::string, void*>>;
* Simple helper struct that bundles all the potentially necessary buffers that
* are necessary to represent a collection for I/O purposes.
*/
struct CollectionBuffers {
struct CollectionWriteBuffers {
void* data{nullptr};
CollRefCollection* references{nullptr};
VectorMembersInfo* vectorMembers{nullptr};

template <typename DataT>
std::vector<DataT>* dataAsVector() {
return asVector<DataT>(data);
}

template <typename T>
static std::vector<T>* asVector(void* raw) {
// Are we at a beach? I can almost smell the C...
return *static_cast<std::vector<DataT>**>(data);
return *static_cast<std::vector<T>**>(raw);
}
};

struct CollectionReadBuffers {
void* data{nullptr};
CollRefCollection* references{nullptr};
VectorMembersInfo* vectorMembers{nullptr};

using CreateFuncT = std::function<std::unique_ptr<podio::CollectionBase>(podio::CollectionReadBuffers, bool)>;
using RecastFuncT = std::function<void(CollectionReadBuffers&)>;

CollectionReadBuffers(void* d, CollRefCollection* ref, VectorMembersInfo* vec, CreateFuncT&& createFunc,
RecastFuncT&& recastFunc) :
data(d),
references(ref),
vectorMembers(vec),
createCollection(std::move(createFunc)),
recast(std::move(recastFunc)) {
}

CollectionReadBuffers() = default;
CollectionReadBuffers(const CollectionReadBuffers&) = default;
CollectionReadBuffers& operator=(const CollectionReadBuffers&) = default;

CollectionReadBuffers(CollectionWriteBuffers buffers) :
data(buffers.data), references(buffers.references), vectorMembers(buffers.vectorMembers) {
}

template <typename DataT>
std::vector<DataT>* dataAsVector() {
return asVector<DataT>(data);
}

template <typename T>
static std::vector<T>* asVector(void* raw) {
// Are we at a beach? I can almost smell the C...
return static_cast<std::vector<T>*>(raw);
}

CreateFuncT createCollection{};

// This is a hacky workaround for the ROOT backend at the moment. There is
// probably a better solution, but I haven't found it yet. The problem is the
// following:
//
// When creating a pointer to a vector<T>, either via new or via
// TClass::New(), we get a void*, that can be cast back to a vector with
//
// static_cast<vector<T>*>(raw);
//
// However, as soon as we pass that same void* to TBranch::SetAddress this no
// longer works and the actual cast has to be
//
// *static_cast<vector<T>**>(raw);
//
// To make it possible to always use the first form, after we leave the Root
// parts of reading, this function is populated in the createBuffers call of each
// datatype where we have the necessary type information (from code
// generation) to do the second cast and assign the result of that to the data
// field again.
RecastFuncT recast{};
};

} // namespace podio
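
The two casts described in the comment on `recast` differ only in one level of indirection. The following standalone sketch shows that difference without involving ROOT; the function names are hypothetical and the "recast" step here is a simplified stand-in for what the generated code would do.

```cpp
#include <vector>

// Case 1: the void* holds the address of the vector itself
template <typename T>
std::vector<T>* asVectorDirect(void* raw) {
  return static_cast<std::vector<T>*>(raw);
}

// Case 2: the void* holds the address of a pointer to the vector
template <typename T>
std::vector<T>* asVectorIndirect(void* raw) {
  return *static_cast<std::vector<T>**>(raw);
}

// Sketch of a recast step: replace the pointer-to-pointer by the plain
// pointer, so that the direct cast can be used from then on
template <typename T>
void recast(void*& data) {
  data = asVectorIndirect<T>(data);
}
```

After `recast` has been applied once, all later code can use the direct form and no longer needs to know which of the two cases it started from.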
17 changes: 16 additions & 1 deletion include/podio/CollectionIDTable.h
@@ -1,6 +1,7 @@
#ifndef PODIO_COLLECTIONIDTABLE_H
#define PODIO_COLLECTIONIDTABLE_H

#include <memory>
#include <mutex>
#include <string>
#include <vector>
@@ -12,11 +13,20 @@ class CollectionIDTable {
public:
/// default constructor
CollectionIDTable() = default;
~CollectionIDTable() = default;

CollectionIDTable(const CollectionIDTable&) = delete;
CollectionIDTable& operator=(const CollectionIDTable&) = delete;
CollectionIDTable(CollectionIDTable&&) = default;
CollectionIDTable& operator=(CollectionIDTable&&) = default;

/// constructor from existing ID:name mapping
CollectionIDTable(std::vector<int>&& ids, std::vector<std::string>&& names) :
m_collectionIDs(std::move(ids)), m_names(std::move(names)){};

CollectionIDTable(const std::vector<int>& ids, const std::vector<std::string>& names) :
m_collectionIDs(ids), m_names(names){};

/// return collection ID for given name
int collectionID(const std::string& name) const;

@@ -43,10 +53,15 @@ class CollectionIDTable {
/// Prints collection information
void print() const;

/// Does this table hold any information?
bool empty() const {
return m_names.empty();
}

private:
std::vector<int> m_collectionIDs{};
std::vector<std::string> m_names{};
mutable std::mutex m_mutex{};
mutable std::unique_ptr<std::mutex> m_mutex{std::make_unique<std::mutex>()};
};

} // namespace podio
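
The switch from a plain `std::mutex` member to a `std::unique_ptr<std::mutex>` is what makes the defaulted move operations usable, since `std::mutex` itself can be neither copied nor moved. A minimal sketch with two hypothetical toy structs shows the effect:

```cpp
#include <cstddef>
#include <memory>
#include <mutex>
#include <string>
#include <type_traits>
#include <utility>
#include <vector>

// A std::mutex member makes the class neither copyable nor movable
struct TableWithMutex {
  std::vector<std::string> names{};
  mutable std::mutex mutex{};
};

// Holding the mutex via std::unique_ptr restores movability
struct TableWithMutexPtr {
  std::vector<std::string> names{};
  mutable std::unique_ptr<std::mutex> mutex{std::make_unique<std::mutex>()};

  // NOTE: must not be called on a moved-from object, because its mutex
  // pointer is null after the move
  std::size_t size() const {
    std::lock_guard<std::mutex> lock{*mutex};
    return names.size();
  }
};

static_assert(!std::is_move_constructible_v<TableWithMutex>);
static_assert(std::is_move_constructible_v<TableWithMutexPtr>);
static_assert(!std::is_copy_constructible_v<TableWithMutexPtr>);
```

The price of this approach in the sketch is one extra heap allocation per object and a moved-from object that no longer has a mutex.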