Rework high-level format variants #38

Closed
wkalt opened this issue Jan 21, 2022 · 2 comments
Labels
feature New feature or request

Comments

@wkalt
Contributor

wkalt commented Jan 21, 2022

The specification currently makes a division between "chunked" and "unchunked" files, with each having a mandatory set of fields. Discussions have leaned in the direction of this being too restrictive on at least a couple fronts:

  • Users may want the compression benefits of chunking, but not want the cost of retaining channel info records in RAM for the statistics or chunk index records.
  • Users of the unchunked format may not want the cost of retaining channel info records in RAM for the statistics record. That's part of what they are trying to avoid by using the unchunked variant to begin with.

In consideration of these, we are considering making the following changes:

  • Chunked and unchunked files are eliminated as terms. There will be just one "mcap file".
  • Chunks and messages may both appear at the top level of the file.
  • Chunk indexes, attachment indexes, statistics, and channel infos in the index data section are optional, but subject to some mutual constraints:
      • if chunk indexes are included, any channels referenced by those chunk indexes must have channel infos in the index data section
      • if the channel_stats field of the statistics record is included, any channels it references must be reflected in the index data section as channel infos
      • if there are no records in the index data section, the index_offset of the footer record will be set to zero. Otherwise it will point to the first record in the section, regardless of what kind of record that is.
      • the channel_stats field of the statistics record may be zero-length/empty. This is to allow tracking of cheap global file stats without the expense of retaining the channel infos.
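The constraints above could be checked mechanically. The following is only a sketch under stated assumptions: `IndexSection`, its fields, and `validate` are hypothetical names invented for illustration, and the model tracks channel IDs only rather than the real record layout.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class IndexSection:
    """Hypothetical in-memory view of the index data section (not the real layout)."""
    channel_info_ids: Set[int] = field(default_factory=set)
    # Each chunk index is modeled only by the set of channel IDs it references.
    chunk_index_channels: List[Set[int]] = field(default_factory=list)
    # None: no statistics record. {}: statistics record with empty channel_stats.
    channel_stats: Optional[Dict[int, int]] = None
    attachment_index_count: int = 0
    # Byte offset of the first record in the section (ignored when the section is empty).
    first_record_offset: int = 0


def is_empty(section: IndexSection) -> bool:
    """True when the section contains no records of any kind."""
    return (
        not section.channel_info_ids
        and not section.chunk_index_channels
        and section.channel_stats is None
        and section.attachment_index_count == 0
    )


def validate(section: IndexSection, footer_index_offset: int) -> List[str]:
    """Return a list of constraint violations; an empty list means the section is consistent."""
    errors: List[str] = []
    # Channels referenced by chunk indexes must have channel infos in the section.
    for i, channels in enumerate(section.chunk_index_channels):
        missing = channels - section.channel_info_ids
        if missing:
            errors.append(f"chunk index {i} references channels without channel infos: {sorted(missing)}")
    # Channels referenced by channel_stats must have channel infos in the section.
    if section.channel_stats:
        missing = set(section.channel_stats) - section.channel_info_ids
        if missing:
            errors.append(f"channel_stats references channels without channel infos: {sorted(missing)}")
    # index_offset is zero for an empty section, else the offset of its first record.
    if is_empty(section):
        if footer_index_offset != 0:
            errors.append("index section is empty but index_offset is non-zero")
    elif footer_index_offset != section.first_record_offset:
        errors.append("index_offset does not point to the first record of the index section")
    return errors
```

Note that a statistics record with an empty channel_stats passes without any channel infos, which is the "cheap global file stats" case described above.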

Messages written outside chunks will be readable by a sequential reader, but invisible to a random access reader using the chunk index.
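To make the visibility difference concrete, here is a toy model, not the actual file format: `Message`, `Chunk`, and both reader functions are hypothetical, and the chunk index is represented as a list of positions in the top-level record list rather than byte offsets.

```python
from dataclasses import dataclass
from typing import Iterator, List, Union


@dataclass
class Message:
    channel_id: int
    data: bytes


@dataclass
class Chunk:
    messages: List[Message]


TopLevelRecord = Union[Message, Chunk]


def read_sequential(records: List[TopLevelRecord]) -> Iterator[Message]:
    """Walk every top-level record in order, descending into chunks."""
    for record in records:
        if isinstance(record, Chunk):
            yield from record.messages
        else:
            yield record


def read_via_chunk_index(records: List[TopLevelRecord], chunk_index: List[int]) -> Iterator[Message]:
    """Visit only the records the chunk index points at (positions stand in for offsets)."""
    for position in chunk_index:
        record = records[position]
        if not isinstance(record, Chunk):
            raise ValueError(f"chunk index entry {position} does not point at a chunk")
        yield from record.messages


# One message outside any chunk, one chunk with two messages, one more loose message.
records: List[TopLevelRecord] = [
    Message(1, b"a"),
    Chunk([Message(1, b"b"), Message(2, b"c")]),
    Message(2, b"d"),
]
all_messages = list(read_sequential(records))
indexed_messages = list(read_via_chunk_index(records, chunk_index=[1]))
```

In this model the sequential reader yields all four messages, while the index-driven reader only ever sees the two inside the chunk.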

Writers that omit data from the index data section will progressively lose the benefit of the "fast summarization support". The algorithm for "summary" is roughly:

  • seek to the index_offset
  • read to the end of the file
  • report aggregated statistics
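The three steps above might look like the following. This is a minimal sketch with several assumptions: records are framed as a 1-byte opcode followed by an 8-byte little-endian length and the payload, the stream ends where the index data section ends, the "statistics" are simply per-opcode record counts, and `summarize` is a made-up name.

```python
import io
import struct
from collections import Counter
from typing import BinaryIO, Optional

# Assumed framing for this sketch: 1-byte opcode, 8-byte little-endian payload length.
HEADER = struct.Struct("<BQ")


def summarize(stream: BinaryIO, index_offset: int) -> Optional[Counter]:
    """Count index-section records per opcode, or return None if there is no index section."""
    if index_offset == 0:
        # Empty index data section: nothing to aggregate, and no fallback to a full read.
        return None
    stream.seek(index_offset)  # step 1: seek to the index_offset
    counts: Counter = Counter()
    while True:  # step 2: read to the end of the file
        header = stream.read(HEADER.size)
        if len(header) < HEADER.size:
            break
        opcode, length = HEADER.unpack(header)
        stream.seek(length, io.SEEK_CUR)  # skip the payload; only the opcode is tallied
        counts[opcode] += 1
    return counts  # step 3: report aggregated statistics


# Usage: 10 bytes of unrelated data, then three records with placeholder opcodes.
data_section = b"\x00" * 10
index_section = (
    HEADER.pack(0x10, 3) + b"abc"
    + HEADER.pack(0x11, 0)
    + HEADER.pack(0x10, 2) + b"xy"
)
summary = summarize(io.BytesIO(data_section + index_section), index_offset=len(data_section))
```

Only the bytes from index_offset onward are read, which is what keeps the operation cheap on remote files.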

If the index data section is empty, no statistics will be aggregated. Falling back to a full file read is inadvisable, since it would undermine good support for remote files. The explanatory notes section should be updated to discuss this a little bit.

@wkalt wkalt added the feature New feature or request label Jan 21, 2022
@defunctzombie
Contributor

There are two types of files - those with a 0 in the index_offset value for the footer and those with a non-zero value.

@defunctzombie
Contributor

> Messages written outside chunks will be readable by a sequential reader, but invisible to a random access reader using the chunk index.

If you are using chunk indices then messages must not appear outside of chunks.
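A writer or validator could enforce this stricter rule with a check along these lines. This is a sketch only: `top_level_layout_ok` and the string record kinds are hypothetical and not part of any existing library.

```python
from typing import List


def top_level_layout_ok(top_level_kinds: List[str], has_chunk_indexes: bool) -> bool:
    """Return False if chunk indexes are in use and a message sits outside a chunk."""
    if not has_chunk_indexes:
        # Without chunk indexes, chunks and loose messages may be mixed freely.
        return True
    return "message" not in top_level_kinds
```

For example, `top_level_layout_ok(["chunk", "message"], has_chunk_indexes=True)` is rejected, while the same layout with `has_chunk_indexes=False` is accepted.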
