CLI: Ability to write in columnar order #686

Closed
amacneil opened this issue Oct 28, 2022 · 5 comments
Labels: cli, feature

Comments

@amacneil
Contributor

Ability to create MCAP files (probably using the mcap merge CLI) in columnar (i.e. channel-oriented) order.

e.g.

  • no more than one channel per chunk
  • fully write each channel before writing the next channel

This would allow more efficient remote access of multi-channel files (readers can easily fetch an entire topic without scanning the whole file).
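As a rough illustration of the two rules above, here is a minimal Python sketch of the ordering step. It does not use any MCAP library: the `Message` tuple, the `columnar_chunks` name, and the byte-based `chunk_size` budget are all assumptions made for the example, and it buffers every message in memory, which a real tool would probably need to avoid.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple


class Message(NamedTuple):
    # Hypothetical stand-in for an MCAP message; not a library type.
    channel_id: int
    log_time: int  # nanoseconds
    data: bytes


def columnar_chunks(messages: Iterable[Message], chunk_size: int) -> list[list[Message]]:
    """Group messages so each chunk holds one channel, and each channel is
    written out completely before the next channel starts."""
    # Bucket messages per channel, preserving their original relative order.
    by_channel: dict[int, list[Message]] = defaultdict(list)
    for msg in messages:
        by_channel[msg.channel_id].append(msg)

    chunks: list[list[Message]] = []
    for channel_id in sorted(by_channel):
        current: list[Message] = []
        current_bytes = 0
        for msg in by_channel[channel_id]:
            # Start a new chunk once the size budget would be exceeded.
            if current and current_bytes + len(msg.data) > chunk_size:
                chunks.append(current)
                current, current_bytes = [], 0
            current.append(msg)
            current_bytes += len(msg.data)
        if current:
            chunks.append(current)
    return chunks
```

With this layout, all chunks for a given channel are contiguous, so a remote reader could fetch one topic with a single byte-range request.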

amacneil added the feature label Oct 28, 2022
@james-rms
Collaborator

james-rms commented Oct 28, 2022

Preliminary thoughts:

  • MCAPs laid out in column order are difficult for the standard studio use case of playing back several topics in time order. A reader may need to decompress an entire chunk for each topic it needs before playing any messages, which could result in loading the entire uncompressed message content into memory.
  • This can be mitigated if the reader knows, before decompressing, that all messages in a given chunk are internally in time order. It can then decompress each chunk progressively while emitting messages, rather than decompressing the entire chunk up front.
  • You can detect the above case by looking at the message index records after a given chunk.
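The detection step in the last bullet could be sketched as follows. This is an illustrative Python function, not part of any MCAP library: it assumes the message index records for a chunk have already been parsed into `(log_time, offset)` pairs per channel, with `offset` being the position of the message in the uncompressed chunk data. The name `chunk_is_time_ordered` is hypothetical.

```python
def chunk_is_time_ordered(message_indexes: dict[int, list[tuple[int, int]]]) -> bool:
    """Return True if messages appear in the chunk in non-decreasing log_time order.

    message_indexes maps channel_id -> list of (log_time, offset) entries
    (an assumed, already-parsed form of the message index records).
    """
    # Merge the entries of all channels, then order them by physical position.
    entries = [entry for records in message_indexes.values() for entry in records]
    entries.sort(key=lambda entry: entry[1])
    times = [log_time for log_time, _ in entries]
    # Time-ordered means no message has an earlier log_time than its predecessor.
    return all(a <= b for a, b in zip(times, times[1:]))
```

A reader could run this check before touching the compressed data and choose streaming decompression only when it returns True.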

@wkalt
Contributor

wkalt commented Oct 28, 2022

Additional notes -

We've discussed the strategy above, as well as an alternative of using attachments to attach topics intended for preloading to the end of the file. The way I imagine that would work would be an mcap optimize [file] [-t topic] command, which would scan the input file in linear order, write to memory or disk a new mcap file containing only the specified topics (in original file order), and attach that mcap to the end of the original mcap with a specific naming convention (which would probably need to go in the spec or some implementation notes eventually).

A reader can leverage this by reading the EOF index of the attachment to determine what topics are preloadable, and if the topic being sought is in that list, download that entire attachment prior to downloading the main file.
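The reader-side decision described here might look roughly like the sketch below. Everything in it is an assumption for illustration: the `FetchPlan` type and `plan_fetch` name are hypothetical, and the set of preloadable topics is taken as already read from the attachment's EOF index (the naming convention for the attachment is not defined in this thread, so it is left out).

```python
from typing import NamedTuple


class FetchPlan(NamedTuple):
    # Hypothetical result type for this sketch.
    fetch_attachment_first: bool
    from_attachment: set[str]
    from_main_file: set[str]


def plan_fetch(requested_topics: set[str], preloadable_topics: set[str]) -> FetchPlan:
    """Decide which topics to read from the attached MCAP and which from the main file.

    preloadable_topics is assumed to come from the attachment's EOF index.
    """
    from_attachment = requested_topics & preloadable_topics
    from_main_file = requested_topics - preloadable_topics
    # Download the whole attachment up front only if it covers a requested topic.
    return FetchPlan(bool(from_attachment), from_attachment, from_main_file)
```

Topics not covered by the attachment still go through the normal row-oriented read of the main file, so its behavior is unchanged.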

The advantages of this approach as I see them are,

  • It does not take any extra memory to read or write, vs the original file
  • Row-oriented read performance characteristics of the original file are entirely preserved, no matter how many channels are in the original file.
  • It can be implemented without overwriting the original file (pending implementation of CLI: fast summary section rewrites #677)
  • It means all preloaded topics can be fetched with a single linear read of the attachment (the expectation being that the typical topics that require preloading, for plots and things, are not the heavy topics).

Disadvantages I see are,

  • If you try to put large topics in the end-of-file attachment, you will suffer the same problem as in the original case (though this is also true of trying to preload large messages from the "column-based" layout)
  • It requires a new convention to be documented in the spec. It's unclear if this should be an mcap feature or a studio/foxglove-cli feature.
  • It is not easy for an automated process to know ahead of time which topics to include in this section. The selection would be org-specific, and may change over time within an org. So either heuristics would need to be applied in the command (which would probably benefit from specification: store per-topic size statistics #384) or else operationalizing it may be tricky for users. That said, there is relatively little drawback to overfetching small topics. It's not unreasonable to think the typical size of one of these attachments will be in single-digit megabytes or less. Once you get that small, the overhead of requesting the data becomes comparable to the transfer time.
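One possible heuristic for the last point, sketched in Python under stated assumptions: per-topic byte totals are available (for example from statistics like those proposed in #384), and the command takes a total size budget for the attachment. The function names and the smallest-first greedy rule are illustrative choices, not something decided in this thread.

```python
def select_preload_topics(topic_sizes: dict[str, int], budget_bytes: int) -> list[str]:
    """Pick topics smallest-first until the attachment would exceed budget_bytes.

    topic_sizes maps topic name -> total message bytes (assumed to be known).
    """
    selected: list[str] = []
    total = 0
    # Sort by size, then name, so the result is deterministic.
    for topic, size in sorted(topic_sizes.items(), key=lambda kv: (kv[1], kv[0])):
        if total + size > budget_bytes:
            break
        selected.append(topic)
        total += size
    return selected


def transfer_seconds(size_bytes: int, bandwidth_bytes_per_s: float) -> float:
    """Time to move size_bytes at the given bandwidth, ignoring request latency."""
    return size_bytes / bandwidth_bytes_per_s
```

For a sense of scale, `transfer_seconds(4_000_000, 100_000_000)` is 0.04 s; the 100 MB/s bandwidth is an assumed figure, but at that rate a 4 MB attachment transfers in about the time of a single long-distance round trip.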

I think it would be worthwhile to try both approaches and see how they compare on local and remote files.

@wkalt
Contributor

wkalt commented Oct 28, 2022

Here is a POC of the concept described above: #687

@jhurliman
Contributor

We've discussed a hybrid of row-oriented vs column-oriented, where topics over a given size threshold (is this threshold based on bytes/sec or % of file or total bytes in a file or message size?) are separated out into their own chunks and all smaller topics remain in interleaved chunks.
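A minimal sketch of the hybrid split, in Python. The choice of threshold metric is left open above; this example assumes total bytes per topic purely for illustration, and the `partition_topics` name is hypothetical.

```python
def partition_topics(
    topic_sizes: dict[str, int], threshold_bytes: int
) -> tuple[list[str], list[str]]:
    """Split topics into (own_chunks, interleaved) by total size.

    Topics strictly larger than threshold_bytes get their own chunks;
    everything else stays in the shared, interleaved chunks.
    The total-bytes metric is an assumption of this sketch.
    """
    own_chunks = sorted(t for t, size in topic_sizes.items() if size > threshold_bytes)
    interleaved = sorted(t for t, size in topic_sizes.items() if size <= threshold_bytes)
    return own_chunks, interleaved
```

Swapping in a different metric (bytes/sec, % of file, or message size) would only change how the values in `topic_sizes` are computed.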

@james-rms
Collaborator

Closing as we decided not to merge this into the mcap CLI tool.
