CLI: Ability to write in columnar order #686

amacneil · 2022-10-28T00:10:11Z

Ability to create mcap files (probably using mcap merge cli) in columnar (i.e. channel oriented) order.

e.g.

no more than one channel per chunk
fully write each channel before writing the next channel

This would allow more efficient remote access of multi-channel files (readers can easily fetch an entire topic without scanning the whole file).
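
For illustration, the two constraints above can be sketched in Python. This is a minimal sketch under stated assumptions, not how the mcap CLI works: `Message`, `columnar_chunks`, and `max_chunk_bytes` are hypothetical names, and the sketch buffers every message in memory, which a real tool would probably want to avoid for large files.

```python
from collections import defaultdict
from typing import Iterable, Iterator, NamedTuple


class Message(NamedTuple):
    # Hypothetical in-memory stand-in for an MCAP message.
    channel_id: int
    log_time: int  # nanoseconds
    data: bytes


def columnar_chunks(
    messages: Iterable[Message], max_chunk_bytes: int
) -> Iterator[list[Message]]:
    """Yield chunks holding one channel each, finishing a channel before the next."""
    by_channel: dict[int, list[Message]] = defaultdict(list)
    for msg in messages:
        by_channel[msg.channel_id].append(msg)

    for channel_id in sorted(by_channel):
        chunk: list[Message] = []
        size = 0
        # Keep each channel in log-time order within and across its chunks.
        for msg in sorted(by_channel[channel_id], key=lambda m: m.log_time):
            if chunk and size + len(msg.data) > max_chunk_bytes:
                yield chunk
                chunk, size = [], 0
            chunk.append(msg)
            size += len(msg.data)
        if chunk:
            yield chunk
```

Each yielded list would become one chunk in the output file, so a reader interested in a single channel only has to fetch a contiguous run of chunks.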

james-rms · 2022-10-28T00:26:34Z

Preliminary thoughts:

MCAPs laid out in column order are a poor fit for the standard Studio use case of playing back several topics in time order. A reader may need to decompress an entire chunk for each topic it needs before playing any messages, which could result in loading the entire uncompressed message content into memory.
This can be mitigated if the reader knows, before decompressing, that all messages in a given chunk are in time order internally. It can then decompress each chunk progressively, emitting messages as it goes, rather than decompressing the entire chunk up front.
A reader can detect this case by looking at the message index records that follow a given chunk.
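
As a rough sketch of that check, assuming the message index records for one chunk have already been parsed into plain Python data: each record maps a channel to a list of (log_time, offset) pairs, where the offset is the position of the message in the uncompressed chunk. The `MessageIndex` alias and `chunk_is_time_ordered` are hypothetical names for illustration, not an API of any mcap library.

```python
from typing import Mapping, Sequence

# Assumed shape: channel_id -> [(log_time, offset_in_uncompressed_chunk), ...]
MessageIndex = Mapping[int, Sequence[tuple[int, int]]]


def chunk_is_time_ordered(message_indexes: MessageIndex) -> bool:
    """Return True if messages appear in the chunk in non-decreasing log_time order."""
    entries = [entry for records in message_indexes.values() for entry in records]
    # Sort by offset to recover the physical order of messages in the chunk.
    entries.sort(key=lambda entry: entry[1])
    times = [log_time for log_time, _ in entries]
    return all(a <= b for a, b in zip(times, times[1:]))
```

If this returns True for a chunk, the reader could stream-decompress it and emit messages as they are decoded. If it returns False, the reader would have to fall back to decompressing the whole chunk before sorting.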

wkalt · 2022-10-28T01:22:22Z

Additional notes -

We've discussed the strategy above, as well as an alternative: using attachments to append the topics intended for preloading to the end of the file. The way I imagine it working is an mcap optimize [file] [-t topic] command that scans the input file in linear order, writes to memory or disk a new mcap file containing only the specified topics (in original file order), and attaches that mcap to the end of the original mcap under a specific naming convention (which would probably need to go in the spec or some implementation notes eventually).

A reader can leverage this by reading the EOF index of the attachment to determine what topics are preloadable, and if the topic being sought is in that list, download that entire attachment prior to downloading the main file.

The advantages of this approach as I see them are,

It does not take any extra memory to read or write, vs the original file
Row-oriented read performance characteristics of the original file are entirely preserved, no matter how many channels in the original file.
It can be implemented without overwriting the original file (pending implementation of CLI: fast summary section rewrites #677)
It means all preloaded topics can be fetched with a single linear read of the attachment (the expectation being that the topics that typically require preloading, e.g. for plots, are not the heavy topics).

Disadvantages I see are,

If you try to put large topics into the end-of-file attachment, you will suffer the same cost as in the original case (though this is also true of trying to preload large messages from the "column-based" layout).
It requires a new convention to be documented in the spec. It's unclear if this should be an mcap feature or a studio/foxglove-cli feature.
It is not easy for an automated process to know ahead of time which topics to include in this section. The selection would be org-specific, and may change over time within an org. So either heuristics would need to be applied in the command (which would probably benefit from specification: store per-topic size statistics #384) or else operationalizing it may be tricky for users. That said, there is relatively little drawback to overfetching small topics. It's not unreasonable to think the typical size of one of these attachments will be in the single-digit megabytes or less. Once you get that small, the overhead of requesting the data becomes comparable to the transfer time.
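
One possible heuristic, sketched here purely as an illustration: given per-topic byte totals (which would have to come from something like the statistics proposed in #384), greedily take the smallest topics until a size budget is reached. The function name, the `budget_bytes` parameter, and the topic names in the usage example are all assumptions, not part of any existing command.

```python
def pick_preload_topics(topic_bytes: dict[str, int], budget_bytes: int) -> list[str]:
    """Greedily select the smallest topics whose combined size fits the budget."""
    chosen: list[str] = []
    used = 0
    # Smallest first; ties are broken by topic name so the result is deterministic.
    for topic, size in sorted(topic_bytes.items(), key=lambda kv: (kv[1], kv[0])):
        if used + size > budget_bytes:
            break
        chosen.append(topic)
        used += size
    return chosen


sizes = {"/camera": 500_000_000, "/imu": 2_000_000, "/gps": 100_000, "/diagnostics": 3_000_000}
print(pick_preload_topics(sizes, budget_bytes=8_000_000))
```

With a budget in the single-digit megabytes, a heavy topic such as the hypothetical /camera above is left out, which matches the expectation that preloaded topics are the light ones.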

I think it would be worthwhile to try both approaches and see how they compare on local and remote files.

wkalt · 2022-10-28T02:45:19Z

here is a POC of the concept described above: #687

jhurliman · 2022-11-28T22:16:15Z

We've discussed a hybrid of row-oriented and column-oriented layouts, where topics over a given size threshold (is this threshold based on bytes/sec, % of file, total bytes in a file, or message size?) are separated out into their own chunks and all smaller topics remain in interleaved chunks.
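
A minimal sketch of that split, assuming the threshold is on total bytes per topic (only one of the candidate metrics listed above; the question of which metric to use is still open). `split_layout` and `threshold_bytes` are hypothetical names.

```python
def split_layout(
    topic_bytes: dict[str, int], threshold_bytes: int
) -> tuple[list[str], list[str]]:
    """Partition topics into (own-chunk topics, interleaved topics) by total size."""
    # Topics strictly over the threshold get dedicated, single-channel chunks.
    own_chunks = sorted(t for t, size in topic_bytes.items() if size > threshold_bytes)
    # Everything else stays in ordinary time-interleaved chunks.
    interleaved = sorted(t for t, size in topic_bytes.items() if size <= threshold_bytes)
    return own_chunks, interleaved
```

Swapping in a different metric (bytes/sec, fraction of file, or message size) would only change the values passed in `topic_bytes` and the unit of the threshold.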

james-rms · 2022-12-20T08:36:23Z

Closing as we decided not to merge this into the mcap CLI tool.

amacneil added the feature New feature or request label Oct 28, 2022

james-rms self-assigned this Oct 28, 2022

jtbandes added the cli label Nov 11, 2022

james-rms mentioned this issue Nov 28, 2022

CLI: add columnize subcommand #747

Closed

james-rms closed this as completed Dec 20, 2022
