CLI: add columnize subcommand #747
Conversation
Force-pushed fb459c3 to ca4056c
If we end up baking this functionality into the tool, I'd propose calling it something more like "sort" or "group by" than columnize (or some combination of sort and group by - sort can also be used to get a physically time-ordered file, but some modifier seems needed to convey the partitioning into group-wise chunks, which doesn't apply in the case of a time sort). The reason is that storing something in a columnar format specifically means that columns (e.g. fields in ROS schemas) would be stored contiguously, whereas this is still storing rows contiguously, just in a different order. The concept is similar, but the answer to "is MCAP a columnar format" would definitely be "no" in standard usage of the term. A row-oriented database stores rows in different contiguous tables just like the chunks here - but that doesn't make it columnar.

In terms of implementation, I think memory usage is going to be the main thing to evaluate. The Go reader (and presumably the others) will hold fully decompressed in its heap whatever chunks are backing the messages, so if you have hundreds (or a thousand-plus) topics and you are recording large files (such that you have enough messages on your topics to fill chunks), you can end up needing gigabytes of RAM to read one of these. We have seen MCAP files in the wild that are both tens of GB in size and have over a thousand topics inside, so the scenario isn't hypothetical.

The main issue that results from this, IMO, is that tooling and processes built for MCAP files need to size memory allocations for two very different kinds of input - a process that handles large files in 1 GB today might need 8 GB or more if both types of input are anticipated.
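To make the scaling concrete, here is a back-of-envelope sketch of the worst case described above. The 4 MiB uncompressed chunk size and the one-resident-chunk-per-topic model are assumptions for illustration, not measured values from the reader:

```go
package main

import "fmt"

func main() {
	// Assumption: after columnize-style grouping there is roughly one
	// chunk per topic backing the messages in flight, and the reader
	// keeps each of those chunks fully decompressed in its heap.
	const uncompressedChunkSize = 4 << 20 // assume 4 MiB per chunk

	for _, topics := range []int{10, 100, 1000} {
		resident := topics * uncompressedChunkSize
		fmt.Printf("%4d topics -> ~%d MiB resident\n", topics, resident>>20)
	}
}
```

Under these assumptions a thousand-topic file already implies roughly 4 GiB of decompressed chunk data resident at once, which matches the "gigabytes of RAM" concern above.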
Closing for now.
Public-Facing Changes
Description
Fixes #686