CLI: add columnize subcommand #747

Closed
wants to merge 12 commits into from
james-rms
Collaborator

@james-rms james-rms commented Nov 25, 2022

Public-Facing Changes

Description

Fixes #686

@wkalt
Contributor

wkalt commented Nov 26, 2022

If we end up baking this functionality into the tool, I'd propose calling it something more like "sort" or "group by" than "columnize", or some combination of the two. Sort can also be used to get a physically time-ordered file, but some modifier seems needed to convey the partitioning into group-wise chunks, which doesn't apply in the case of a time sort.

The reason is that storing something in a columnar format specifically means that columns (e.g. fields in ROS schemas) are stored contiguously, whereas this still stores rows contiguously, just in a different order. The concept is similar, but the answer to "is MCAP a columnar format" would definitely be "no" in standard usage of the term. A row-oriented database stores rows in different contiguous tables, just like the chunks here, but that doesn't make it columnar.
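To make the naming distinction concrete, here is a minimal Go sketch of what "group by topic, then sort by time" means. The `Message` type and `groupByTopic` function are hypothetical stand-ins for illustration, not part of the mcap library or this PR. Each message stays a whole row; only the order and partitioning change, which is why the result is not columnar.

```go
package main

import (
	"fmt"
	"sort"
)

// Message is a hypothetical stand-in for an MCAP message record.
type Message struct {
	Topic   string
	LogTime uint64
	Data    []byte
}

// groupByTopic partitions messages by topic and sorts each group by log time.
// Every message (row) is kept intact; fields are never split into columns.
func groupByTopic(msgs []Message) [][]Message {
	byTopic := make(map[string][]Message)
	for _, m := range msgs {
		byTopic[m.Topic] = append(byTopic[m.Topic], m)
	}

	// Sort topic names so the output order is deterministic.
	topics := make([]string, 0, len(byTopic))
	for t := range byTopic {
		topics = append(topics, t)
	}
	sort.Strings(topics)

	groups := make([][]Message, 0, len(topics))
	for _, t := range topics {
		g := byTopic[t]
		sort.SliceStable(g, func(i, j int) bool { return g[i].LogTime < g[j].LogTime })
		groups = append(groups, g)
	}
	return groups
}

func main() {
	msgs := []Message{
		{Topic: "/imu", LogTime: 3},
		{Topic: "/camera", LogTime: 1},
		{Topic: "/imu", LogTime: 2},
	}
	for _, g := range groupByTopic(msgs) {
		fmt.Println(g[0].Topic, len(g))
	}
}
```

A truly columnar layout would instead store, say, all `LogTime` values contiguously and all `Data` payloads contiguously, which this does not do.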

In terms of implementation, I think memory usage is going to be the main thing to evaluate. The Go reader (and presumably the others) holds the fully decompressed chunks backing the messages it is reading in its heap. So if you have hundreds (or a thousand or more) topics and you are recording large files (such that each topic has enough messages to fill whole chunks), you can end up needing gigabytes of RAM to read one of these.
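A rough back-of-the-envelope sketch of that concern, in Go. It assumes the worst case where a time-ordered read needs one decompressed chunk per topic resident at once, and it uses 4 MiB as an illustrative decompressed chunk size; both the function name and the chunk size are assumptions for illustration, not values taken from the reader.

```go
package main

import "fmt"

// estimatePeakChunkBytes returns a worst-case estimate of heap held by
// decompressed chunks, assuming one full chunk per topic is resident at once.
// This is a hypothetical model, not a measurement of any real reader.
func estimatePeakChunkBytes(topics int, chunkBytes uint64) uint64 {
	return uint64(topics) * chunkBytes
}

func main() {
	const chunkBytes = 4 << 20 // assumed 4 MiB decompressed chunk size
	for _, topics := range []int{10, 100, 1000} {
		est := estimatePeakChunkBytes(topics, chunkBytes)
		fmt.Printf("%d topics: ~%.2f GiB\n", topics, float64(est)/(1<<30))
	}
}
```

Under these assumptions, a thousand topics already puts the estimate in the multi-gigabyte range, while a file with a handful of topics stays in the tens of megabytes.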

We have seen MCAP files in the wild that are in the tens of GB in size and have over a thousand topics, so the scenario isn't hypothetical. The main issue that results from this, IMO, is that tooling and processes built for MCAP files need to size memory allocations to handle two very different kinds of input, so a process that handles large files with 1GB today might need 8GB or more if both types of input are anticipated.

@james-rms
Collaborator Author

Closing for now.

@james-rms james-rms closed this Nov 30, 2022
@jtbandes jtbandes deleted the jrms/mcap-columnar branch June 9, 2023 19:54
Development

Successfully merging this pull request may close these issues.

CLI: Ability to write in columnar order