CLI: add columnize subcommand #747

james-rms · 2022-11-25T05:45:36Z

Public-Facing Changes

Description

Fixes #686

wkalt · 2022-11-26T19:06:02Z

If we end up baking this functionality into the tool, I'd propose calling it something more like "sort" or "group by" than "columnize", or some combination of the two. Sort can also be used to get a physically time-ordered file, but some modifier seems needed to convey the partitioning into group-wise chunks, which doesn't apply to a time sort. The reason is that storing something in a columnar format specifically means that columns (e.g., fields in ROS schemas) are stored contiguously, whereas this still stores rows contiguously, just in a different order. The concept is similar, but the answer to "is MCAP a columnar format" would definitely be "no" in standard usage of the term. A row-oriented database stores rows in different contiguous tables, just like the chunks here, but that doesn't make it columnar.
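To make the distinction above concrete, here is a minimal sketch in plain Python. It does not use the MCAP libraries; the topics, field names, and both helper functions are hypothetical and exist only for illustration. `group_by_topic` reorders whole messages (rows) so that each topic's messages sit together, which is roughly what this subcommand does at the chunk level, while `columnize` shows what a truly columnar layout would look like, with one contiguous array per field.

```python
from collections import defaultdict

# Hypothetical messages: (topic, log_time, payload fields).
messages = [
    ("/imu", 1, {"x": 0.1, "y": 0.2}),
    ("/gps", 2, {"lat": 10.0, "lon": 20.0}),
    ("/imu", 3, {"x": 0.3, "y": 0.4}),
    ("/gps", 4, {"lat": 11.0, "lon": 21.0}),
]


def group_by_topic(msgs):
    """Reorder whole messages so each topic's rows are contiguous."""
    groups = defaultdict(list)
    for topic, log_time, fields in msgs:
        # Each message is kept intact; only its position changes.
        groups[topic].append((log_time, fields))
    return dict(groups)


def columnize(rows):
    """Columnar layout: split rows apart into one array per field."""
    columns = defaultdict(list)
    for _log_time, fields in rows:
        for name, value in fields.items():
            columns[name].append(value)
    return dict(columns)


grouped = group_by_topic(messages)       # still row-oriented
imu_columns = columnize(grouped["/imu"])  # genuinely columnar
```

In the grouped form every message is still stored whole, so reading one field still means reading every message on that topic. Only the second form stores each field contiguously.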

In terms of implementation, I think memory usage is going to be the main thing to evaluate. The Go reader (and presumably the others) holds whatever chunks are backing the messages fully decompressed in its heap, so if you have hundreds (or a thousand or more) topics and you are recording large files (such that you have enough messages on each topic to make full chunks), you can end up needing gigabytes of RAM to read one of these.
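A rough back-of-the-envelope version of that concern, as a sketch only: it assumes that a time-ordered read of a topic-grouped file keeps one fully decompressed chunk resident per topic, and the 4 MiB chunk size is a hypothetical figure chosen for illustration, not a measured or documented value. The function name is likewise made up.

```python
MIB = 1024 * 1024
GIB = 1024 * MIB


def worst_case_chunk_memory(num_topics, chunk_size_bytes):
    """Upper bound on heap used by decompressed chunks.

    Assumption: one decompressed chunk is held per topic at the same time.
    """
    return num_topics * chunk_size_bytes


# Hypothetical inputs: 1000 topics, 4 MiB decompressed chunks.
estimate = worst_case_chunk_memory(1000, 4 * MIB)
print(f"{estimate / GIB:.1f} GiB")  # prints "3.9 GiB"
```

Under the same assumptions, a file whose chunks are already in time order would need only a few chunks resident at once, which is why the two layouts call for such different memory budgets.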

We have seen MCAP files in the wild that are both in the tens of GB in size and have over a thousand topics inside, so the scenario isn't hypothetical. The main issue that results from this, IMO, is that tooling and processes built for MCAP files need to size memory allocations to handle two very different kinds of requirements, so a process that handles large files with 1 GB now might need 8 GB or more if both types of input are anticipated.

james-rms · 2022-11-30T23:25:01Z

Closing for now.

james-rms added 9 commits November 24, 2022 11:46

add columnize

edd4862

factor out to empty chunkwriter

760f928

wip, more on this chunkwriter

05fb2cc

always initialize first chunkwriter

d1286b5

chunkwriter is working

bc75c81

show topic names in chunks

f7ccb6c

implement all columnizers

7ae6bae

lint

aa9785a

fix docstring

ca4056c

james-rms force-pushed the jrms/mcap-columnar branch from fb459c3 to ca4056c Compare November 25, 2022 09:12

james-rms added 3 commits November 28, 2022 08:15

it's better now

a055fa6

move to groupby

08eccd3

remove more references to columns

90edf15

james-rms closed this Nov 30, 2022

wkalt mentioned this pull request Dec 4, 2022

(POC) optimize subcommand for preloading #687

Closed

jtbandes deleted the jrms/mcap-columnar branch June 9, 2023 19:54
