Improve content store compression using preset dictionaries #345

jan-niestadt · 2022-09-05T11:31:06Z

zlib supports preset dictionaries, which can improve compression if you know something about the structure of your data ahead of time. See RFC 1950: https://www.ietf.org/rfc/rfc1950.txt

In our case, using part of the first document stored as the preset dictionary for each block in the content store would probably improve the compression ratio.
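For illustration, here is a minimal sketch of what a preset dictionary does, written in Python because its standard `zlib` module exposes the feature through the `zdict` argument. The helper names (`compress_with_dict`, `decompress_with_dict`) and the sample data are made up for this example and are not part of the content store code.

```python
import zlib


def compress_with_dict(data: bytes, dictionary: bytes) -> bytes:
    """Compress one block, priming the compressor with a preset dictionary."""
    comp = zlib.compressobj(level=9, zdict=dictionary)
    return comp.compress(data) + comp.flush()


def decompress_with_dict(blob: bytes, dictionary: bytes) -> bytes:
    """Decompress a block that was written with the same preset dictionary."""
    decomp = zlib.decompressobj(zdict=dictionary)
    return decomp.decompress(blob) + decomp.flush()


# Hypothetical sample data: the dictionary holds markup we expect to recur.
dictionary = b'<w lemma="" pos="NOUN"></w> <w lemma="" pos="DET"></w>'
block = b'<w lemma="cat" pos="NOUN">cat</w> <w lemma="the" pos="DET">the</w>'

plain = zlib.compress(block, 9)
primed = compress_with_dict(block, dictionary)

# Round trip only works when the reader supplies the same dictionary.
restored = decompress_with_dict(primed, dictionary)
print(len(block), len(plain), len(primed), restored == block)
```

The dictionary is not stored in the compressed stream, so it has to be kept alongside the blocks and supplied again when reading.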

jan-niestadt · 2022-09-05T11:32:17Z

(comment in doc/index-formats/integrated.md:) A reasonable approach could be to take a chunk from the middle of the first file added (the middle, to increase the chance we're inside actual text rather than metadata) and use that as the dictionary for the entire segment. This should ensure common strings (e.g. XML tags, attributes, common words) are stored more efficiently in each block.
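The approach described above could be sketched roughly as follows. This is Python for brevity, not the project's actual implementation; `middle_chunk`, `compress_segment`, `decompress_block`, the 32 KiB dictionary size and the 4096-byte block size are all assumptions chosen for the example. The dictionary size is capped at about 32 KiB because that is deflate's maximum window size, so a longer dictionary would not help.

```python
import zlib

DICT_SIZE = 32 * 1024  # assumption: roughly deflate's maximum window size


def middle_chunk(document: bytes, size: int = DICT_SIZE) -> bytes:
    """Return up to `size` bytes taken from the middle of the document."""
    if len(document) <= size:
        return document
    start = (len(document) - size) // 2
    return document[start:start + size]


def compress_segment(documents, block_size=4096):
    """Compress every document of a segment in fixed-size blocks.

    The dictionary comes from the middle of the first document and is
    shared by all blocks. Returns (dictionary, list of compressed blocks).
    """
    dictionary = middle_chunk(documents[0])
    blocks = []
    for doc in documents:
        for offset in range(0, len(doc), block_size):
            comp = zlib.compressobj(level=9, zdict=dictionary)
            chunk = doc[offset:offset + block_size]
            blocks.append(comp.compress(chunk) + comp.flush())
    return dictionary, blocks


def decompress_block(blob: bytes, dictionary: bytes) -> bytes:
    """Decompress a single block using the segment's dictionary."""
    decomp = zlib.decompressobj(zdict=dictionary)
    return decomp.decompress(blob) + decomp.flush()
```

Because every block is compressed independently against the same dictionary, any block can still be read on its own, provided the dictionary is stored once per segment.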

jan-niestadt added the enhancement label Sep 5, 2022
