Improve content store compression using preset dictionaries #345

Open
jan-niestadt opened this issue Sep 5, 2022 · 1 comment

Comments

@jan-niestadt
Member

zlib supports preset dictionaries, which can improve compression if you know something about the structure of your data ahead of time. See https://www.ietf.org/rfc/rfc1950.txt

In our case, using part of the first document stored as the preset dictionary for each block in the content store would probably improve the compression ratio.
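
To illustrate the mechanism, here is a minimal Python sketch using the standard `zlib` module's `zdict` parameter. It is not the content store's actual code; the helper names `compress_block` and `decompress_block` and the sample XML are made up for the example. The same dictionary bytes must be supplied when decompressing.

```python
import zlib


def compress_block(data: bytes, dictionary: bytes) -> bytes:
    """Compress one block with a preset dictionary (hypothetical helper)."""
    compressor = zlib.compressobj(zdict=dictionary)
    return compressor.compress(data) + compressor.flush()


def decompress_block(packed: bytes, dictionary: bytes) -> bytes:
    """Decompress a block that was compressed with the same dictionary."""
    decompressor = zlib.decompressobj(zdict=dictionary)
    return decompressor.decompress(packed) + decompressor.flush()


if __name__ == "__main__":
    # Made-up sample data: the dictionary contains strings the block is likely to repeat.
    dictionary = b'<w lemma="the" pos="DET">the</w> <w lemma="cat" pos="NOUN">cat</w> ' * 4
    block = (b'<w lemma="dog" pos="NOUN">dog</w> '
             b'<w lemma="the" pos="DET">the</w> '
             b'<w lemma="cat" pos="NOUN">cat</w>')

    with_dict = compress_block(block, dictionary)
    without_dict = zlib.compress(block)
    print(len(block), len(without_dict), len(with_dict))
    assert decompress_block(with_dict, dictionary) == block
```

The gain is largest for small blocks, where the compressor otherwise has little history to refer back to.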

@jan-niestadt
Member Author

(comment in doc/index-formats/integrated.md:) A reasonable approach could be to take a chunk from the middle of the first file added (middle to increase the chance we're inside actual text, not metadata) and use that as the dictionary for the entire segment. This should ensure common strings (e.g. XML tags, attributes, common words, etc.) are stored more efficiently in each block.
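
A rough sketch of that approach in Python, under some assumptions: the `Segment` class, `middle_chunk` helper, and in-memory block list are hypothetical stand-ins for the real segment format, and the 32 KiB cap reflects deflate's maximum window size (a larger dictionary would not help zlib).

```python
import zlib

# Deflate's sliding window is at most 32 KiB, so a larger dictionary is of no use.
MAX_DICT_SIZE = 32 * 1024


def middle_chunk(document: bytes, size: int = MAX_DICT_SIZE) -> bytes:
    """Return up to `size` bytes centred on the middle of the document."""
    size = min(size, len(document))
    start = (len(document) - size) // 2
    return document[start:start + size]


class Segment:
    """Hypothetical segment: one dictionary shared by every compressed block."""

    def __init__(self, first_document: bytes, dict_size: int = MAX_DICT_SIZE):
        # Assumption: the dictionary is stored once per segment, next to the blocks.
        self.dictionary = middle_chunk(first_document, dict_size)
        self.blocks: list[bytes] = []

    def add_block(self, data: bytes) -> None:
        compressor = zlib.compressobj(zdict=self.dictionary)
        self.blocks.append(compressor.compress(data) + compressor.flush())

    def read_block(self, index: int) -> bytes:
        decompressor = zlib.decompressobj(zdict=self.dictionary)
        return decompressor.decompress(self.blocks[index]) + decompressor.flush()
```

Because every block is compressed independently against the same dictionary, blocks can still be read in any order, as long as the dictionary is stored somewhere in the segment.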
