Use Theta sketch compression #15731

Open
AlexanderSaydakov opened this issue Jan 19, 2024 · 4 comments

Comments

@AlexanderSaydakov
Contributor

AlexanderSaydakov commented Jan 19, 2024

Theta sketch compression has been available in the Apache DataSketches library for quite some time. I would suggest enabling it in Druid. The simplest way would be to start serializing Theta sketches in the compressed format. Deserialization automatically detects and supports that format starting from datasketches-java-4.0.0 and datasketches-cpp-4.1.0 (May 2023).
There is some overhead in converting sketches to bytes, but in an I/O-bound system this is usually a reasonable CPU vs. I/O tradeoff. In other words, compression reduces I/O (and storage cost) at the price of more CPU, which is likely to yield an overall benefit.
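
As a rough illustration of that tradeoff, here is a toy Python sketch. It is not the DataSketches serialization format and does not use its API: `pack_deltas` and `unpack_deltas` are hypothetical helpers, and the hash values are simulated. It assumes only that a compact Theta sketch retains sorted 64-bit hash values below theta, so the gaps between neighbors can be stored in fewer than 64 bits at the cost of extra work when converting to bytes.

```python
import random


def pack_deltas(sorted_hashes):
    """Toy encoder: delta-encode sorted values, then bit-pack the deltas.

    Returns (bits_per_delta, count, payload_bytes).
    """
    deltas = []
    prev = 0
    for h in sorted_hashes:
        deltas.append(h - prev)
        prev = h
    # Every delta is stored with the width needed by the largest one.
    width = max(1, max(deltas).bit_length()) if deltas else 1
    acc = 0
    for d in deltas:
        acc = (acc << width) | d  # first delta ends up in the highest bits
    nbytes = (width * len(deltas) + 7) // 8
    return width, len(deltas), acc.to_bytes(nbytes, "big")


def unpack_deltas(width, count, payload):
    """Inverse of pack_deltas: recover the original sorted values."""
    acc = int.from_bytes(payload, "big")
    mask = (1 << width) - 1
    values = []
    prev = 0
    for i in range(count):
        delta = (acc >> (width * (count - 1 - i))) & mask
        prev += delta
        values.append(prev)
    return values


# Simulated sketch contents (assumption): 4096 retained hashes spread
# uniformly below a theta that is well under the full 63-bit range.
random.seed(1)
theta = 2**63 // 1000
hashes = sorted(random.sample(range(1, theta), 4096))

width, count, payload = pack_deltas(hashes)
raw_size = 8 * len(hashes)  # 8 bytes per value when stored uncompressed
print(f"bits per entry: {width}, raw: {raw_size} B, packed: {len(payload)} B")
```

The packed form is smaller because each gap needs fewer bits than a full 64-bit value, while `pack_deltas` does more work per entry than a plain copy of the values would. That is the same shape of CPU vs. I/O tradeoff, although the numbers from this toy say nothing about the library's actual ratios or timings.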

[Figure: Theta sketch compressed size]

@AlexanderSaydakov
Contributor Author

An alternative to compressing by default would be to make it configurable by the user: per column, per table, per installation, or in some other way. I am not sure this extra complexity is needed.

@abhishekagarwal87
Contributor

How much CPU overhead does the compression come with?

@AlexanderSaydakov
Contributor Author

Sorry, I could not find the measurements for Java. I will run them again, but that takes quite a while. Here are measurements for C++, just to give some idea.
[Figure: Theta sketch compression time, C++]

@AlexanderSaydakov
Contributor Author

This time covers only the conversion of sketches to bytes.
