Introduce a source data cache layer #645

Open
6 of 7 tasks
yahoNanJing opened this issue Feb 1, 2023 · 2 comments
Labels
enhancement New feature or request

Comments

@yahoNanJing
Contributor

yahoNanJing commented Feb 1, 2023

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

In a cloud-native architecture with completely stateless executors, each executor has to fetch its source data from remote storage. When the volume of source data is very large, this step easily hits the network throughput bottleneck and takes too long. For example, for a compute layer of 20 nodes with 10 Gb/s of bandwidth per node, fetching 100 GB of source data takes at least 4 s, while the remaining steps of processing that 100 GB often finish in less than 1 s. It would therefore be better to introduce a cache layer into the cloud-native architecture, making the executors weakly stateful so they can cache hot data on local disk, as Snowflake does.
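The 4 s figure follows from dividing the data volume by the aggregate cluster bandwidth. A minimal sketch of that arithmetic, assuming the quoted 10Gb is gigabits per second per node and that every node can pull at full line rate (the function name is hypothetical):

```python
def min_fetch_seconds(data_gb: float, nodes: int, node_bandwidth_gbit: float) -> float:
    """Lower bound on the time to pull `data_gb` gigabytes from remote storage.

    Assumes perfect parallelism: all nodes fetch at full bandwidth at once.
    """
    # Convert per-node gigabits/s to gigabytes/s (8 bits per byte), then sum over nodes.
    aggregate_gbyte_per_s = nodes * node_bandwidth_gbit / 8
    return data_gb / aggregate_gbyte_per_s


# 20 nodes x 10 Gb/s = 200 Gb/s = 25 GB/s, so 100 GB needs at least 4 s.
print(min_fetch_seconds(100, 20, 10))  # 4.0
```

Real fetches will be slower than this bound, since storage-side throttling and request overhead are ignored here.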

Describe the solution you'd like

https://docs.google.com/document/d/1iMFv3S-TuiwBoTzp4KX0Ltrrenm86ULr0q_PwIKdW6g/edit?usp=sharing

To achieve this goal, we need to finish the following tasks:

Describe alternatives you've considered

Additional context

@collimarco

I am definitely interested in this feature, thanks for posting this. I came to exactly the same conclusions while testing on large datasets stored on S3. Bandwidth is the bottleneck, and using a hashing algorithm to send requests for the same data to the same executors, which cache it on local disk, is definitely the best solution.
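The routing idea in this comment can be illustrated with a consistent-hash ring. This is only a sketch under assumptions: the `HashRing` class, the executor names, and the file paths are hypothetical and are not taken from the project's code or the linked design document.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    # Stable 64-bit hash; Python's built-in hash() is randomized per process.
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Maps a source file path to the executor expected to hold it in its disk cache."""

    def __init__(self, executors, vnodes=64):
        if not executors:
            raise ValueError("at least one executor is required")
        # Each executor owns several virtual points so load spreads more evenly.
        self._ring = sorted(
            (_hash(f"{ex}#{i}"), ex) for ex in executors for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def executor_for(self, file_path: str) -> str:
        # Walk clockwise to the first point after the key, wrapping at the end.
        idx = bisect.bisect_right(self._points, _hash(file_path)) % len(self._ring)
        return self._ring[idx][1]


ring = HashRing(["exec-1", "exec-2", "exec-3"])
print(ring.executor_for("s3://bucket/part-0.parquet"))
```

Because the same path always maps to the same executor, repeated scans of hot files hit that executor's local cache instead of remote storage. When an executor leaves the ring, only the paths it owned are reassigned, so the remaining caches stay valid.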

@BertHartm

I see that the only unchecked box (#833) has been merged. Does that mean this work is complete, or is there more to be done to achieve the goal?
