Introduce a source data cache layer #645

yahoNanJing · 2023-02-01T11:03:29Z

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

In a cloud-native architecture with completely stateless executors, each executor must fetch its source data from remote storage. When the source data is very large, this step easily hits the network throughput bottleneck and dominates the run time. For example, a compute layer of 20 nodes with 10 Gb/s of bandwidth per node has an aggregate throughput of 25 GB/s, so fetching 100 GB of source data takes at least 4 s, while the remaining steps of processing that 100 GB often finish in under 1 s. It is therefore better to introduce a cache layer into the cloud-native architecture, making the executors weakly stateful so that they can cache hot data on local disk, as Snowflake does.
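The 4 s figure is a lower bound that follows from the aggregate bandwidth alone. A minimal sketch of the arithmetic, assuming "10Gb" means 10 gigabits per second per node and that every node can pull at full line rate (the function name is hypothetical):

```python
def min_fetch_seconds(data_gbytes: float, nodes: int, node_gbit_per_s: float) -> float:
    """Lower bound on the time to fetch `data_gbytes` across `nodes` executors."""
    # Convert per-node gigabits/s to gigabytes/s (8 bits per byte), then sum over nodes.
    aggregate_gbytes_per_s = nodes * node_gbit_per_s / 8
    return data_gbytes / aggregate_gbytes_per_s


# 20 nodes * 10 Gb/s = 200 Gb/s = 25 GB/s, so 100 GB needs at least 4 s.
print(min_fetch_seconds(100, 20, 10))  # 4.0
```

Real fetches will be slower than this bound, since it ignores request latency, storage-side throttling, and skew between nodes.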

Describe the solution you'd like

https://docs.google.com/document/d/1iMFv3S-TuiwBoTzp4KX0Ltrrenm86ULr0q_PwIKdW6g/edit?usp=sharing

To achieve this goal, we need to finish the following tasks:

(Executor) Introduce a 2-tiered cache manager that caches data from remote storage in local memory and on local disk
(Scheduler) Add ConsistentHash for node topology management #830
(Scheduler) Implement 3-phase consistent hash based task assignment policy #833
(Scheduler) Add executor self-registration mechanism in the heartbeat service #648, #649
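To make the first task concrete, here is a rough sketch of what a 2-tiered (memory, then disk) read-through cache could look like. It is not the design from the linked document: the class name `TwoTierCache`, the `fetch_remote` callback, and the LRU policy for the memory tier are all assumptions for illustration, and the disk tier is left unbounded to keep the example short.

```python
import hashlib
import tempfile
from collections import OrderedDict
from pathlib import Path
from typing import Callable


class TwoTierCache:
    """Read-through cache: memory (LRU) first, then local disk, then remote storage."""

    def __init__(self, disk_dir: Path, fetch_remote: Callable[[str], bytes],
                 mem_capacity: int = 2):
        self.disk_dir = disk_dir
        self.fetch_remote = fetch_remote  # hypothetical callback, e.g. an object-store GET
        self.mem_capacity = mem_capacity  # max number of objects kept in memory
        self.mem = OrderedDict()          # key -> bytes, ordered oldest to newest

    def _disk_path(self, key: str) -> Path:
        # Hash the key so that any object path maps to a safe file name.
        return self.disk_dir / hashlib.sha256(key.encode()).hexdigest()

    def _put_mem(self, key: str, data: bytes) -> None:
        self.mem[key] = data
        self.mem.move_to_end(key)
        while len(self.mem) > self.mem_capacity:
            self.mem.popitem(last=False)  # evict the least recently used entry

    def get(self, key: str) -> bytes:
        if key in self.mem:               # tier 1: memory hit
            self.mem.move_to_end(key)
            return self.mem[key]
        path = self._disk_path(key)
        if path.exists():                 # tier 2: disk hit
            data = path.read_bytes()
        else:                             # miss: go to remote storage, persist locally
            data = self.fetch_remote(key)
            path.write_bytes(data)
        self._put_mem(key, data)
        return data
```

An entry evicted from memory is still served from local disk, so the remote storage is contacted only once per object as long as the executor keeps its disk.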

Describe alternatives you've considered

Additional context

collimarco · 2023-06-29T10:19:23Z

I am definitely interested in this feature; thanks for posting this. I came to exactly the same conclusions while testing on large datasets stored on S3. Bandwidth is the bottleneck, and using a hashing algorithm to send requests to the same executors, which cache the data on disk, is definitely the best solution.
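For readers unfamiliar with the idea, routing by hash can be sketched with a small consistent hash ring. This is only an illustration and not the scheduler's actual implementation: the class and method names are hypothetical, MD5 is used merely as a convenient stable hash, and the 3-phase assignment policy from #833 is not modeled.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    # Stable 64-bit hash; Python's built-in hash() is randomized per process.
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")


class ConsistentHashRing:
    """Maps source file keys to executors so the same file lands on the same executor."""

    def __init__(self, virtual_nodes: int = 100):
        self.virtual_nodes = virtual_nodes  # ring points per executor, for balance
        self._points = []                   # sorted ring positions
        self._owners = {}                   # ring position -> executor id

    def add(self, executor_id: str) -> None:
        for i in range(self.virtual_nodes):
            point = _hash(f"{executor_id}#{i}")
            if point in self._owners:       # extremely unlikely collision: skip it
                continue
            bisect.insort(self._points, point)
            self._owners[point] = executor_id

    def remove(self, executor_id: str) -> None:
        self._points = [p for p in self._points if self._owners[p] != executor_id]
        self._owners = {p: o for p, o in self._owners.items() if o != executor_id}

    def locate(self, key: str) -> str:
        if not self._points:
            raise LookupError("no executors registered")
        # First ring point clockwise from the key's hash, wrapping around at the end.
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._points)
        return self._owners[self._points[idx]]
```

The useful property is that when an executor leaves, only the files it owned are reassigned; every other file keeps its executor, so the other executors' disk caches stay warm.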

BertHartm · 2024-05-10T14:34:35Z

I see that the only unchecked box (#833) has been merged. Does that mean this work is complete, or is there more to be done to achieve the goal?

yahoNanJing added the enhancement New feature or request label Feb 1, 2023

collimarco mentioned this issue Jun 29, 2023

Remove ExecutorReservation and change the task assignment philosophy from executor first to task first #823

Merged
