Implement 3-phase consistent hash based task assignment policy #833

Merged
merged 4 commits into from
Jul 20, 2023

@yahoNanJing yahoNanJing commented Jun 30, 2023

Which issue does this PR close?

Closes #831.

Rationale for this change

What changes are included in this PR?

The three rounds of cache-aware task scheduling are as follows:

  1. Assign non-map stage tasks (those that do not scan files) in a round-robin way.
  2. Assign map stage tasks (those that scan files) using the consistent hashing policy, based on the hash value of the file name and the executor topology.
  3. Assign the remaining file-scanning tasks using the same consistent hashing policy on the file name hash and the executor topology, but with N tolerance. These tasks will not trigger data caching.
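
The three rounds above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `Task`, `Assignment`, `Ring`, `pick`, and `assign` names are hypothetical, the hash ring is a plain `BTreeMap` with virtual nodes, and "N tolerance" is assumed to mean "walk past up to N executors without free slots on the ring".

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::{BTreeMap, HashMap};
use std::hash::{Hash, Hasher};

// Hypothetical types for illustration only; not Ballista's real structs.
#[derive(Debug, Clone, PartialEq)]
struct Task { id: usize, scan_file: Option<String> }

#[derive(Debug, Clone, PartialEq)]
struct Assignment { task_id: usize, executor: String, cache_data: bool }

fn hash_of<T: Hash>(value: &T) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}

/// A simple consistent hash ring with virtual nodes (replicas) per executor.
struct Ring { nodes: BTreeMap<u64, String> }

impl Ring {
    fn new(executors: &[&str], replicas: usize) -> Self {
        let mut nodes = BTreeMap::new();
        for e in executors {
            for r in 0..replicas {
                nodes.insert(hash_of(&(*e, r)), e.to_string());
            }
        }
        Ring { nodes }
    }

    /// Distinct executors in ring order, starting at the key's position.
    fn candidates(&self, key: &str) -> Vec<String> {
        let start = hash_of(&key);
        let mut seen: Vec<String> = Vec::new();
        for (_, e) in self.nodes.range(start..).chain(self.nodes.range(..start)) {
            if !seen.contains(e) {
                seen.push(e.clone());
            }
        }
        seen
    }
}

/// Take one free slot on `executor`; returns false if it has none.
fn take_slot(free: &mut HashMap<String, usize>, executor: &str) -> bool {
    match free.get_mut(executor) {
        Some(n) if *n > 0 => { *n -= 1; true }
        _ => false,
    }
}

/// First executor with a free slot among the first `tolerance + 1` ring candidates.
fn pick(ring: &Ring, free: &mut HashMap<String, usize>, file: &str, tolerance: usize) -> Option<String> {
    ring.candidates(file).into_iter().take(tolerance + 1).find(|e| take_slot(free, e))
}

fn assign(tasks: &[Task], executors: &[&str], slots_each: usize, tolerance: usize) -> Vec<Assignment> {
    let ring = Ring::new(executors, 16);
    let mut free: HashMap<String, usize> =
        executors.iter().map(|e| (e.to_string(), slots_each)).collect();
    let mut out = Vec::new();

    // Round 1: tasks that scan no files go round-robin over the executors.
    let mut next = 0;
    for t in tasks.iter().filter(|t| t.scan_file.is_none()) {
        for _ in 0..executors.len() {
            let e = executors[next % executors.len()];
            next += 1;
            if take_slot(&mut free, e) {
                out.push(Assignment { task_id: t.id, executor: e.to_string(), cache_data: false });
                break;
            }
        }
    }

    // Round 2: file-scanning tasks go to the ring owner of the file name (tolerance 0).
    let mut leftover = Vec::new();
    for t in tasks {
        if let Some(file) = &t.scan_file {
            match pick(&ring, &mut free, file, 0) {
                Some(e) => out.push(Assignment { task_id: t.id, executor: e, cache_data: true }),
                None => leftover.push((t.id, file)),
            }
        }
    }

    // Round 3: retry the rest with N tolerance; these do not trigger caching.
    for (task_id, file) in leftover {
        if let Some(e) = pick(&ring, &mut free, file, tolerance) {
            out.push(Assignment { task_id, executor: e, cache_data: false });
        }
    }
    out
}
```

In this sketch, calling `assign(&tasks, &["e1", "e2", "e3"], 2, 2)` sends every task that scans the same file to the same executor while that executor has free slots, so its cache can be reused. Only the overflow spills onto neighbouring executors on the ring, and it is marked with `cache_data: false`.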

Are there any user-facing changes?


yahoNanJing commented Jun 30, 2023

Hi @collimarco, with this PR the data cache feature becomes usable as a whole. If this is urgent for you, you can try running this PR. Let's consider merging it after #830 is merged.

@yahoNanJing yahoNanJing marked this pull request as ready for review July 19, 2023 02:27
@yahoNanJing yahoNanJing merged commit ba4d9d3 into apache:main Jul 20, 2023
16 checks passed
r4ntix added a commit to r4ntix/arrow-ballista that referenced this pull request Aug 9, 2023
* master: (67 commits)
  Update to DataFusion 28 (apache#858)
  Update hdfs requirement from 0.1.1 to 0.1.4 (apache#856)
  Bump word-wrap from 1.2.3 to 1.2.4 in /ballista/scheduler/ui (apache#849)
  Update hashbrown requirement from 0.13 to 0.14 (apache#846)
  Update etcd-client requirement from 0.10 to 0.11 (apache#845)
  Update itertools requirement from 0.10 to 0.11 (apache#844)
  Update tonic requirement from 0.8 to 0.9 (apache#733)
  Implement 3-phase consistent hash based task assignment policy (apache#833)
  Add ConsistentHash for node topology management (apache#830)
  Introduce CachedBasedObjectStoreRegistry to use data source cache transparently (apache#827)
  Fix cargo clippy for latest rust version (apache#848)
  Introduce a cache crate supporting concurrent cache value loading based on the cache_system crate of influxdb_iox and the linked_hash_map mod from hashlink (apache#825)
  Update libloading requirement from 0.7.3 to 0.8.0 (apache#761)
  Update dirs requirement from 4.0.0 to 5.0.1 (apache#767)
  Update flatbuffers requirement from 22.9.29 to 23.5.26 (apache#801)
  Bump tough-cookie from 4.1.2 to 4.1.3 in /ballista/scheduler/ui (apache#840)
  Bump actions/labeler from 4.1.0 to 4.3.0 (apache#841)
  Bump semver from 5.7.1 to 5.7.2 in /ballista/scheduler/ui (apache#843)
  Reduce the number of calls to create_logical_plan (apache#842)
  Upgrade DataFusion to 27.0.0 (apache#834)
  ...

# Conflicts:
#	ballista/scheduler/src/state/mod.rs
#	ballista/scheduler/src/state/session_manager.rs