
Remove ExecutorReservation and change the task assignment philosophy from executor first to task first #823

Merged: 2 commits into apache:main, Jul 13, 2023

Conversation

@yahoNanJing (Contributor) commented Jun 28, 2023

Which issue does this PR close?

Closes #708.

Rationale for this change

What changes are included in this PR?

  • The ExecutorReservation has been removed, and the task assignment philosophy is changed from executor first to task first; a simplified sketch of task-first binding follows this list. This change brings several benefits:

    • It is necessary for consistency-based, data-cache-aware task scheduling, as mentioned in Change the task assignment philosophy from executor first to task first #708.
    • It avoids overly frequent task scheduling: previously a single task update could trigger a scheduling round, which causes heavy state contention and slows down overall throughput. Based on this PR, we can make further improvements, such as batch event processing, to improve efficiency. Sample code is here. Running with that branch, throughput improves by around 50%, and I will later raise PRs step by step to contribute the branch to the main branch.
  • The metric for the pending task number is replaced by a pending job number and a running job number, since accurately computing the pending task count would be somewhat heavy and is unnecessary.
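
To make the philosophy concrete, here is a minimal sketch (not the actual Ballista code) of what task-first binding can look like: the scheduler walks the runnable tasks across all jobs and picks an executor with a free slot for each task, instead of reserving executors up front and then looking for tasks to fill them. The types and the `bind_tasks` function are hypothetical simplifications.

```rust
use std::collections::HashMap;

/// Hypothetical simplified task descriptor; the real Ballista structures are richer.
struct SchedulableTask {
    job_id: String,
    task_id: usize,
}

/// Free slot count per executor id.
type ExecutorSlots = HashMap<String, usize>;

/// Task-first binding: iterate over runnable tasks and assign each to an
/// executor with a free slot, rather than reserving executors up front.
fn bind_tasks(
    tasks: Vec<SchedulableTask>,
    slots: &mut ExecutorSlots,
) -> Vec<(SchedulableTask, String)> {
    let mut assignments = Vec::new();
    for task in tasks {
        // Any executor with a free slot can take the task; a cache-aware
        // policy would instead pick the executor by consistent hashing.
        if let Some((executor_id, free)) = slots.iter_mut().find(|(_, free)| **free > 0) {
            *free -= 1;
            assignments.push((task, executor_id.clone()));
        } else {
            // No free slots anywhere in the cluster; remaining tasks stay pending.
            break;
        }
    }
    assignments
}
```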

Are there any user-facing changes?

@yahoNanJing (Contributor, Author)

Hi @thinkharderdev and @Dandandan, could you help review this PR?

@collimarco

+1 for merging this, since it enables the implementation of this major feature: #645

@thinkharderdev (Contributor) left a comment


Had some time to start reviewing today. I will try to add some more specific comments inline, but after reviewing the overall design I have some thoughts, in no particular order:

This seems sensible. I'll admit I didn't quite understand what a "task first" philosophy would consist of when the issue was first raised, but if I understand this PR correctly, it just means that we always return task slots to the pool when a task updates and then schedule tasks in batches, binding any available task slot to any available schedulable task.

The current work is focused on enabling caching on the executors, but I believe this approach would also make streaming execution (or something like bubble execution) more natural, since you have the context of all running jobs and the entire executor topology when making scheduling decisions.

I'm still a little confused as to why this is required to enable caching. If we want to do something like consistent hashing to try to assign certain map tasks to certain executors, that seems like something we could do in the current model. You've obviously thought about this much more than I have, so I'm sure I'm missing something, but I'm not sure what :)

IIRC you are using a single scheduler with all state in memory. In that case I don't think this approach has much downside, but when running multiple active schedulers with shared state for the task slots, I'm worried it will have some negative consequences:

  1. It may cause more contention on the global state, since you would have to hold a global lock on the state while you bind tasks. We currently use Redis for the task slots, which lets us reserve/free slots atomically without locks (using a bit of Lua scripting) and gives us a very cheap way to maintain a shared view of task slot reservations across multiple schedulers. I think you could effectively do the same thing in this model, but you would have to first inspect all active jobs to figure out how many tasks you have available, then reserve the slots (or as many as you can, up to that amount), and then bind the tasks. That is, it would effectively reduce to what we are doing now. A sketch of this lock-free reservation pattern follows this list.

  2. The original goal of ExecutorReservation was to minimize contention on the task slot state. When a task slot is freed, the curator scheduler still "owns" it and can re-schedule onto it without returning it to the global pool. In this model, whenever we get task updates we have to return all slots to the pool and then immediately go back and request them again.
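
For context on point 1, here is a minimal sketch (assuming the `redis` crate; the key name, the script, and the `reserve_slots` function are illustrative, not the actual scheduler code) of lock-free slot reservation via a small Lua script:

```rust
use redis::Script;

/// Atomically claim up to `requested` slots from a shared counter.
/// The Lua script runs atomically inside Redis, so no external lock is needed.
fn reserve_slots(con: &mut redis::Connection, requested: u64) -> redis::RedisResult<u64> {
    let script = Script::new(
        r#"
        local available = tonumber(redis.call('GET', KEYS[1]) or '0')
        local granted = math.min(available, tonumber(ARGV[1]))
        if granted > 0 then
            redis.call('DECRBY', KEYS[1], granted)
        end
        return granted
        "#,
    );
    // Returns how many slots were actually granted (possibly fewer than requested).
    script.key("free_task_slots").arg(requested).invoke(con)
}
```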

@yahoNanJing (Contributor, Author) commented Jul 3, 2023

Thanks @thinkharderdev for your comments.

I'm still a little confused as to why this is required to enable caching.

For consistent-hashing-based task assignment, we should assign tasks based on the scan files of the task, if there are any; the details are described in #833. This means it's necessary to assign a specific executor to a task rather than assign a random task to an executor. To achieve good data-cache-aware task scheduling, the scheduler needs a global view of the cluster's executor slots.
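
To illustrate the idea (a toy sketch only, not the design in #833; `build_ring` and `pick_executor` are hypothetical): hash each executor onto a ring, then route a task by hashing its scan file, so tasks reading the same file keep landing on the same executor and can reuse its local cache.

```rust
use std::collections::BTreeMap;
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hash any value to a point on the ring.
fn hash64<T: Hash>(value: &T) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}

/// Build a simple consistent-hash ring; each executor gets `replicas` virtual nodes.
fn build_ring(executors: &[String], replicas: u32) -> BTreeMap<u64, String> {
    let mut ring = BTreeMap::new();
    for exec in executors {
        for i in 0..replicas {
            ring.insert(hash64(&format!("{exec}:{i}")), exec.clone());
        }
    }
    ring
}

/// Route a task by its scan file: the first ring point at or after the file's
/// hash (wrapping around) decides which executor runs the task.
fn pick_executor<'a>(ring: &'a BTreeMap<u64, String>, scan_file: &str) -> Option<&'a String> {
    let h = hash64(&scan_file);
    ring.range(h..)
        .next()
        .or_else(|| ring.iter().next())
        .map(|(_, exec)| exec)
}
```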

The original goal of the ExecutorReservation was to minimize contention on the task slots state.

I totally understand the purpose of ExecutorReservation. However, in the current implementation it actually does not reduce the contention much. https://github.com/apache/arrow-ballista/blob/b65464e4b73590470fa69aad5b6954300ad243a0/ballista/scheduler/src/state/mod.rs#L190-L228

As the above code shows, if there are still some pending tasks, it will still go on to invoke reserve_slots.

To reduce the resource contention and lock contention, I'll raise another PR on top of this one to refactor the event processing and introduce batch event processing, for example combining 10 task status update events into one so that only one round of resource contention is involved. Sample code can be found here. With this new implementation, throughput improves by around 50% in our load testing.
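
As a rough sketch of the batching idea (the event type and `merge_updates` function are hypothetical, not the code in the linked branch): drain several queued task status updates and merge them per executor, so the shared slot state is touched once per batch instead of once per event.

```rust
use std::collections::HashMap;
use std::sync::mpsc::Receiver;

/// Hypothetical simplified event carrying the slots freed on one executor.
struct TaskStatusUpdate {
    executor_id: String,
    freed_slots: usize,
}

/// Drain up to `max_batch` queued updates and merge them per executor, so the
/// shared slot state is locked and updated once per batch instead of once per event.
fn merge_updates(rx: &Receiver<TaskStatusUpdate>, max_batch: usize) -> HashMap<String, usize> {
    let mut merged: HashMap<String, usize> = HashMap::new();
    let mut drained = 0;
    while drained < max_batch {
        match rx.try_recv() {
            Ok(update) => {
                *merged.entry(update.executor_id).or_insert(0) += update.freed_slots;
                drained += 1;
            }
            // The queue is empty (or disconnected) for now; apply what we have.
            Err(_) => break,
        }
    }
    merged
}
```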

Also, in a cluster with multiple active schedulers, the reservation mechanism may starve some schedulers.

@yahoNanJing (Contributor, Author)

Since this PR has been under review for half a month, if there are no objections I'll merge it in the next few days so that the data-cache-related PRs can move forward.

@yahoNanJing merged commit 5dfdfea into apache:main on Jul 13, 2023
19 checks passed