Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix GpuSemaphore to support multiple threads per task [databricks] #9501

Merged
merged 2 commits into from
Oct 23, 2023

Conversation

revans2
Copy link
Collaborator

@revans2 revans2 commented Oct 20, 2023

This fixes #9485

Signed-off-by: Robert (Bobby) Evans <bobby@apache.org>
@revans2
Copy link
Collaborator Author

revans2 commented Oct 20, 2023

build

if (done) {
moveToActive(t)
}
shouldBlockOnSemaphore = t == blockedThreads.peek
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Took me a little bit to understand why we need to check the last item in blockedThreads but I think I get it: there's a race to after adding to blockedThreads and calling blockUntilReady, so you want to have all the "winners" block while the last thread to get added to blockedThreads is the one that will flag the semaphore as acquired and notifies the blocked threads.

Should shouldBlockOnSemaphore be: shouldNotBlockOnSemaphore? If true as it is (we are the last thread inserted to blockedThreads) we won't block, instead we are the notifier thread, if I don't misunderstand.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't this just saying this thread is at the head of the queue so if it doesn't have semaphore already this one should be blocked on it and try ot acquire it? If they aren't the head of the queue then wait()

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I need to add in some comments to explain what is happening.

Comment on lines 205 to 207
// So it is really more of a sanity test. In reality there should be no threads
// that are blocked, but one might have been added here when the semaphore was held
// and now it is being released so it will block.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not too clear on this comment. Could you provide an example where you think this could happen?

It seems the notify logic above is doing the right things. I was trying to come up with a scenario to match the concern, but I am not sure I have it.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it could happen. That is why it is an error.

}
val taskInfo = tasks.computeIfAbsent(taskAttemptId, key => {
onTaskCompletion(context, completeTask)
new TaskInfo(key)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we rename key to be taskId

if (done) {
moveToActive(t)
}
shouldBlockOnSemaphore = t == blockedThreads.peek
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't this just saying this thread is at the head of the queue so if it doesn't have semaphore already this one should be blocked on it and try ot acquire it? If they aren't the head of the queue then wait()

@tgravescs
Copy link
Collaborator

all my minor comments about comments or name changes can be ignored in this pr to not delay 23.10.

@revans2
Copy link
Collaborator Author

revans2 commented Oct 20, 2023

build

@revans2
Copy link
Collaborator Author

revans2 commented Oct 20, 2023

@abellina and @tgravescs I have updated the PR to have more comments about what it is doing and cleaned up some of the code just a little. Please take another look.

@sameerz sameerz added the bug Something isn't working label Oct 23, 2023
@revans2 revans2 linked an issue Oct 23, 2023 that may be closed by this pull request
@revans2
Copy link
Collaborator Author

revans2 commented Oct 23, 2023

The link check failure is not related to this PR

@abellina abellina merged commit f4c986b into NVIDIA:branch-23.10 Oct 23, 2023
28 of 29 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] GpuSemaphore can deadlock if there are multiple threads per task
4 participants