-
Notifications
You must be signed in to change notification settings - Fork 232
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix GpuSemaphore to support multiple threads per task [databricks] #9501
Conversation
Signed-off-by: Robert (Bobby) Evans <bobby@apache.org>
build |
if (done) { | ||
moveToActive(t) | ||
} | ||
shouldBlockOnSemaphore = t == blockedThreads.peek |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Took me a little bit to understand why we need to check the last item in blockedThreads
but I think I get it: there's a race to after adding to blockedThreads
and calling blockUntilReady
, so you want to have all the "winners" block while the last thread to get added to blockedThreads is the one that will flag the semaphore as acquired and notifies the blocked threads.
Should shouldBlockOnSemaphore
be: shouldNotBlockOnSemaphore
? If true as it is (we are the last thread inserted to blockedThreads
) we won't block, instead we are the notifier thread, if I don't misunderstand.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isn't this just saying this thread is at the head of the queue so if it doesn't have semaphore already this one should be blocked on it and try ot acquire it? If they aren't the head of the queue then wait()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I need to add in some comments to explain what is happening.
// So it is really more of a sanity test. In reality there should be no threads | ||
// that are blocked, but one might have been added here when the semaphore was held | ||
// and now it is being released so it will block. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not too clear on this comment. Could you provide an example where you think this could happen?
It seems the notify logic above is doing the right things. I was trying to come up with a scenario to match the concern, but I am not sure I have it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it could happen. That is why it is an error.
} | ||
val taskInfo = tasks.computeIfAbsent(taskAttemptId, key => { | ||
onTaskCompletion(context, completeTask) | ||
new TaskInfo(key) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we rename key to be taskId
if (done) { | ||
moveToActive(t) | ||
} | ||
shouldBlockOnSemaphore = t == blockedThreads.peek |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isn't this just saying this thread is at the head of the queue so if it doesn't have semaphore already this one should be blocked on it and try ot acquire it? If they aren't the head of the queue then wait()
sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuSemaphore.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuSemaphore.scala
Outdated
Show resolved
Hide resolved
all my minor comments about comments or name changes can be ignored in this pr to not delay 23.10. |
build |
@abellina and @tgravescs I have updated the PR to have more comments about what it is doing and cleaned up some of the code just a little. Please take another look. |
The link check failure is not related to this PR |
This fixes #9485