
[BUG] Race condition while spilling and aliasing a RapidsBuffer (regression) #9082

Closed
abellina opened this issue Aug 18, 2023 · 0 comments · Fixed by #9084
Labels: bug (Something isn't working)
@abellina
Collaborator

I inadvertently introduced a regression in #8936 for 23.08 that can cause task failures.

The issue occurs at spill time. @revans2 reported seeing an exception while testing the following code (run from a Spark shell with the spark-rapids data generation library on the classpath):

import org.apache.spark.sql.tests.datagen._
val df = new DBGen().addTable("t", "a String, b long, c long", 1000000000).toDF(spark)
df.selectExpr("min(a)", "min(b + c)").show()

He can reproduce the following exception; it does not occur on every run, but it is repeatable:

java.util.NoSuchElementException: Cannot locate buffers associated with ID: TempSpillBufferId(849,temp_local_9ea98814-78c5-4157-912b-735824c3a5a5)
  at com.nvidia.spark.rapids.RapidsBufferCatalog.$anonfun$acquireBuffer$1(RapidsBufferCatalog.scala:409)
  at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158)
  at com.nvidia.spark.rapids.RapidsBufferCatalog.acquireBuffer(RapidsBufferCatalog.scala:405)
  at com.nvidia.spark.rapids.RapidsBufferCatalog.com$nvidia$spark$rapids$RapidsBufferCatalog$$updateUnderlyingRapidsBuffer(RapidsBufferCatalog.scala:387)
  at com.nvidia.spark.rapids.RapidsBufferCatalog.trackNewHandle(RapidsBufferCatalog.scala:148)
  at com.nvidia.spark.rapids.RapidsBufferCatalog.makeNewHandle(RapidsBufferCatalog.scala:129)
  at com.nvidia.spark.rapids.RapidsBufferCatalog.$anonfun$addBuffer$1(RapidsBufferCatalog.scala:231)
  at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
  at com.nvidia.spark.rapids.RapidsBufferCatalog.addBuffer(RapidsBufferCatalog.scala:230)
  at com.nvidia.spark.rapids.RapidsBufferCatalog$.addBuffer(RapidsBufferCatalog.scala:884)
  at com.nvidia.spark.rapids.SpillableColumnarBatch$.$anonfun$addBatch$1(SpillableColumnarBatch.scala:211)
  at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
  at com.nvidia.spark.rapids.SpillableColumnarBatch$.addBatch(SpillableColumnarBatch.scala:196)
  at com.nvidia.spark.rapids.SpillableColumnarBatch$.apply(SpillableColumnarBatch.scala:149)
  at com.nvidia.spark.rapids.AggHelper.preProcess(aggregate.scala:290)

@revans2 and I went back and forth testing this, and it finally appeared to be related to #8936. I believe the problem is a RapidsBuffer being aliased at the same time it is being freed. Before that PR, both operations happened under the RapidsBufferCatalog lock; that is no longer the case, so a buffer can be aliased and then freed immediately afterward (nothing is actually leaked, but the code can fail with task failures). A patch that frees the buffer while holding the catalog lock fixes the issue for @revans2, who can consistently reproduce it. I was able to reproduce it as well and will continue testing.
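The race described above can be modeled with a minimal sketch. Everything here (`MiniCatalog`, `add`, `alias`, `free`) is hypothetical and simplified, not the real spark-rapids API: aliasing is a lookup-then-register sequence, and if `free` is allowed to run between those steps, the registration fails with the `NoSuchElementException` seen above. Running `free` under the same catalog lock (as in the fix) closes that window.

```scala
// Hypothetical minimal model of the catalog race; all names are
// illustrative, not the real RapidsBufferCatalog API.
import scala.collection.mutable

class MiniCatalog {
  private val buffers = mutable.Map[Long, String]()

  def add(id: Long, buf: String): Unit = synchronized { buffers(id) = buf }

  // Aliasing registers another handle to an existing buffer. Pre-fix,
  // the lookup and the handle registration were not atomic with respect
  // to free, so free could remove the buffer between the two steps.
  def alias(id: Long): String = synchronized {
    buffers.getOrElse(id,
      throw new NoSuchElementException(
        s"Cannot locate buffers associated with ID: $id"))
  }

  // The fix: free also takes the catalog lock, so it can no longer
  // interleave with an in-flight alias.
  def free(id: Long): Unit = synchronized { buffers.remove(id) }
}

object MiniCatalogDemo extends App {
  val catalog = new MiniCatalog
  catalog.add(849L, "device-buffer")
  println(catalog.alias(849L)) // aliasing a live buffer succeeds
  catalog.free(849L)
  // Aliasing after free fails, which is the observed task failure mode.
  try catalog.alias(849L)
  catch { case e: NoSuchElementException => println(s"expected: ${e.getMessage}") }
}
```

With every mutation under one lock, alias-then-free can still happen in either order, but the lookup and registration inside `alias` can no longer be split by a concurrent `free`.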

We discussed whether this could cause data corruption. We do not think so at this time; it should only cause task failures under heavy spill when a buffer is aliased while that same buffer is being spilled. Note that my NDS runs do not show this, but something about his example triggers it.

@abellina abellina added bug Something isn't working ? - Needs Triage Need team to review and classify labels Aug 18, 2023
@abellina abellina self-assigned this Aug 18, 2023
@sameerz sameerz removed the ? - Needs Triage Need team to review and classify label Aug 22, 2023