I inadvertently introduced a regression in #8936 (23.08) that can cause task failures.
The issue occurs at spill time. @revans2 reported seeing an exception while testing the following code:
import org.apache.spark.sql.tests.datagen._
val df = new DBGen().addTable("t", "a String, b long, c long", 1000000000).toDF(spark)
df.selectExpr("min(a)", "min(b + c)").show()
He can reproduce the following exception; it does not happen on every run, but it is repeatable:
java.util.NoSuchElementException: Cannot locate buffers associated with ID: TempSpillBufferId(849,temp_local_9ea98814-78c5-4157-912b-735824c3a5a5)
at com.nvidia.spark.rapids.RapidsBufferCatalog.$anonfun$acquireBuffer$1(RapidsBufferCatalog.scala:409)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158)
at com.nvidia.spark.rapids.RapidsBufferCatalog.acquireBuffer(RapidsBufferCatalog.scala:405)
at com.nvidia.spark.rapids.RapidsBufferCatalog.com$nvidia$spark$rapids$RapidsBufferCatalog$$updateUnderlyingRapidsBuffer(RapidsBufferCatalog.scala:387)
at com.nvidia.spark.rapids.RapidsBufferCatalog.trackNewHandle(RapidsBufferCatalog.scala:148)
at com.nvidia.spark.rapids.RapidsBufferCatalog.makeNewHandle(RapidsBufferCatalog.scala:129)
at com.nvidia.spark.rapids.RapidsBufferCatalog.$anonfun$addBuffer$1(RapidsBufferCatalog.scala:231)
at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
at com.nvidia.spark.rapids.RapidsBufferCatalog.addBuffer(RapidsBufferCatalog.scala:230)
at com.nvidia.spark.rapids.RapidsBufferCatalog$.addBuffer(RapidsBufferCatalog.scala:884)
at com.nvidia.spark.rapids.SpillableColumnarBatch$.$anonfun$addBatch$1(SpillableColumnarBatch.scala:211)
at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
at com.nvidia.spark.rapids.SpillableColumnarBatch$.addBatch(SpillableColumnarBatch.scala:196)
at com.nvidia.spark.rapids.SpillableColumnarBatch$.apply(SpillableColumnarBatch.scala:149)
at com.nvidia.spark.rapids.AggHelper.preProcess(aggregate.scala:290)
@revans2 and I have been going back and forth testing this, and it finally appears related to #8936. I believe the problem is a RapidsBuffer being aliased at the same time it is being freed. Before the PR mentioned, both operations happened under the RapidsBufferCatalog lock; that is no longer the case, so a buffer can be aliased and then freed immediately after (nothing is actually leaked, but the code can fail with task failures). A patch that frees the buffer while holding the catalog lock fixes the issue for @revans2, who can consistently reproduce it. I was able to reproduce it as well and will continue testing.
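To illustrate the race described above, here is a minimal sketch in plain Java (all names are hypothetical, not the actual plugin code): an "alias" looks up a buffer and then registers a second handle to it, while a concurrent "free" removes the buffer from the catalog. If the free interleaves between the lookup and the registration, the new handle refers to an id the catalog no longer knows, which is the shape of the NoSuchElementException above. The fix mirrors the patch: perform the free while holding the same catalog lock as the alias.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the aliasing/free race, not the real RapidsBufferCatalog.
class BufferCatalog {
    private final Map<String, String> buffers = new ConcurrentHashMap<>();
    private final Object catalogLock = new Object();

    void add(String id, String buffer) { buffers.put(id, buffer); }

    // Racy version: the lookup and the handle registration are not atomic
    // with respect to free, so a concurrent freeRacy(id) can land between them.
    String aliasRacy(String id) {
        String buf = buffers.get(id);   // step 1: look up the buffer
        // <-- a concurrent freeRacy(id) here leaves the new handle dangling
        if (buf == null)
            throw new java.util.NoSuchElementException(
                "Cannot locate buffers associated with ID: " + id);
        return buf;                     // step 2: register the new handle
    }

    void freeRacy(String id) { buffers.remove(id); }

    // Fixed version: free holds the same lock as alias, so the
    // lookup-then-register sequence can never interleave with a free.
    String aliasLocked(String id) {
        synchronized (catalogLock) {
            String buf = buffers.get(id);
            if (buf == null)
                throw new java.util.NoSuchElementException(
                    "Cannot locate buffers associated with ID: " + id);
            return buf;
        }
    }

    void freeLocked(String id) {
        synchronized (catalogLock) { buffers.remove(id); }
    }
}
```

The key design point is the same as in the patch: it is not enough for the underlying map to be thread-safe; the multi-step alias sequence and the free must be serialized against each other by one lock.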
We discussed whether this could cause data corruption. We do not believe so at this time; it should only cause task failures under heavy spill, when a buffer is aliased while that same buffer is being spilled. Note that the NDS runs I have do not show this, but something about his example triggers it.