I inadvertently introduced a regression in #8936 (23.08) that can cause task failures.
The issue occurs at spill time. @revans2 reported seeing an exception while testing the following code:
import org.apache.spark.sql.tests.datagen._
val df = new DBGen().addTable("t", "a String, b long, c long", 1000000000).toDF(spark)
df.selectExpr("min(a)", "min(b + c)").show()
He can reproduce the following exception; it does not happen on every run, but it is repeatable:
java.util.NoSuchElementException: Cannot locate buffers associated with ID: TempSpillBufferId(849,temp_local_9ea98814-78c5-4157-912b-735824c3a5a5)
at com.nvidia.spark.rapids.RapidsBufferCatalog.$anonfun$acquireBuffer$1(RapidsBufferCatalog.scala:409)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158)
at com.nvidia.spark.rapids.RapidsBufferCatalog.acquireBuffer(RapidsBufferCatalog.scala:405)
at com.nvidia.spark.rapids.RapidsBufferCatalog.com$nvidia$spark$rapids$RapidsBufferCatalog$$updateUnderlyingRapidsBuffer(RapidsBufferCatalog.scala:387)
at com.nvidia.spark.rapids.RapidsBufferCatalog.trackNewHandle(RapidsBufferCatalog.scala:148)
at com.nvidia.spark.rapids.RapidsBufferCatalog.makeNewHandle(RapidsBufferCatalog.scala:129)
at com.nvidia.spark.rapids.RapidsBufferCatalog.$anonfun$addBuffer$1(RapidsBufferCatalog.scala:231)
at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
at com.nvidia.spark.rapids.RapidsBufferCatalog.addBuffer(RapidsBufferCatalog.scala:230)
at com.nvidia.spark.rapids.RapidsBufferCatalog$.addBuffer(RapidsBufferCatalog.scala:884)
at com.nvidia.spark.rapids.SpillableColumnarBatch$.$anonfun$addBatch$1(SpillableColumnarBatch.scala:211)
at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
at com.nvidia.spark.rapids.SpillableColumnarBatch$.addBatch(SpillableColumnarBatch.scala:196)
at com.nvidia.spark.rapids.SpillableColumnarBatch$.apply(SpillableColumnarBatch.scala:149)
at com.nvidia.spark.rapids.AggHelper.preProcess(aggregate.scala:290)
@revans2 and I have been going back and forth testing this, and it finally appears related to #8936. I believe the problem is a RapidsBuffer being aliased at the same time it is being freed. Before the PR mentioned, both operations happened under the RapidsBufferCatalog lock; that is no longer the case, so a buffer can be aliased and then freed immediately after (nothing is actually leaked, but the code can fail with task failures). A patch that frees the buffer while holding the catalog lock fixes the issue for @revans2, who can consistently reproduce it. I was able to reproduce it as well and will continue testing.
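To illustrate the race described above, here is a minimal sketch in plain Java (all names are hypothetical, not the actual plugin code): an "alias" looks up a buffer and then registers a second handle to it, while a concurrent "free" removes the buffer from the catalog. If the free interleaves between the lookup and the registration, the new handle refers to an id the catalog no longer knows, which is the shape of the NoSuchElementException above. The fix mirrors the patch: perform the free while holding the same catalog lock as the alias.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the aliasing/free race, not the real RapidsBufferCatalog.
class BufferCatalog {
    private final Map<String, String> buffers = new ConcurrentHashMap<>();
    private final Object catalogLock = new Object();

    void add(String id, String buffer) { buffers.put(id, buffer); }

    // Racy version: the lookup and the handle registration are not atomic
    // with respect to free, so a concurrent freeRacy(id) can land between them.
    String aliasRacy(String id) {
        String buf = buffers.get(id);   // step 1: look up the buffer
        // <-- a concurrent freeRacy(id) here leaves the new handle dangling
        if (buf == null)
            throw new java.util.NoSuchElementException(
                "Cannot locate buffers associated with ID: " + id);
        return buf;                     // step 2: register the new handle
    }

    void freeRacy(String id) { buffers.remove(id); }

    // Fixed version: free holds the same lock as alias, so the
    // lookup-then-register sequence can never interleave with a free.
    String aliasLocked(String id) {
        synchronized (catalogLock) {
            String buf = buffers.get(id);
            if (buf == null)
                throw new java.util.NoSuchElementException(
                    "Cannot locate buffers associated with ID: " + id);
            return buf;
        }
    }

    void freeLocked(String id) {
        synchronized (catalogLock) { buffers.remove(id); }
    }
}
```

The key design point is the same as in the patch: it is not enough for the underlying map to be thread-safe; the multi-step alias sequence and the free must be serialized against each other by one lock.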
We discussed whether this could cause data corruption. We do not believe so at this time; it should only cause task failures under heavy spill, when a buffer is aliased while that same buffer is being spilled. Note that the NDS runs I have do not show this, but something about his example triggers it.