
[BUG] SerializeConcatHostBuffersDeserializeBatch may have thread issues #1179

Closed
jlowe opened this issue Nov 20, 2020 · 1 comment · Fixed by #1264
Labels: bug Something isn't working · P0 Must have for release

Comments

jlowe (Member) commented Nov 20, 2020

#1174 fixed an issue initializing the columnar batch row count on a degenerate, rows-only batch, but the fact that the fix was needed implies threads are accessing the batch while it is still being created. The concern is that multiple threads could try to create a normal GPU batch simultaneously and end up leaking every batch except the last one recorded.

It's also disconcerting that #1165 needed to be reverted, so we need to understand the threading implications here and make sure the serializer behaves correctly under access from multiple threads.
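
To make the concern concrete, below is a minimal sketch of the suspected failure mode. The names (`GpuBatch`, `UnsafeBatchCache`, `build`) are hypothetical stand-ins rather than the plugin's actual types: an unsynchronized check-then-act lets two threads both build a batch, and the batch that gets overwritten is never closed.

```scala
// Hypothetical illustration of the suspected race; not the plugin's real code.
trait GpuBatch extends AutoCloseable

class UnsafeBatchCache(build: () => GpuBatch) {
  @volatile private var cached: GpuBatch = null

  // Racy check-then-act: two threads can both observe cached == null and
  // both call build(). The batch assigned first is overwritten by the
  // second and never closed, so its device memory leaks.
  def get(): GpuBatch = {
    if (cached == null) {
      cached = build()
    }
    cached
  }
}
```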

@jlowe added the "bug Something isn't working" and "? - Needs Triage Need team to review and classify" labels on Nov 20, 2020
@sameerz added the "P0 Must have for release" label and removed the "? - Needs Triage Need team to review and classify" label on Nov 24, 2020
andygrove (Contributor) commented Dec 3, 2020

Here is the stack trace we saw in the integration tests that led us to revert the previous PR.

```
20/11/20 03:30:09 WARN TaskSetManager: Lost task 0.0 in stage 5.0 (TID 21, 10.233.104.56, executor 0): java.io.StreamCorruptedException: unexpected block data
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1663)
	at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2365)
	at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2289)
	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2147)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1646)
	at java.io.ObjectInputStream.readObject(ObjectInputStream.java:482)
	at java.io.ObjectInputStream.readObject(ObjectInputStream.java:440)
	at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
	at org.apache.spark.broadcast.TorrentBroadcast$.$anonfun$unBlockifyObject$4(TorrentBroadcast.scala:328)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
	at org.apache.spark.broadcast.TorrentBroadcast$.unBlockifyObject(TorrentBroadcast.scala:330)
	at org.apache.spark.broadcast.TorrentBroadcast.$anonfun$readBroadcastBlock$4(TorrentBroadcast.scala:249)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.broadcast.TorrentBroadcast.$anonfun$readBroadcastBlock$2(TorrentBroadcast.scala:223)
	at org.apache.spark.util.KeyLock.withLock(KeyLock.scala:64)
	at org.apache.spark.broadcast.TorrentBroadcast.$anonfun$readBroadcastBlock$1(TorrentBroadcast.scala:218)
	at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1343)
	at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:218)
	at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:103)
	at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
	at com.nvidia.spark.rapids.shims.spark301.GpuBroadcastHashJoinExec.builtTable$lzycompute$1(GpuBroadcastHashJoinExec.scala:139)
	at com.nvidia.spark.rapids.shims.spark301.GpuBroadcastHashJoinExec.builtTable$1(GpuBroadcastHashJoinExec.scala:137)
	at com.nvidia.spark.rapids.shims.spark301.GpuBroadcastHashJoinExec.$anonfun$doExecuteColumnar$5(GpuBroadcastHashJoinExec.scala:154)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2(RDD.scala:837)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2$adapted(RDD.scala:837)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
	at org.apache.spark.scheduler.Task.run(Task.scala:127)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
```
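
The failure surfaces while the executor deserializes the broadcast table for GpuBroadcastHashJoinExec, so the serializer has to stay correct when several tasks on the same executor materialize the batch at once. One possible shape for a race-free publish, reusing the hypothetical names from the sketch above (an illustration only, not the actual change made for this issue), is a compare-and-set where the losing thread closes its batch instead of leaking it:

```scala
import java.util.concurrent.atomic.AtomicReference

// Sketch of one way to publish a batch race-free. Hypothetical names,
// for illustration; GpuBatch is the stand-in trait defined earlier.
class SafeBatchCache(build: () => GpuBatch) {
  private val ref = new AtomicReference[GpuBatch]()

  def get(): GpuBatch = {
    val existing = ref.get()
    if (existing != null) {
      existing
    } else {
      val fresh = build()
      if (ref.compareAndSet(null, fresh)) {
        fresh // won the race; this batch is now the published one
      } else {
        fresh.close() // lost the race; release instead of leaking
        ref.get()
      }
    }
  }
}
```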
