WIP: Low shuffle merge implementation. #10753

liurenjie1024 · 2024-04-30T10:09:42Z

No description provided.

jlowe · 2024-05-01T19:01:49Z

...common/src/main/delta-io/scala/com/nvidia/spark/rapids/delta/GpuDeltaParquetFileFormat.scala

The copyright year needs to be updated to 2024. This applies to the other files as well.

jlowe · 2024-05-01T19:05:41Z

...elta-24x/src/main/scala/org/apache/spark/sql/delta/rapids/delta24x/GpuMergeIntoCommand.scala

+                              @JsonDeserialize(contentAs = classOf[java.lang.Long])
+                              rows: Option[Long] = None,
+                              @JsonDeserialize(contentAs = classOf[java.lang.Long])
+                              files: Option[Long] = None,
+                              @JsonDeserialize(contentAs = classOf[java.lang.Long])
+                              bytes: Option[Long] = None,
+                              @JsonDeserialize(contentAs = classOf[java.lang.Long])
+                              partitions: Option[Long] = None)


Why was this reformatted? The new formatting doesn't match the project coding style, and the same applies elsewhere in this file. Reformatting the file makes it much harder to see the actual changes, and it is unrelated to this PR, so please revert it.

jlowe · 2024-05-01T19:08:45Z

...elta-24x/src/main/scala/org/apache/spark/sql/delta/rapids/delta24x/GpuMergeIntoCommand.scala

-            val newWrittenFiles = withStatusCode("DELTA", "Writing merged data") {
-              writeAllChanges(spark, deltaTxn, filesToRewrite)
-            }
+            val newWrittenFiles = lowShuffleMerge(spark, deltaTxn, filesToRewrite)


There should be a config to control whether low shuffle merge is performed or not, especially on Delta Lake versions that do not have a low shuffle merge implementation (like OSS Delta Lake).
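A minimal sketch of what such a gate could look like, written in plain Scala so it stands alone. Everything here is an assumption for illustration: the config key name, the `MergeStrategy` type, the `chooseStrategy` helper, and the use of a `Map[String, String]` in place of the real Spark/RAPIDS configuration classes. The default is off, so low shuffle merge runs only when explicitly enabled and the Delta Lake version supports it.

```scala
object MergeConfSketch {
  // Hypothetical config key, used only for this sketch; the real name
  // would be decided in the PR.
  val LowShuffleMergeEnabledKey = "spark.rapids.sql.delta.lowShuffleMerge.enabled"

  // Hypothetical strategy type; not part of the PR.
  sealed trait MergeStrategy
  case object LowShuffle extends MergeStrategy
  case object FullRewrite extends MergeStrategy

  // `conf` stands in for the session configuration. `supported` stands in
  // for a per-Delta-version capability check.
  def chooseStrategy(conf: Map[String, String], supported: Boolean): MergeStrategy = {
    // A missing key means disabled: the feature is opt-in in this sketch.
    val enabled = conf.get(LowShuffleMergeEnabledKey)
      .exists(_.trim.equalsIgnoreCase("true"))
    if (enabled && supported) LowShuffle else FullRewrite
  }
}
```

With a gate like this, the call site in the diff above would fall back to the existing `writeAllChanges` path whenever the strategy is `FullRewrite`.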

jlowe · 2024-05-01T19:24:08Z

...elta-24x/src/main/scala/org/apache/spark/sql/delta/rapids/delta24x/GpuMergeIntoCommand.scala

+  private def lowShuffleMerge(spark: SparkSession,
+                              deltaTxn: OptimisticTransaction,
+                              filesToRewrite: Seq[AddFile]): Seq[FileAction] = {
+    val executor = new LowShuffleMergeExecutor(spark, deltaTxn, filesToRewrite, this)


We may want to consider using a GpuLowShuffleMergeIntoCommand separate from GpuMergeIntoCommand (both leveraging common code as appropriate) so this can appear in the SQL UI and make it clear when we're using low shuffle vs. not.
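One possible shape for that split, as a rough sketch in plain Scala without Spark dependencies. The trait, the class names ending in `Sketch`, the `findTouchedFiles` helper, and the string-based `run` result are all hypothetical stand-ins; only the two command names come from this thread. The point is that each command carries its own node name while the shared logic lives in a common base.

```scala
object MergeCommandSketch {
  // Hypothetical common base holding the logic both commands would share.
  trait MergeCommandBase {
    // Name a UI could display for this command.
    def nodeName: String

    // Stand-in for shared work such as locating the files to rewrite.
    protected def findTouchedFiles(candidates: Seq[String]): Seq[String] =
      candidates.distinct

    def run(candidates: Seq[String]): String
  }

  // Existing behaviour: rewrite every touched file in full.
  final class FullRewriteMergeSketch extends MergeCommandBase {
    val nodeName: String = "GpuMergeIntoCommand"
    def run(candidates: Seq[String]): String =
      s"$nodeName rewrote ${findTouchedFiles(candidates).size} file(s)"
  }

  // Separate command for the low shuffle path, with a distinct name.
  final class LowShuffleMergeSketch extends MergeCommandBase {
    val nodeName: String = "GpuLowShuffleMergeIntoCommand"
    def run(candidates: Seq[String]): String =
      s"$nodeName processed ${findTouchedFiles(candidates).size} file(s)"
  }
}
```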

razajafri

Please revert the reformatting so that reviewers can see the actual changes.

liurenjie1024 · 2024-05-09T06:58:35Z

Closed by #10786

liurenjie1024 marked this pull request as draft April 30, 2024 10:14

feat: Introduce low shuffle merge.

982a926

liurenjie1024 force-pushed the renjie/lowshufflemerge branch from 7e1c587 to 982a926 Compare April 30, 2024 10:26

sameerz added the performance A performance related task/issue label May 1, 2024

jlowe reviewed May 1, 2024


razajafri reviewed May 2, 2024


liurenjie1024 closed this May 9, 2024
