Figure out why `MapFromArrays` appears in the tests for hive parquet write #10948

firestarman · 2024-05-30T02:26:46Z

Describe the bug

PR #10912 introduces parquet support for GpuInsertIntoHiveTable, along with the relevant tests. In some of the tests on Databricks, the ProjectExec falls back to the CPU because there is no GPU version of the MapFromArrays expression.

We should find the root cause of why this expression appears only in these tests on Databricks.

firestarman · 2024-06-03T01:54:58Z

This also happens on Spark 351. See #10956

revans2 · 2024-06-17T16:32:16Z

Added back in needs triage because we really need to understand what is happening. If we cannot do something simple with DB like this, it is either a bug in our code or theirs, and we need to know which.

mattahrens · 2024-06-18T20:09:53Z

@firestarman this needs to be investigated to figure out the root cause given we'll have an unneeded fallback with this feature on Databricks.

firestarman · 2024-06-19T07:49:28Z

This MapFromArrays also appears in the tests on Spark 350+.

However, I think this is not a plugin bug, because Spark 350+ generates a different plan for the Hive-style write (INSERT OVERWRITE TABLE a_new_created_table SELECT * FROM another_table) than Spark 34x does.

Spark 341

*Exec <DataWritingCommandExec> will run on GPU
  *Output <InsertIntoHiveTable> will run on GPU
  *Exec <WriteFilesExec> will run on GPU
    ! <RDDScanExec> cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.execution.RDDScanExec
      @Expression <AttributeReference> _c0#30 could run on GPU

Spark 350 and 351

*Exec <DataWritingCommandExec> will run on GPU
  *Output <InsertIntoHiveTable> will run on GPU
  *Exec <WriteFilesExec> will run on GPU
    !Exec <ProjectExec> cannot run on GPU because not all expressions can be replaced
      @Expression <Alias> map_from_arrays(transform(map_keys(_c0#26), lambdafunction(lambda key#37, lambda key#37, false)), transform(map_values(_c0#26), lambdafunction(lambda value#39, lambda value#39, false))) AS _c0#41 could run on GPU
        ! <MapFromArrays> map_from_arrays(transform(map_keys(_c0#26), lambdafunction(lambda key#37, lambda key#37, false)), transform(map_values(_c0#26), lambdafunction(lambda value#39, lambda value#39, false))) cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.MapFromArrays
          @Expression <ArrayTransform> transform(map_keys(_c0#26), lambdafunction(lambda key#37, lambda key#37, false)) could run on GPU
            @Expression <MapKeys> map_keys(_c0#26) could run on GPU
              @Expression <AttributeReference> _c0#26 could run on GPU
            @Expression <LambdaFunction> lambdafunction(lambda key#37, lambda key#37, false) could run on GPU
              @Expression <NamedLambdaVariable> lambda key#37 could run on GPU
              @Expression <NamedLambdaVariable> lambda key#37 could run on GPU
          @Expression <ArrayTransform> transform(map_values(_c0#26), lambdafunction(lambda value#39, lambda value#39, false)) could run on GPU
            @Expression <MapValues> map_values(_c0#26) could run on GPU
              @Expression <AttributeReference> _c0#26 could run on GPU
            @Expression <LambdaFunction> lambdafunction(lambda value#39, lambda value#39, false) could run on GPU
              @Expression <NamedLambdaVariable> lambda value#39 could run on GPU
              @Expression <NamedLambdaVariable> lambda value#39 could run on GPU
      ! <RDDScanExec> cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.execution.RDDScanExec
        @Expression <AttributeReference> _c0#26 could run on GPU

But the CTAS command (CREATE TABLE new_table STORED AS PARQUET AS SELECT * FROM a_existing_table) still has the same plan tree as Spark 341.

*Exec <DataWritingCommandExec> will run on GPU
  *Output <InsertIntoHiveTable> will run on GPU
  *Exec <WriteFilesExec> will run on GPU
    ! <RDDScanExec> cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.execution.RDDScanExec
      @Expression <AttributeReference> _c0#0 could run on GPU
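
For readers of the Spark 350/351 plan above: both lambdas in the extra projection return their argument unchanged (`lambda key#37 -> lambda key#37`, `lambda value#39 -> lambda value#39`), so the expression appears to rebuild the map column from its own keys and values. Below is a minimal plain-Python model of one row. It is not Spark code: the helper names only mirror the SQL function names, a Python `dict` stands in for a map value, and null handling is simplified.

```python
# Plain-Python model of the projection in the Spark 350/351 plan:
#   map_from_arrays(transform(map_keys(c), key -> key),
#                   transform(map_values(c), value -> value))
# Hypothetical helpers for illustration only; none of this calls Spark.

def map_keys(m):
    return list(m.keys())

def map_values(m):
    return list(m.values())

def transform(arr, fn):
    return [fn(x) for x in arr]

def map_from_arrays(keys, values):
    # Simplified: a missing input yields a missing map.
    if keys is None or values is None:
        return None
    if len(keys) != len(values):
        raise ValueError("key and value arrays must have the same length")
    return dict(zip(keys, values))

def rebuilt(m):
    # Both lambdas are the identity, exactly as printed in the plan.
    return map_from_arrays(
        transform(map_keys(m), lambda key: key),
        transform(map_values(m), lambda value: value),
    )

row = {"a": 1, "b": 2}
assert rebuilt(row) == row  # the projection returns an equal map
```

Under this model the projection does no useful work when the types already match, which would explain why the fallback is a pure cost for these tests.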

revans2 · 2024-06-24T15:05:32Z

This appears to be coming from https://github.com/apache/spark/blob/fd86f85e181fc2dc0f50a096855acf83a6cc5d9c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TableOutputResolver.scala#L381-L421

It appears that https://issues.apache.org/jira/browse/SPARK-42151 (apache/spark#40308) introduced this.

So technically this is a regression, more accurately a performance regression, in that we could run the query fully on the GPU before, but now we cannot.

revans2 · 2024-06-24T15:10:27Z

@sameerz and @mattahrens we now know why the regression has happened and we need to decide what the next steps are. Implementing this is not too difficult. We mainly need to verify that the array lengths are the same everywhere and then pull out the data column from each of the arrays and turn them into a struct.
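
The approach described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the plugin's implementation: each array column is modeled as an offsets list plus a flat child data list, the function name `map_from_arrays_columnar` is hypothetical, and null rows and null keys are ignored.

```python
# Sketch of a columnar map_from_arrays, assuming a list column is modeled as
# (offsets, data) where row i spans data[offsets[i]:offsets[i + 1]].
# Hypothetical helper for illustration; nulls are not handled.

def map_from_arrays_columnar(key_offsets, key_data, value_offsets, value_data):
    if not key_offsets or len(key_offsets) != len(value_offsets):
        raise ValueError("key and value columns must have the same row count")

    # Step 1: verify that the array lengths match in every row.
    for row in range(len(key_offsets) - 1):
        key_len = key_offsets[row + 1] - key_offsets[row]
        value_len = value_offsets[row + 1] - value_offsets[row]
        if key_len != value_len:
            raise ValueError(f"row {row}: {key_len} keys but {value_len} values")

    # Step 2: pull out the flat data of each array column and pair the
    # entries up as (key, value) structs. Offsets are rebased to start at 0.
    keys = key_data[key_offsets[0]:key_offsets[-1]]
    values = value_data[value_offsets[0]:value_offsets[-1]]
    offsets = [o - key_offsets[0] for o in key_offsets]
    entries = list(zip(keys, values))
    return offsets, entries

# Two rows: {"a": 1, "b": 2} and {"c": 3}
offsets, entries = map_from_arrays_columnar(
    [0, 2, 3], ["a", "b", "c"],
    [0, 2, 3], [1, 2, 3],
)
assert offsets == [0, 2, 3]
assert entries == [("a", 1), ("b", 2), ("c", 3)]
```

Because the per-row lengths are equal, the key offsets can be reused as the offsets of the resulting map column, so no data needs to be copied row by row.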

firestarman added the bug, ? - Needs Triage, test, and improve labels and removed the bug and ? - Needs Triage labels May 30, 2024

This was referenced May 30, 2024

GpuInsertIntoHiveTable supports parquet format #10912

Merged

[BUG] hive_parquet_write_test.py: test_write_compressed_parquet_into_hive_table integration test failures #10956

Closed

firestarman mentioned this issue Jun 17, 2024

Support bucketing write for GPU #10957

Merged

revans2 added the ? - Needs Triage label Jun 17, 2024

mattahrens assigned firestarman Jun 18, 2024

mattahrens added the bug label and removed the improve, ? - Needs Triage, and test labels Jun 18, 2024

revans2 added the performance and ? - Needs Triage labels Jun 24, 2024

revans2 unassigned firestarman Jun 24, 2024

mattahrens removed the ? - Needs Triage label Jun 26, 2024

mattahrens assigned SurajAralihalli Jul 1, 2024

SurajAralihalli mentioned this issue Jul 9, 2024

Support MapFromArrays on GPU [databricks] #11163

Merged

SurajAralihalli closed this as completed in #11163 Jul 18, 2024
