
Fallback to CPU for Parquet reads with _databricks_internal columns [databricks] #6257

Merged (7 commits) on Aug 10, 2022

Conversation

andygrove
Contributor

When a Delta Lake MERGE operation runs on Databricks, the generated query relies on receiving valid data from computed columns such as _databricks_internal_edge_computed_column_row_index and _databricks_internal_edge_computed_column_skip_row during the Parquet read.

For example:

+- FileScan parquet [country#755,_databricks_internal_edge_computed_column_row_index#963L] Batched: true, DataFilters: [], Format: Parquet, Location: TahoeBatchFileIndex(1 paths)[file:/tmp/myth/delta/corpus], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<country:string,_databricks_internal_edge_computed_column_row_index:bigint>

These columns do not exist in the Parquet files themselves, so they cannot be produced when the files are read on the GPU; we need to fall back to the CPU in this case.

This PR is still WIP because there are no tests yet.

Signed-off-by: Andy Grove <andygrove@nvidia.com>
@andygrove andygrove added the bug Something isn't working label Aug 8, 2022
@andygrove andygrove added this to the Aug 8 - Aug 19 milestone Aug 8, 2022
@andygrove andygrove self-assigned this Aug 8, 2022
Comment on lines +4434 to +4437
case f: FileSourceScanExec if f.requiredSchema.fields
.exists(_.name.startsWith("_databricks_internal")) =>
logDebug(s"Fallback for FileSourceScanExec with _databricks_internal: $f")
true
Contributor Author


I will look at adding a shim so we only apply this check on Databricks
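To illustrate the shim idea discussed here, a minimal sketch of how the name-prefix check could live behind a Databricks-only shim object. The names `DatabricksShims` and `shouldFallBackToCpu` are hypothetical, for illustration only, and do not reflect the plugin's actual shim API; the prefix check itself mirrors the snippet above.

```scala
// Hypothetical sketch: isolate the _databricks_internal check so that a
// Databricks shim can supply the real predicate while other platforms
// return false. Names here are illustrative, not the plugin's actual API.
object DatabricksShims {
  private val InternalPrefix = "_databricks_internal"

  // True when any required column is a Databricks-internal computed column,
  // meaning the Parquet scan must stay on the CPU.
  def shouldFallBackToCpu(requiredFieldNames: Seq[String]): Boolean =
    requiredFieldNames.exists(_.startsWith(InternalPrefix))
}

object ShimSketch extends App {
  // Mirrors the ReadSchema from the PR description.
  assert(DatabricksShims.shouldFallBackToCpu(
    Seq("country", "_databricks_internal_edge_computed_column_row_index")))
  // An ordinary Delta table read is unaffected.
  assert(!DatabricksShims.shouldFallBackToCpu(Seq("country", "price")))
  println("ok")
}
```

On non-Databricks builds, the shim would simply return false, so the extra `FileSourceScanExec` match arm never forces a CPU fallback there.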

@mythrocks
Collaborator

👍 I have verified that this change allows the MERGE command to fall back to CPU mode.
(Note that I tested this change alongside #6230.)

I'm inclined to call this "Plan A" rather than just the fallback position for 22.08. The _databricks_internal_ columns might carry implementation-specific connotations (for computing cardinality, etc.), so it might be best not to attempt GPU acceleration for them right now.

We have verified that ordinary Delta table reads are not affected: only the MERGE query falls back.

@andygrove andygrove changed the title WIP: Fallback to CPU for Parquet reads with _databricks_internal columns WIP: Fallback to CPU for Parquet reads with _databricks_internal columns [databricks] Aug 9, 2022
@andygrove
Contributor Author

build

@andygrove
Contributor Author

build

@andygrove andygrove changed the title WIP: Fallback to CPU for Parquet reads with _databricks_internal columns [databricks] Fallback to CPU for Parquet reads with _databricks_internal columns [databricks] Aug 10, 2022
@andygrove
Contributor Author

I still plan on making the shim change, but tests are now passing, and this could be merged as is for 22.08 with a follow-up issue to use shims in 22.10.

@andygrove
Contributor Author

I filed #6279 for the follow-up issue.

@andygrove andygrove merged commit f3f6bab into NVIDIA:branch-22.08 Aug 10, 2022
@andygrove andygrove deleted the fallback-db-internal branch August 10, 2022 15:49