
Fallback to CPU for Parquet reads with _databricks_internal columns [databricks] #6257

Merged (7 commits) on Aug 10, 2022

Conversation

andygrove
Contributor

When a Delta Lake MERGE operation runs on Databricks, the generated query relies on receiving valid data from computed columns such as _databricks_internal_edge_computed_column_row_index and _databricks_internal_edge_computed_column_skip_row during the Parquet read.

For example:

+- FileScan parquet [country#755,_databricks_internal_edge_computed_column_row_index#963L] Batched: true, DataFilters: [], Format: Parquet, Location: TahoeBatchFileIndex(1 paths)[file:/tmp/myth/delta/corpus], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<country:string,_databricks_internal_edge_computed_column_row_index:bigint>

These columns do not exist in the Parquet files themselves, so they cannot be produced when the files are read on the GPU; we need to fall back to the CPU in this case.

This PR is still WIP because there are no tests yet.

Signed-off-by: Andy Grove <andygrove@nvidia.com>
@andygrove andygrove added the bug Something isn't working label Aug 8, 2022
@andygrove andygrove added this to the Aug 8 - Aug 19 milestone Aug 8, 2022
@andygrove andygrove self-assigned this Aug 8, 2022
Comment on lines +4434 to +4437
case f: FileSourceScanExec if f.requiredSchema.fields
.exists(_.name.startsWith("_databricks_internal")) =>
logDebug(s"Fallback for FileSourceScanExec with _databricks_internal: $f")
true
Contributor Author


I will look at adding a shim so we only apply this check on Databricks
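To illustrate the shim idea discussed here, a minimal sketch of how the name-prefix check could live behind a Databricks-only shim object. The names `DatabricksShims` and `shouldFallBackToCpu` are hypothetical, for illustration only, and do not reflect the plugin's actual shim API; the prefix check itself mirrors the snippet above.

```scala
// Hypothetical sketch: isolate the _databricks_internal check so that a
// Databricks shim can supply the real predicate while other platforms
// return false. Names here are illustrative, not the plugin's actual API.
object DatabricksShims {
  private val InternalPrefix = "_databricks_internal"

  // True when any required column is a Databricks-internal computed column,
  // meaning the Parquet scan must stay on the CPU.
  def shouldFallBackToCpu(requiredFieldNames: Seq[String]): Boolean =
    requiredFieldNames.exists(_.startsWith(InternalPrefix))
}

object ShimSketch extends App {
  // Mirrors the ReadSchema from the PR description.
  assert(DatabricksShims.shouldFallBackToCpu(
    Seq("country", "_databricks_internal_edge_computed_column_row_index")))
  // An ordinary Delta table read is unaffected.
  assert(!DatabricksShims.shouldFallBackToCpu(Seq("country", "price")))
  println("ok")
}
```

On non-Databricks builds, the shim would simply return false, so the extra `FileSourceScanExec` match arm never forces a CPU fallback there.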

@mythrocks
Collaborator

👍 I have verified that this change allows the MERGE command to fall back to CPU mode.
(Note that I tested this change alongside #6230.)

I'm inclined to call this "Plan A" rather than just the fallback position for 22.08. The _databricks_internal_ columns might carry implementation-specific connotations (for computing cardinality, etc.), so it might be best not to attempt GPU acceleration for them right now.

We have verified that ordinary Delta table reads are not affected: only the MERGE query falls back.

@andygrove andygrove changed the title WIP: Fallback to CPU for Parquet reads with _databricks_internal columns WIP: Fallback to CPU for Parquet reads with _databricks_internal columns [databricks] Aug 9, 2022
@andygrove
Contributor Author

build

@andygrove
Contributor Author

build

@andygrove andygrove changed the title WIP: Fallback to CPU for Parquet reads with _databricks_internal columns [databricks] Fallback to CPU for Parquet reads with _databricks_internal columns [databricks] Aug 10, 2022
@andygrove
Contributor Author

I still plan on making the shim change, but tests are now passing, and this could be merged as is for 22.08 with a follow-up issue to use shims in 22.10.

@andygrove
Contributor Author

I filed #6279 for the follow-up issue.

@andygrove andygrove merged commit f3f6bab into NVIDIA:branch-22.08 Aug 10, 2022
@andygrove andygrove deleted the fallback-db-internal branch August 10, 2022 15:49