Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for Delta Lake 2.4.0 [databricks] #8573

Merged
merged 46 commits into from
Jul 8, 2023

Conversation

andygrove
Copy link
Contributor

@andygrove andygrove commented Jun 13, 2023

Part of #8547

Closes #8556

This PR adds shims for Delta Lake 2.4.0, which is only compatible with Spark 3.4.x

The approach follows the current approach for Delta Lake shims, where each version is largely the same code, with minor differences. This makes it easier to diff against the Delta Lake project.

I started by copying the 22x shims as the basis for 24x, then performed a diff with the Delta Lake 2.4 source and pulled in some changes.

The scope of this PR is to get current Delta Lake tests passing with 2.4, and does not aim to support all new features in Delta Lake 2.4 on the GPU.

Delta Lake 2.4.0 release notes: https://github.com/delta-io/delta/releases/tag/v2.4.0

Follow-on issues:

Signed-off-by: Andy Grove <andygrove@nvidia.com>
@andygrove andygrove self-assigned this Jun 13, 2023
@andygrove
Copy link
Contributor Author

build

@andygrove andygrove marked this pull request as ready for review June 13, 2023 21:56
@andygrove andygrove changed the title WIP: Add support for Delta Lake 2.4.0 Add support for Delta Lake 2.4.0 Jun 13, 2023
@sameerz sameerz added the feature request New feature or request label Jun 14, 2023
@andygrove andygrove marked this pull request as draft June 14, 2023 15:29
@andygrove andygrove changed the title Add support for Delta Lake 2.4.0 WIP: Add support for Delta Lake 2.4.0 Jun 14, 2023
@andygrove
Copy link
Contributor Author

Moving this back to draft while I audit differences to Delta Lake implementation

Comment on lines 62 to 68
for k in "executionTimeMs", "numOutputBytes", "rewriteTimeMs", "scanTimeMs", \
"numRemovedBytes", "numAddedBytes", "numTargetBytesAdded", "numTargetBytesInserted", \
"numTargetBytesUpdated", "numTargetBytesRemoved", \
'numTargetRowsMatchedUpdated', \
'numTargetRowsMatchedDeleted', \
'numTargetRowsNotMatchedBySourceUpdated', \
'numTargetRowsNotMatchedBySourceDeleted':
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for easier maintenance in the future, I would create a separate variable with the list of metrics one per line,

metrics_to_remove = [
  'executionTimeMs', 
  'numOutputBytes', 
  ...
]

nit: I'd stick to some consistent use of quotes or double-quotes in a single code fragment at least

andygrove and others added 3 commits July 5, 2023 07:21
…lta/delta24x/DeleteCommandMeta.scala

Co-authored-by: Jason Lowe <jlowe@nvidia.com>
…/rapids/delta24x/GpuMergeIntoCommand.scala

Co-authored-by: Jason Lowe <jlowe@nvidia.com>
@andygrove
Copy link
Contributor Author

build

1 similar comment
@andygrove
Copy link
Contributor Author

build

@andygrove
Copy link
Contributor Author

build

@andygrove
Copy link
Contributor Author

build

@andygrove
Copy link
Contributor Author

build

jlowe
jlowe previously approved these changes Jul 6, 2023
@andygrove
Copy link
Contributor Author

build

@andygrove
Copy link
Contributor Author

Integration tests are failing on Databricks 12.2 due to GC pauses causing timeouts. I can reproduce manually by running delta_lake_merge_test.py. Here are examples of warnings that I see:

23/07/07 18:18:33 WARN DBRDebuggerEventReporter: Driver/10.59.237.142 paused the JVM process 110 seconds during the past 120 seconds (92.09%) because of GC. We observed 105 such issue(s) since 2023-07-07T18:13:26.473Z.

23/07/07 18:20:56 WARN TaskMemoryManager: Failed to allocate a page (8388608 bytes), try again.

@jlowe
Copy link
Member

jlowe commented Jul 7, 2023

We had problems in the past due to aggressive caching of Delta Log snapshots. Wonder if the same thing or something similar is happening. We should be specifying -Ddelta.log.cacheSize=10 during the tests, make sure that's getting through. Could get a heap dump, maybe there's another cache that needs a limit?

@andygrove
Copy link
Contributor Author

It looks like I went too far when refactoring setup_dest_tables. I have reverted some of those changes and verified that delta_lake_merge_test and delta_lake_delete_test now run without issue on DBR 12.2

@andygrove
Copy link
Contributor Author

build

@andygrove andygrove merged commit 34f09ae into NVIDIA:branch-23.08 Jul 8, 2023
24 checks passed
@andygrove andygrove deleted the delta-lake-24 branch July 8, 2023 00:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] [Delta Lake] Add support for new metrics in MERGE
4 participants