Add support for writing deletion vectors in Delta Lake 2.4 #8674

Closed

andygrove
Contributor

@andygrove andygrove commented Jul 7, 2023

Closes #8554

Changes in this PR:

  • Stop falling back to CPU in GpuDeleteCommand when deletion vectors are enabled
  • Implement logic from OSS Delta Lake 2.4 for writing deletion vectors. Note that we are not necessarily GPU-accelerating anything here but just matching the behavior of OSS Delta Lake
  • Add test to ensure behavior is correct
  • Update Delta Log comparison to handle deletionVector structures

Note that GPU-accelerating the metadata queries involved will not be trivial due to row-based UDFs, custom data types, and roaring bitmap aggregation operators, which we do not support on GPU.
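
For readers unfamiliar with the feature, the idea behind a deletion vector can be sketched as follows. This is a minimal conceptual illustration, not code from this PR or from Delta Lake: the `DataFile` class and its methods are hypothetical names, and a plain Python set stands in for the roaring bitmap that Delta Lake actually serializes. It shows that a delete only records row positions and that readers filter those positions out at scan time, leaving the data file itself unchanged.

```python
from dataclasses import dataclass, field


@dataclass
class DataFile:
    """Hypothetical stand-in for an immutable data file plus its deletion vector."""
    rows: list
    # Positions of rows marked as deleted. Delta Lake stores these as a
    # roaring bitmap; a plain set is an assumption made for this sketch.
    deleted_positions: set = field(default_factory=set)

    def delete_where(self, predicate):
        """Mark matching rows as deleted without rewriting the file."""
        newly_deleted = {
            i for i, row in enumerate(self.rows)
            if i not in self.deleted_positions and predicate(row)
        }
        self.deleted_positions |= newly_deleted
        return len(newly_deleted)

    def scan(self):
        """Return only the rows that are not masked by the deletion vector."""
        return [
            row for i, row in enumerate(self.rows)
            if i not in self.deleted_positions
        ]
```

In this sketch, finding the positions to delete corresponds to the metadata queries mentioned above, which is the part that would stay on the CPU.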

@andygrove andygrove self-assigned this Jul 7, 2023
@andygrove andygrove force-pushed the delta-lake-deletion-vector branch 2 times, most recently from 994ee73 to 1221a0e Compare July 8, 2023 00:10
Signed-off-by: Andy Grove <andygrove@nvidia.com>
@andygrove
Contributor Author

test

@andygrove
Contributor Author

build

@andygrove
Contributor Author

Databricks build failing due to #8726

@andygrove
Contributor Author

build

@andygrove andygrove changed the title from "WIP: Add support for writing deletion vectors in Delta Lake 2.4" to "Add support for writing deletion vectors in Delta Lake 2.4" Jul 19, 2023
@andygrove andygrove marked this pull request as ready for review July 19, 2023 17:29
@andygrove
Contributor Author

build

@andygrove
Contributor Author

build

@andygrove
Contributor Author

build

@andygrove
Contributor Author

build

@andygrove
Contributor Author

Build failed with a seemingly unrelated issue:

FileCacheIntegrationSuite:
- filecache metrics v1 Parquet
- filecache metrics v2 Parquet
- filecache metrics v1 ORC *** FAILED ***
  0 was not greater than 0 (FileCacheIntegrationSuite.scala:175)

@andygrove
Contributor Author

build

Member

@jlowe jlowe left a comment

It doesn't feel like we're properly supporting deletion vectors if we're sending the bulk of the data through a CPU UDF. That appears to be the case after glancing at what DeleteWithDeletionVectorsHelper.findTouchedFiles does.

Also what about spark332db which also supports deletion vectors?

@jlowe
Member

jlowe commented Jul 24, 2023

Note that we've built GPU versions of Delta Lake CPU UDFs before, and we should be able to do something similar here if necessary, although I haven't scoped the effort.

If we've done benchmarking showing that partially supporting deletion vectors (i.e., falling back to the CPU to compute the vector values before writing via GPU) significantly outperforms falling back to the CPU to do the entire delete operation, then this would be worth committing.

@andygrove
Contributor Author

Also what about spark332db which also supports deletion vectors?

I'm working on that as a separate PR since it is much more involved. The tracking issue is #8654

@andygrove
Contributor Author

Note that we've built GPU versions of Delta Lake CPU UDFs before, and we should be able to do something similar here if necessary, although I haven't scoped the effort.

At first glance, it does not look trivial to implement on the GPU, since it looks like we would need to support roaring bitmap format vectors.

If we've done benchmarking showing that partially supporting deletion vectors (i.e., falling back to the CPU to compute the vector values before writing via GPU) significantly outperforms falling back to the CPU to do the entire delete operation, then this would be worth committing.

We have not benchmarked this. I will move this PR to draft for now but perhaps this should just be closed until we decide to fully GPU-accelerate the deletion vector writes.
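
To give a sense of why the roaring bitmap format mentioned above is not a simple flat bit array, here is a rough sketch of its container layout for 32-bit values. It is an illustration under stated assumptions only: `build_containers` and `contains` are hypothetical helper names, run containers are omitted, and the serialization that Delta Lake actually writes is not reproduced. Values are grouped by their high 16 bits, and each group is stored either as a sorted array of low 16-bit values (up to 4096 entries) or as a 65536-bit bitmap.

```python
from bisect import bisect_left

# Cardinality above which a roaring container switches from a sorted
# array of 16-bit values to a 65536-bit bitmap.
ARRAY_LIMIT = 4096


def build_containers(values):
    """Group unsigned 32-bit ints by their high 16 bits.

    Returns a dict mapping high bits to ("array", sorted_list) or
    ("bitmap", int), where the int is used as a bit set.
    """
    groups = {}
    for v in values:
        if not 0 <= v < 2 ** 32:
            raise ValueError(f"value out of 32-bit range: {v}")
        groups.setdefault(v >> 16, set()).add(v & 0xFFFF)

    containers = {}
    for high, lows in groups.items():
        if len(lows) <= ARRAY_LIMIT:
            containers[high] = ("array", sorted(lows))
        else:
            bits = 0
            for low in lows:
                bits |= 1 << low
            containers[high] = ("bitmap", bits)
    return containers


def contains(containers, value):
    """Check membership by locating the container, then the low 16 bits."""
    entry = containers.get(value >> 16)
    if entry is None:
        return False
    kind, payload = entry
    low = value & 0xFFFF
    if kind == "array":
        i = bisect_left(payload, low)  # binary search in the sorted array
        return i < len(payload) and payload[i] == low
    return (payload >> low) & 1 == 1
```

A GPU implementation would have to handle both container kinds, plus the aggregation that merges per-row positions into these containers, which is part of what makes the work non-trivial.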

@andygrove andygrove marked this pull request as draft July 26, 2023 19:18
@andygrove andygrove closed this Aug 1, 2023
@andygrove andygrove deleted the delta-lake-deletion-vector branch January 2, 2024 17:33
Development

Successfully merging this pull request may close these issues.

[FEA] [Delta Lake] Add support for deletion vectors in OSS Delta Lake
3 participants