Deduplication support in RewriteDataFilesAction #2764

Closed
skandasa23 opened this issue Jun 30, 2021 · 10 comments

@skandasa23

RewriteDataFilesAction compacts smaller files into bigger files, with the target size configured through targetSizeInBytes. Take the example of compacting all of a day's files after the day closes: duplicate events could have arrived at different points in time during the day, so compaction with deduplication would be useful.
Deduplication could be an additional method in RewriteDataFilesAction, turned off by default.
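
To make the proposal concrete, here is a minimal sketch in plain Python, not the Iceberg API. The names `compact` and `dedup_key` are hypothetical, files are modeled as lists of row dicts, and a row count stands in for targetSizeInBytes. Deduplication is off unless a key function is passed.

```python
from typing import Callable, Hashable, List, Optional

Row = dict


def compact(
    files: List[List[Row]],
    target_rows: int,
    dedup_key: Optional[Callable[[Row], Hashable]] = None,
) -> List[List[Row]]:
    """Merge many small files into files of at most target_rows rows.

    dedup_key is a hypothetical option (off by default). When given,
    only the first row seen for each key is kept.
    """
    seen = set()
    merged = []
    for rows in files:
        for row in rows:
            if dedup_key is not None:
                key = dedup_key(row)
                if key in seen:
                    continue  # duplicate event, drop it
                seen.add(key)
            merged.append(row)
    # Re-split the merged rows into output files of the target size.
    return [merged[i:i + target_rows] for i in range(0, len(merged), target_rows)]
```

Called without `dedup_key`, this only regroups rows. Called with something like `dedup_key=lambda r: r["id"]`, it also drops later copies of an event that arrived more than once during the day.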

@skandasa23
Author

@jerryshao @rdblue @aokolnychyi Can you please share your thoughts?

@RussellSpitzer
Member

In the new rewriteDataFiles, we could add this as a new strategy ...

@skandasa23
Author

Thanks @RussellSpitzer for getting back on this. Can you please point me to any doc/PR describing the new rewriteDataFiles? Thanks!

@RussellSpitzer
Member

#2501 - #2585 - #2591 - #2379

@rdblue
Contributor

rdblue commented Jul 5, 2021

I think that we would need to be careful about this. Deduplication makes sense, but it is an overwrite and not a replace operation because it modifies the data in the table. Compaction is a pure replace that does not modify the table data. We should keep the two separate because replace operations can be safely ignored when looking for changes to the table data.
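
One way to picture the distinction, as a sketch in plain Python rather than anything in Iceberg: a rewrite is a replace if the multiset of rows is the same before and after, and an overwrite otherwise. The helper names `freeze` and `classify_rewrite` are made up for this illustration, and rows are assumed to be flat dicts with hashable values.

```python
from collections import Counter


def freeze(row):
    """Turn a flat row dict into a hashable, order-independent tuple."""
    return tuple(sorted(row.items()))


def classify_rewrite(rows_before, rows_after):
    """Return "replace" if the table holds the same rows (with the same
    multiplicity) after the rewrite, and "overwrite" if the data changed."""
    before = Counter(freeze(r) for r in rows_before)
    after = Counter(freeze(r) for r in rows_after)
    return "replace" if before == after else "overwrite"
```

Compaction only moves rows between files, so it classifies as a replace. Deduplication lowers the count of at least one row, so it classifies as an overwrite.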

@skandasa23
Author

[Deleted my previous comment, hadn't tried out overwrite earlier]
Thank you @rdblue for the comment, makes sense to mark the operation as overwrite.
I had a look at the code here: for incremental downstream consumers, only "append" snapshots are considered, deletes are ignored, and an overwrite results in an exception.
Is there a plan to propagate the mutations of delete/overwrite operations as well in the future? @rdsr thoughts?
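
The behaviour described above can be sketched roughly as follows. This is plain Python, not the actual Iceberg code; the `Snapshot` class and `snapshots_to_consume` function are hypothetical, and skipping "replace" snapshots is taken from the earlier comment that replaces can be safely ignored.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    operation: str  # "append", "replace", "overwrite" or "delete"


def snapshots_to_consume(snapshots: List[Snapshot]) -> List[Snapshot]:
    """Pick the snapshots an append-only incremental reader would process."""
    selected = []
    for snap in snapshots:
        if snap.operation == "overwrite":
            # An overwrite changes table data in a way this reader cannot express.
            raise NotImplementedError(
                f"snapshot {snap.snapshot_id} is an overwrite; "
                "incremental read supports appends only"
            )
        if snap.operation == "append":
            selected.append(snap)
        # "delete" and "replace" snapshots are skipped without error.
    return selected
```

Under this model, a deduplicating compaction committed as an overwrite would stop such a consumer, while a plain compaction committed as a replace would go unnoticed.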

@rdblue
Contributor

rdblue commented Jul 30, 2021

Bulk operations are difficult to propagate as mutations. Most of the early incremental consumption focuses on the easy case, appends, for that reason. But there are ways to make it work. For example, you could read all the deleted files and all the added files in an overwrite and use a full outer join to label each row deleted, added, or kept and then feed those rows into incremental processing. That join is expensive, though.
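
As a rough illustration of that join, here is a sketch in plain Python. It assumes the rows of the deleted and added files fit in memory as flat dicts with hashable, mutually comparable values; the names `freeze` and `label_changes` are hypothetical. Counting rows on each side plays the role of the full outer join.

```python
from collections import Counter


def freeze(row):
    """Turn a flat row dict into a hashable, order-independent tuple."""
    return tuple(sorted(row.items()))


def label_changes(deleted_file_rows, added_file_rows):
    """Label every row touched by an overwrite as kept, deleted, or added.

    A row present on both sides is "kept"; surplus copies on the deleted
    side are "deleted"; surplus copies on the added side are "added".
    """
    deleted = Counter(freeze(r) for r in deleted_file_rows)
    added = Counter(freeze(r) for r in added_file_rows)
    labeled = []
    for row in sorted(deleted.keys() | added.keys()):
        kept = min(deleted[row], added[row])
        for label, count in (
            ("kept", kept),
            ("deleted", deleted[row] - kept),
            ("added", added[row] - kept),
        ):
            labeled.extend((label, dict(row)) for _ in range(count))
    return labeled
```

For a deduplicating rewrite, most rows come out as "kept" and only the extra copies come out as "deleted", which shows why the join does a lot of work for few actual changes.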

Another strategy is to read just the added files and consider all of the changes sort of an upsert operation. As long as you know that all of the data coming in replaces 0 or 1 rows, then you may not need to know what the previous row was. The problem is that this makes assumptions about the operation that happened and doesn't apply to all cases. For your deduplication case, it wouldn't tell you when duplicate rows are removed.
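
A sketch of the upsert interpretation, again in plain Python with hypothetical names (`upsert_events`, `apply_events`) and an assumed `id` key column. Only the added files are read, and each row is treated as replacing 0 or 1 existing rows downstream.

```python
def upsert_events(added_file_rows, key="id"):
    """Turn the rows of the added files into one upsert event per key.

    Assumes each incoming row replaces 0 or 1 existing rows, so the
    previous row never needs to be looked up.
    """
    latest = {}
    for row in added_file_rows:
        latest[row[key]] = row  # the last row seen for a key wins
    return [("upsert", k, row) for k, row in latest.items()]


def apply_events(state, events):
    """Apply upsert events to a downstream key -> row mapping."""
    new_state = dict(state)
    for _kind, k, row in events:
        new_state[k] = row
    return new_state
```

Note that the event stream contains only upserts. If the overwrite was a deduplication, nothing in it says that an extra copy of a row went away.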

I think the easier solution is turning a row-delta commit into changes because you can read the deleted rows and added rows directly without needing to process (and usually discard) the kept rows.
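
For contrast with the join approach, a sketch of what this could look like, assuming the commit already exposes its deleted rows and added rows as two separate collections. The function name `row_delta_changes` is hypothetical and this is not the Iceberg API.

```python
def row_delta_changes(deleted_rows, added_rows):
    """Build a change log straight from a row-level commit.

    No join is needed and kept rows never appear, because the commit
    itself already says which rows were deleted and which were added.
    """
    changes = [("delete", row) for row in deleted_rows]
    changes += [("insert", row) for row in added_rows]
    return changes
```

A deduplication expressed this way would carry only the removed duplicate copies as deletes, with no added rows at all.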

@skandasa23
Author

Thank you for the detailed comments, @rdblue. Issue 2782 seems to be tracking the delete/overwrite mutations already; I wasn't aware of it earlier.
By row-delta commit, I guess you are talking about merge-on-read semantics?
I'm new to Iceberg and catching up on the newer rewrite commits. I will check those and get back on the approach for deduplication support during compaction.

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.

@github-actions github-actions bot added the stale label Jun 18, 2024

github-actions bot commented Jul 3, 2024

This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Jul 3, 2024
3 participants