Deduplication support in RewriteDataFilesAction #2764
Comments
@jerryshao @rdblue @aokolnychyi Can you please share your thoughts?
In the new rewriteDataFiles, we could add this as a new strategy ...
Thanks @RussellSpitzer for getting back on this. Can you please point me to any doc/PR describing this new rewriteDataFiles? Thanks!
I think that we would need to be careful about this. Deduplication makes sense, but it is an overwrite and not a replace operation because it modifies the data in the table. Compaction is a pure replace that does not modify the table data. We should keep the two separate because replace operations can be safely ignored when looking for changes to the table data.
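To make the replace/overwrite distinction above concrete, here is a minimal sketch. It is not Iceberg code: files are modeled as plain lists of rows, and `classify_commit` is a hypothetical helper that only checks whether the set of rows (with multiplicity) is unchanged by the rewrite.

```python
from collections import Counter

def classify_commit(deleted_files, added_files):
    """Return "replace" if the rewrite keeps the same rows, else "overwrite".

    Each file is modeled as a list of hashable rows; this is an assumption
    made for illustration, not how Iceberg represents data files.
    """
    before = Counter(row for f in deleted_files for row in f)
    after = Counter(row for f in added_files for row in f)
    return "replace" if before == after else "overwrite"

# Event e1 arrived twice during the day, in two different small files.
small_files = [[("e1", 10)], [("e2", 20), ("e1", 10)]]
compacted = [[("e1", 10), ("e2", 20), ("e1", 10)]]  # same rows, one file
deduplicated = [[("e1", 10), ("e2", 20)]]           # duplicate e1 dropped

print(classify_commit(small_files, compacted))     # replace
print(classify_commit(small_files, deduplicated))  # overwrite
```

Under this model, a consumer looking for data changes could skip the first commit but not the second.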
[Deleted my previous comment, hadn't tried out overwrite earlier]
Bulk operations are difficult to propagate as mutations. Most of the early incremental consumption focuses on the easy case, appends, for that reason. But there are ways to make it work. For example, you could read all the deleted files and all the added files in an overwrite and use a full outer join to label each row deleted, added, or kept, and then feed those rows into incremental processing. That join is expensive, though. Another strategy is to read just the added files and consider all of the changes sort of an [...]. I think the easier solution is turning a row-delta commit into changes, because you can read the deleted rows and added rows directly without needing to process (and usually discard) the kept rows.
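The full outer join idea from the comment above could look roughly like the following sketch. It assumes files are in-memory lists of rows and that whole rows are compared; `label_changes` is a hypothetical name, and a real implementation would run as a distributed join rather than in Python.

```python
from collections import Counter

def label_changes(deleted_files, added_files):
    """Label every row of an overwrite as "kept", "deleted", or "added".

    Equivalent in spirit to a full outer join of the deleted files against
    the added files, with duplicates handled by counting occurrences.
    """
    deleted = Counter(row for f in deleted_files for row in f)
    added = Counter(row for f in added_files for row in f)
    labeled = []
    for row in sorted(deleted.keys() | added.keys()):
        kept = min(deleted[row], added[row])  # present on both sides
        labeled += [(row, "kept")] * kept
        labeled += [(row, "deleted")] * (deleted[row] - kept)
        labeled += [(row, "added")] * (added[row] - kept)
    return labeled

deleted_files = [[("e1", 10)], [("e2", 20), ("e1", 10)]]
added_files = [[("e1", 10), ("e2", 20), ("e3", 30)]]

for row, label in label_changes(deleted_files, added_files):
    print(row, label)
```

Note that every kept row still has to be read and matched before it can be discarded, which is the cost the comment refers to.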
Thank you for the detailed comments, @rdblue. Issue 2782 seems to be tracking the delete/overwrite mutations already; I wasn't aware of it earlier.
This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.
This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'
RewriteDataFilesAction compacts small files into larger ones, with the output size controlled by targetSizeInBytes. Take the example of compacting all of a day's files after the day closes: duplicate events may have arrived at different points in time during the day. Compaction with deduplication would be useful in that case.
Deduplication could be an additional method in RewriteDataFilesAction, turned off by default.
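As a rough illustration of the proposal (not the RewriteDataFilesAction API), the sketch below models compaction with an opt-in deduplication flag. The names `rewrite_files`, `target_rows_per_file`, `deduplicate`, and `key` are all hypothetical, and the target size is expressed in rows rather than bytes to keep the example small.

```python
def rewrite_files(files, target_rows_per_file, deduplicate=False, key=lambda row: row):
    """Combine small files into larger ones, optionally dropping duplicates.

    files: list of files, each a list of rows (an in-memory stand-in).
    deduplicate: off by default, so the default rewrite keeps every row.
    key: function extracting the value that identifies a duplicate.
    """
    rows = [row for f in files for row in f]
    if deduplicate:
        seen = set()
        unique = []
        for row in rows:
            k = key(row)
            if k not in seen:  # keep only the first occurrence of each key
                seen.add(k)
                unique.append(row)
        rows = unique
    # Split the combined rows into files of at most target_rows_per_file rows.
    return [rows[i:i + target_rows_per_file]
            for i in range(0, len(rows), target_rows_per_file)]

day_files = [[("e1", 10)], [("e2", 20), ("e1", 10)], [("e3", 30)]]

print(rewrite_files(day_files, 2))
print(rewrite_files(day_files, 2, deduplicate=True))
```

With the flag off, the rewrite only regroups rows; with it on, the duplicate `e1` event is dropped, so the table data changes.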