[HUDI-4678][RFC-61] RFC for Snapshot view management #6576

Open
wants to merge 10 commits into
base: master
[HUDI-4678] initial RFC
jian.feng committed Nov 20, 2022
commit 113ee82992cda560212cdf5f3496c0706ebde604
Binary file added rfc/rfc-61/img.png
Binary file added rfc/rfc-61/img_2.png
37 changes: 33 additions & 4 deletions rfc/rfc-61/rfc-61.md
@@ -34,8 +34,6 @@ JIRA: [HUDI-4677](https://issues.apache.org/jira/browse/HUDI-4677)

## Abstract

Describe the problem you are trying to solve and a brief description of why it’s needed

For the snapshot view scenario, Hudi already provides two key features to support it:
* Time travel: user provides a timestamp to query a specific snapshot view of a Hudi table
* Savepoint/restore: "savepoint" saves the table as of the commit time so that it lets you restore the table to this savepoint at a later point in time if need be.
@@ -50,8 +48,39 @@ What this RFC plan to do is to let Hudi support release a snapshot view and life

## Background
Introduce any much background context which is relevant or necessary to understand the feature and design choices.
typical scenarios of snapshot view:

typical scenarios and benefits of snapshot view:
1. Basic idea:
fengjian428 marked this conversation as resolved.
![img.png](img.png)

Create Snapshot view based on Hudi Savepoint
* Create Snapshot views periodically by time (date time/processing time)
* Use HMS to store view metadata

Build periodic snapshots based on the time period required by the user.
These snapshots are stored as partitions in the metadata management system.
Member

don't quite get the "stored as partitions" - do you mean each snapshot's info is saved in a partition of some metadata table in the catalog?

Contributor Author

sorry, snapshots should be stored as tables, will correct it

Users can easily use SQL to access this data in Flink, Spark, or Presto.
Because the stored data is complete and has no merged details,
it supports both full-data computation and incremental processing.
Member

how would this incremental processing of snapshot views differ from the existing incremental processing of the hudi table itself? is it intended for a bigger incremental pull window? if this understanding is correct, then in between the snapshots there will be missing original commits with changed data. not sure how practical this is.

Contributor Author

@fengjian428 fengjian428 Dec 5, 2022

yes, users can only get one snapshot per savepoint, I think it may satisfy some SCD scenarios. WDYT? @xushiyan


2. Compare to Hive solution
![img_2.png](img_2.png)

The Snapshot view is created based on Hudi Savepoint, which significantly reduces the data storage space of some large tables
* The space usage becomes (1 + (t-1) * p)/t
* Incremental use reduces the amount of data involved in the calculation
Member

same formatting issue


With snapshot view storage, scenarios where only a small proportion of the data changes achieve better storage savings.
Member

savepoint commit will record all base files at that point of time and those files will be retained in the hudi table. so it's still the full data at that point. what storage saving is this compared against?

Contributor Author

if some base files have not changed between two savepoints, these two savepoints can share those base files instead of retaining two full copies of the data
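
To make the sharing argument concrete, here is a minimal sketch that models each savepoint as the set of base files it references. The file names and the helper `retained_files` are hypothetical and only illustrate the counting; this is not how Hudi tracks files internally.

```python
from typing import List, Set


def retained_files(savepoints: List[Set[str]]) -> int:
    """Count the distinct base files kept alive by a list of savepoints.

    Each savepoint is modeled (as an assumption) as the set of base-file
    names it references; a file shared by several savepoints is stored once.
    """
    return len(set().union(*savepoints))


# Hypothetical table with four file groups; only f1 is rewritten
# between the first and the second savepoint.
sp1 = {"f1_v1", "f2_v1", "f3_v1", "f4_v1"}
sp2 = {"f1_v2", "f2_v1", "f3_v1", "f4_v1"}

shared = retained_files([sp1, sp2])   # files stored when savepoints share
full_copies = len(sp1) + len(sp2)     # files stored with two full copies

print(f"shared: {shared} files, full copies: {full_copies} files")
```

In this toy case the unchanged files f2, f3 and f4 are counted once, so the two savepoints together retain fewer files than two independent full copies would.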

We have a simple formula here to calculate the effect:
p indicates the proportion of changed data, and t indicates the number of time periods to be saved.
The lower the proportion of changed data, the better the storage savings.
There are also good savings when data is kept for long periods.

At the same time, it saves computing resources for incremental processing.
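
As a rough worked example, the formula (1 + (t-1) * p)/t can be evaluated for a few values of p and t. The sketch below assumes the result is read as storage relative to keeping t full copies, where the first snapshot costs one full copy and each later one adds only the changed proportion p; the function name `snapshot_space_ratio` is made up for illustration.

```python
def snapshot_space_ratio(p: float, t: int) -> float:
    """Evaluate (1 + (t - 1) * p) / t from the RFC.

    p: proportion of data that changes per period (0.0 to 1.0)
    t: number of time periods (snapshots) to be saved
    Assumed interpretation: storage relative to t full copies.
    """
    if t < 1:
        raise ValueError("t must be at least 1")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    return (1 + (t - 1) * p) / t


# Print the ratio for a few illustrative (p, t) pairs.
for p, t in [(0.1, 7), (0.1, 30), (0.5, 30), (1.0, 30)]:
    print(f"p={p:.1f}, t={t:2d} -> ratio {snapshot_space_ratio(p, t):.3f}")
```

Under this reading, the ratio approaches p as t grows, and it equals 1 (no savings) when all data changes in every period.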

3. Typical scenarios
* Convenient time travel over long time ranges
* More flexible pipeline scheduling and execution

## Implementation
Describe the new thing you want to do in appropriate detail, how it fits into the project architecture.
Provide a detailed description of how you intend to implement this feature. This may be fairly extensive and have large subsections of its own.