Allow snapshotting iceberg table (create new table based on certain Iceberg snapshot) #2481

szehon-ho · 2021-04-15T15:14:22Z

Had a use case to experiment on the side with a certain table snapshot (do some modifications), but didn't want to alter the table's history.

I think a 'snapshot' command would be very useful here. We could quickly generate separate table metadata pointing to the snapshot, instead of copying all the data into a side table for the experiments.

If an Iceberg table is the source, the snapshot procedure could use the latest table snapshot, or potentially take a snapshotId as an argument.

Was chatting with @RussellSpitzer, and the only potential problem is that if you expire the original table's snapshot and remove orphan files, the new table can no longer be read. But that is the same problem as snapshotting a Hive table (dropping files on the original table will corrupt the new table).
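
A minimal sketch of that hazard as a toy Python model. Plain dicts and sets stand in for table metadata and storage; the names `reachable_files`, `original`, `side`, and `storage` are made up for illustration and are not Iceberg APIs.

```python
# Toy model only: dicts and sets stand in for table metadata and storage.
# None of these names are Iceberg APIs.

def reachable_files(table):
    """Return every data file referenced by a snapshot the table still tracks."""
    files = set()
    for snapshot_files in table["snapshots"].values():
        files |= snapshot_files
    return files

# Files physically present in the shared storage location.
storage = {"a.parquet", "b.parquet", "c.parquet"}

# Original table: snapshot 1 wrote a and b; snapshot 2 rewrote them into c.
original = {"snapshots": {1: {"a.parquet", "b.parquet"}, 2: {"c.parquet"}}}

# Side table created from snapshot 1: new metadata, same data files.
side = {"snapshots": {1: set(original["snapshots"][1])}}

# Expire snapshot 1 on the original table, then delete what it no longer reaches.
del original["snapshots"][1]
orphans = storage - reachable_files(original)
storage -= orphans

# The cleanup only consulted the original table, so the side table lost its files.
missing_from_side = reachable_files(side) - storage
print(sorted(missing_from_side))  # ['a.parquet', 'b.parquet']
```

The cleanup is correct from the original table's point of view; the breakage comes from the side table's references being invisible to it.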

szehon-ho · 2021-04-15T15:16:46Z

cc @aokolnychyi for any thoughts, potential issues

szehon-ho · 2021-04-15T15:52:31Z

Actually, looking through the code, this use case can probably be solved by the SnapshotUpdate.stageOnly() flag. Taking a look.

aokolnychyi · 2021-05-05T21:21:26Z

I remember @rdblue @danielcweeks @Parth-Brahmbhatt mentioning similar attempts and that there were some issues but I don't recall any details.

pvary · 2021-05-10T15:04:35Z

We were considering enhancing HiveCatalog to point to a specific snapshot of an Iceberg table.
This could be useful if we want to share a specific version of the table but still want to continue adding more data to the table, but a snapshot table might also solve this problem.

szehon-ho · 2021-05-10T22:54:42Z

Yeah, snapshotting an Iceberg table would be a great usability feature for sharing a certain snapshot of a table with others.

The stageOnly() API probably works for this, but it is not well known and is hard to expose through Spark/Hive in a way that can be shared with other users.
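
For readers unfamiliar with the idea, here is a rough sketch of what a staged commit means, written as a toy Python model rather than against the Iceberg library. `ToyTable`, `commit`, `scan`, and the `stage_only` parameter are hypothetical names for illustration only: the point is that a staged snapshot is recorded in the table metadata but the current-snapshot pointer does not move.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class ToyTable:
    """Toy stand-in for table metadata; not an Iceberg class."""
    snapshots: Dict[int, Set[str]] = field(default_factory=dict)
    current_snapshot_id: Optional[int] = None

    def commit(self, snapshot_id: int, files: Set[str], stage_only: bool = False) -> None:
        # The snapshot is always recorded in the metadata.
        self.snapshots[snapshot_id] = set(files)
        # A staged commit leaves the current pointer where it was.
        if not stage_only:
            self.current_snapshot_id = snapshot_id

    def scan(self, snapshot_id: Optional[int] = None) -> Set[str]:
        # Default reads see the current snapshot; a reader may ask for another one.
        sid = self.current_snapshot_id if snapshot_id is None else snapshot_id
        return self.snapshots[sid]


table = ToyTable()
table.commit(1, {"a.parquet"})
table.commit(2, {"a.parquet", "experiment.parquet"}, stage_only=True)

print(sorted(table.scan()))               # ['a.parquet']
print(sorted(table.scan(snapshot_id=2)))  # ['a.parquet', 'experiment.parquet']
```

Normal readers keep seeing snapshot 1, while anyone who knows the staged snapshot ID can read the experiment. That second part is the usability gap: there has to be a way to hand that ID to Spark or Hive users.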

One risk would be a user running snapshot on an Iceberg table, then dropping it thinking they are only dropping the snapshot.
If I'm not mistaken, dropping the table through the SparkCatalog (purge=true) will drop all the current data of the original table (a general problem of the snapshot command).

jackye1995 · 2021-05-18T03:52:52Z

Yeah this is a common case that I also see many people trying to achieve through Iceberg.

For exposing this to Spark and Hive, would a view help? We could create a "snapshot view" for people to query, for example:

CREATE VIEW my_table_2020 AS SELECT * FROM my_table@1234567890;

Then dropping the view would not affect the underlying snapshot.

jzhuge · 2021-05-18T04:29:57Z

If Iceberg's TableCatalog also supports ViewCatalog, we may be able to create this "snapshot" view on the fly.

pvary · 2021-05-18T10:43:37Z

@jackye1995: In Hive currently there is no way to parse my_table@1234567890. I suspect that in Spark this would mean something like my_table with snapshotId=1234567890. Am I right?

@jzhuge: What is ViewCatalog? Is it a Spark interface we should implement?

jackye1995 · 2021-05-21T22:43:36Z

@pvary yeah, sorry, I am making up this syntax because something similar exists in Delta Lake. So it is definitely doable in Spark, but for Hive, yes, it is going to be hard to add extensions like this.

rdblue · 2021-05-21T22:52:16Z

For this use case, we may want to consider adding git-like branches and tags instead. I think it would be cleaner to branch and then update the branch. Then you'd be able to stay within a table and reuse data files more cleanly. Sharing files across independent tables has a lot of problems.

That would also have a cleaner syntax: catalog.db.table.branch. That would prevent us from using some branch names, like files but I think overall it would be okay.

jackye1995 · 2021-05-22T01:59:45Z

> we may want to consider adding git-like branches and tags instead

@rdblue yes that would be ideal, but Nessie is trying to achieve this git-like experience. Currently it seems like people do want to continue using their catalog and also have that experience. Was there any discussion about this conflict of interest?

rdblue · 2021-05-23T23:25:27Z

I've talked with @rymurr about the way that Nessie currently works and I think we generally agree that we would want to change it to use Iceberg-native branching and tagging.

The problem with Nessie's current model is that it keeps references to multiple metadata files instead of tracking everything in one place. That means:

- We have to coordinate across metadata file versions even though Iceberg assumes that you don't do that: for example, that breaks the file cleanup assumptions because we compare the files that are reachable from all snapshots.
- Changes that shouldn't be part of transactions may change between branches. For example, if you add a column in a branch and write data, you will have assigned a new ID and used it in a data file. If you did that in two branches in parallel, you'd use the same ID for two different columns. It may appear safe to merge the metadata trees, but it actually isn't because that would mix column data together.

We can fix those issues with Iceberg-native branching and tagging. I think that's the right option for use cases where you want to branch from current tables for testing purposes.
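
The column ID problem described above can be sketched with a toy Python model. It is only loosely modeled on how table metadata tracks the last assigned column ID; `add_column`, `base`, `branch_a`, and `branch_b` are hypothetical names, not Iceberg APIs.

```python
import copy


def add_column(metadata, name):
    """Assign the next unused field ID to a new column and return that ID."""
    metadata["last_column_id"] += 1
    new_id = metadata["last_column_id"]
    metadata["schema"][new_id] = name
    return new_id


# Shared starting point: two columns, IDs 1 and 2.
base = {"last_column_id": 2, "schema": {1: "id", 2: "ts"}}

# Each branch keeps its own independent copy of the metadata.
branch_a = copy.deepcopy(base)
branch_b = copy.deepcopy(base)

# Both branches add a different column in parallel.
id_a = add_column(branch_a, "price")
id_b = add_column(branch_b, "country")

print(id_a, id_b)  # 3 3
print(branch_a["schema"][3], branch_b["schema"][3])  # price country
```

Data files written on either branch store values under field ID 3, so merging the two metadata trees would read `price` values as `country` or the other way around. Tracking branches inside a single metadata file means there is one ID counter and the collision cannot happen.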

szehon-ho · 2021-05-25T12:16:17Z

> That would also have a cleaner syntax: catalog.db.table.branch. That would prevent us from using some branch names, like files but I think overall it would be okay.

Seems like a nice way to expose staged-snapshots. We could expose metadata of this branch as well, like catalog.db.table.branch.files?
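
To make the naming trade-off concrete, here is a purely hypothetical resolver in Python. Nothing in this thread specifies how such identifiers would actually be parsed; `resolve` and the short `METADATA_TABLES` list are assumptions for illustration. It shows why a branch could not be called `files` if metadata table names share the same position in the identifier.

```python
# Hypothetical resolver; the real parsing rules are not defined in this thread.
# A few metadata table names, used here only as examples of reserved words.
METADATA_TABLES = {"files", "snapshots", "history", "manifests", "partitions"}


def resolve(identifier):
    """Split catalog.db.table[.branch][.metadata_table] into its parts."""
    catalog, db, table, *rest = identifier.split(".")
    branch = None
    metadata_table = None
    # A trailing name that is not a metadata table is treated as a branch.
    if rest and rest[0] not in METADATA_TABLES:
        branch = rest.pop(0)
    # Whatever follows must be a known metadata table.
    if rest:
        if rest[0] not in METADATA_TABLES:
            raise ValueError(f"unknown metadata table: {rest[0]}")
        metadata_table = rest.pop(0)
    if rest:
        raise ValueError(f"unexpected trailing parts: {rest}")
    return {"catalog": catalog, "db": db, "table": table,
            "branch": branch, "metadata_table": metadata_table}


print(resolve("catalog.db.table.test"))
print(resolve("catalog.db.table.files"))
print(resolve("catalog.db.table.test.files"))
```

Under these assumed rules, `catalog.db.table.files` always resolves to the metadata table, so a branch named `files` would be unreachable, while `catalog.db.table.test.files` reads the files metadata of branch `test`.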

And I suppose Hive can support this easily as well then, if it's already able to parse metadata tables.

pvary · 2021-05-25T13:26:55Z

I also like the idea of branching / tagging tables.

> And I suppose Hive can support this easily as well then, if it's already able to parse metadata tables.

AFAIK we do not have a way to expose table metadata ATM. I am still not sure what would be the best way to allow searching for snapshots etc.

If I remember correctly then Expedia had a way to create specific Hive tables where the schema was the SNAPSHOT_SCHEMA and the content was the list of the snapshots, but that required the user to create a second table just to query the metadata. @massdosage might know more (but this could be a different topic)

jackye1995 · 2021-05-25T17:47:54Z

> I think we generally agree that we would want to change it to use Iceberg-native branching and tagging

That would be great! I also have a few requests on my side regarding this feature.

> AFAIK we do not have a way to expose table metadata ATM

+1, creating an overlay for metadata table is a feasible workaround but is quite inconvenient. I am also interested in knowing if there is any good way to achieve that in Hive, but so far I don't see a way to do it without adding more hooks in Hive.

rymurr · 2021-05-26T08:58:17Z

Hey all, sorry for being late to the party.

I've asked @rdblue if we can discuss this in more detail at the next sync up; hopefully everyone can join for a brainstorm session. Would be great to make branches and tags first-class citizens in Iceberg!

One thing I would like to solve is how to efficiently sync branches/tags across multiple tables. The git-like model in Nessie makes this trivial, as all tables in the catalog are included in a Nessie 'commit'. I am not sure how we can do this directly in Iceberg efficiently, but I think it is important.

I've proposed an interface here #2304 which adds branch/tag support to catalogs and am currently experimenting w/ aligning Nessie commits closer to Iceberg snapshots to deal w/ the issue @rdblue described above.

YehorKrivokon · 2024-01-05T11:39:17Z

Hi all,
I'm trying to use Iceberg with Spark and SparkCatalog and I've a repro of this issue.
Is there any workaround to work with snapshots?

Thank you.

github-actions · 2024-07-04T00:12:26Z

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.

RussellSpitzer mentioned this issue May 6, 2021

Failed to snapshot iceberg table #2555

Closed

danielcweeks added this to To do in [Priority 2] Spec: Snapshot tagging and branching via automation Sep 1, 2021

github-actions bot added the stale label Jul 4, 2024
