Hive: Cache deletes on executors #10666
base: main
Conversation
```diff
@@ -335,7 +373,13 @@ private CloseableIterable<T> open(FileScanTask currentTask, Schema readSchema) {
       case HIVE:
         return openTask(currentTask, readSchema);
       case GENERIC:
-        DeleteFilter deletes = new GenericDeleteFilter(io, currentTask, tableSchema, readSchema);
+        DeleteFilter deletes =
```
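As context for the discussion below, the general pattern at stake here (build the delete data once per key on an executor, then reuse it) can be sketched with a plain `ConcurrentHashMap` standing in for Hive's `ObjectCache`. All names in this sketch are illustrative, not the PR's actual code:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical stand-in for a per-executor object cache: the first caller
// for a key pays the cost of loading the delete data; later callers on the
// same executor (e.g. a reused LLAP daemon) get the cached result.
public class DeleteCacheSketch {
  private static final ConcurrentHashMap<String, Object> CACHE = new ConcurrentHashMap<>();
  static final AtomicInteger LOADS = new AtomicInteger();

  @SuppressWarnings("unchecked")
  static <T> T retrieve(String key, Supplier<T> loader) {
    // computeIfAbsent runs the loader at most once per key
    return (T) CACHE.computeIfAbsent(key, k -> loader.get());
  }

  public static void main(String[] args) {
    Supplier<String> loadDeletes = () -> {
      LOADS.incrementAndGet(); // the expensive part: reading delete files
      return "deletes-for-file-A";
    };
    String first = retrieve("file-A", loadDeletes);
    String second = retrieve("file-A", loadDeletes); // cache hit, loader not re-run
    System.out.println(first.equals(second) && LOADS.get() == 1); // prints "true"
  }
}
```

The sketch only shows why the cache pays off when the same executor process sees the same delete set more than once, which is the question the reviewers work through next.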
In which case can the cache be used to speed up task execution? I mean that one Tez task gets the deletes only once, so when would the cached deletes be used again?
But I noticed that the ObjectCache can be useful when a container is reused (see https://issues.apache.org/jira/browse/HIVE-5151), so if the Tez task for Iceberg is reused via `iceberg.mr.reuse.containers`, can the cache work well?
- a Tez task could process multiple files in a split (`tasks`), see `IcebergRecordReader#nextTask`
- the main use case is LLAP daemons, which could reuse the object cache
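The first bullet can be illustrated schematically: one split carries several file scan tasks, and a reader loop would hit the cache once per distinct delete set. `IcebergRecordReader#nextTask` is the real method being referenced; everything else below is a hypothetical sketch, not the connector's code:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: a split containing several file tasks, where each
// task looks up its delete filter in a shared per-executor cache.
public class SplitReaderSketch {
  static final Map<String, Object> EXECUTOR_CACHE = new HashMap<>();
  static int loads = 0;

  static Object deletesFor(String deleteKey) {
    return EXECUTOR_CACHE.computeIfAbsent(deleteKey, k -> {
      loads++; // in the real connector this is where delete files would be read
      return "filter:" + k;
    });
  }

  public static void main(String[] args) {
    // three data files in one split, two of them sharing the same delete set
    List<String> taskDeleteKeys = List.of("deletes-1", "deletes-2", "deletes-1");
    for (String key : taskDeleteKeys) {
      deletesFor(key); // analogous to opening the next task via nextTask()
    }
    System.out.println(loads); // prints 2: the shared delete set is loaded once
  }
}
```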
`iceberg.mr.reuse.containers` is a different thing; it has nothing to do with executors.
I see. This is a good change for a long-running execution service like LLAP.
Thanks for the explanation!
BTW, the `GENERIC` type of `InMemoryDataModel` is mainly for non-vectorized execution. Does this mean the vectorized-execution `HIVE` type cannot use the cache?
Vectorization support for deletes is not present in upstream Iceberg (only in upstream Hive). We could raise a PR for that if we find someone who can help with review and commit, but based on this discussion https://lists.apache.org/thread/dy391h0xtnqol0fs3rxpz769446tqbrk it looks like a lost cause.
@pvary, do you have any insight into the Hive connector's future? :)
The community plans to upgrade the Java version, which is reasonable at this time; Java 8 support is long gone.
If Hive plans to support Iceberg 1.7 or later, then Hive needs to upgrade its Java version.
If there is a Hive version that doesn't block the Java upgrade, then we could suggest upgrading Hive instead of dropping the support. I'm not sure the community will support the notion, but I would prefer to keep a few connectors inside the Iceberg repo to ensure long-term compatibility.
Cache delete files on Hive executors using ObjectCache