Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hive: Cache deletes on executors #10666

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open

Conversation

deniskuzZ
Copy link
Member

@deniskuzZ deniskuzZ commented Jul 9, 2024

Cache delete files on Hive executors using ObjectCache

@github-actions github-actions bot added the MR label Jul 9, 2024
@@ -335,7 +373,13 @@ private CloseableIterable<T> open(FileScanTask currentTask, Schema readSchema) {
case HIVE:
return openTask(currentTask, readSchema);
case GENERIC:
DeleteFilter deletes = new GenericDeleteFilter(io, currentTask, tableSchema, readSchema);
DeleteFilter deletes =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In which case can the cache be used to speed up task execution?

I mean that one tez task gets the deletes only once, so when is the next time the cache deletes would be used?

But I noticed that the ObjectCache can be useful when container is reused, see https://issues.apache.org/jira/browse/HIVE-5151, so if tez task for iceberg is reused iceberg.mr.reuse.containers, can the cache work well?

Copy link
Member Author

@deniskuzZ deniskuzZ Jul 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. tez task could process multiple files in a split (tasks), see IcebergRecordReader#nextTask
  2. main use-case is LLAP daemons, that could reuse the object cache

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

iceberg.mr.reuse.containers is a different thing, has nothing to do with executors

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. This is a good change for the long running execution service like LLAP.
Thanks for explanation!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, The GENERIC Type of InMemoryDataModel is main for non-vectorized execution. Does this mean vectorized execution HIVE Type can not use the cache?

Copy link
Member Author

@deniskuzZ deniskuzZ Jul 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Vectorization support for deletes is not present in the upstream iceberg (only upstream Hive), we could raise PR for that if we find someone who can help with review and commit, but based on this discussion https://lists.apache.org/thread/dy391h0xtnqol0fs3rxpz769446tqbrk it looks like a lost cause.
@pvary, do you have any insight on Hive connector's future :)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The community plans to upgrade the Java version, which is reasonable at this time. Java 8 support is long gone.

If Hive plans to support Iceberg 1.7 or later, then Hive needs to upgrade the Java version.

If there is a Hive version which doesn't block the Java upgrade, then we could suggest to upgrade Hive instead of dropping the support. I'm not sure that the community will support the notion, but I would prefer to keep a few connectors inside the Iceberg repo to ensure long-term compatibility.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants