Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Core: Defer reading Avro metadata until ManifestFile is read #5206

Merged
merged 1 commit into from
Jul 6, 2022

Conversation

rdblue
Copy link
Contributor

@rdblue rdblue commented Jul 5, 2022

This improves job planning performance by moving ManifestFiles.read setup into the ParallelIterator that is used to plan tasks. ParallelIterator accepts an Iterable of CloseableIterable. The outer iterable is iterated over to submit tasks that run in the worker pool. In ManifestGroup, the Iterable that was returned would call ManifestFiles.read to prepare the inner iterable, but the ManifestReader needs to read Avro file metadata and will open a stream. That initial file open was running in the consuming thread as tasks were submitted, instead of in the worker pool.

This updates ManifestGroup to use a custom Iterable that defers calling ManifestFiles.read until the inner iterable is used.

@lirui-apache
Copy link
Contributor

We encountered authorization failures reading manifest files after applied this PR, and thought it might be related. Since the worker pool in use is by default a global static pool, the threads in this pool may not be able to see the correct UGI submitting the operation. And hit permission denied errors when trying to open the stream. I wonder whether we should call doAs in the pool, or whether we should let different users use separate pools?

sunchao pushed a commit to sunchao/iceberg that referenced this pull request May 9, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants