[FEA] file reads in a background thread with flow control. #11345

Open
revans2 opened this issue Aug 16, 2024 · 0 comments
Labels
feature request (New feature or request), performance (A performance related task/issue)

Comments

Collaborator

revans2 commented Aug 16, 2024

Is your feature request related to a problem? Please describe.
This is intended to be a lot like #11344, but for parquet, orc, avro, csv, json, ... files.

This is intended to be a follow-on to #1815.

Essentially, it would be really nice to release the semaphore less often by buffering more file data. We can use a model similar to the one described for the shuffle readers in #11344. But things get complicated because parquet and orc already have ways of doing something like this: not read-ahead, but using multiple threads to read the data. Also, we do not often end up needing to pull in two batches of input data, so this is probably lower priority than the shuffle work. It is also a lot of work if we want to match what the parquet and orc readers do today. But it should be doable, and it should give us prioritization and flow control for these reads as well.
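
To make the idea concrete, here is a rough, language-agnostic sketch in Python of a background read thread with byte-based flow control. It is not the plugin's code and not the design from #11344. `ByteBudget`, `BackgroundReader`, and `read_all` are hypothetical names, the chunk size and limit are arbitrary, and prioritization between readers is left out entirely.

```python
import io
import queue
import threading

_DONE = object()  # sentinel marking end of file


class ByteBudget:
    """Hypothetical flow-control helper: caps the bytes buffered at once."""

    def __init__(self, limit):
        self._limit = limit
        self.used = 0
        self._cond = threading.Condition()

    def acquire(self, n):
        with self._cond:
            # Admit a request when nothing is in flight, so a chunk larger
            # than the limit cannot block forever.
            while self.used > 0 and self.used + n > self._limit:
                self._cond.wait()
            self.used += n

    def release(self, n):
        with self._cond:
            self.used -= n
            self._cond.notify_all()


class BackgroundReader:
    """Reads a stream on a background thread, bounded by a ByteBudget."""

    def __init__(self, stream, chunk_size, budget):
        self._stream = stream
        self._chunk_size = chunk_size
        self._budget = budget
        self._chunks = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        try:
            while True:
                # Reserve space before reading; this is the flow control.
                self._budget.acquire(self._chunk_size)
                data = b""
                try:
                    data = self._stream.read(self._chunk_size)
                finally:
                    # Return whatever part of the reservation was not filled.
                    self._budget.release(self._chunk_size - len(data))
                if not data:
                    break
                self._chunks.put(data)
            self._chunks.put(_DONE)
        except Exception as exc:  # hand read errors to the consumer
            self._chunks.put(exc)

    def __iter__(self):
        while True:
            item = self._chunks.get()
            if item is _DONE:
                return
            if isinstance(item, Exception):
                raise item
            try:
                yield item
            finally:
                # The consumer is done with this chunk; free its budget.
                self._budget.release(len(item))


def read_all(stream, chunk_size=4, limit=8):
    """Drains the stream through the background reader."""
    budget = ByteBudget(limit)
    out = bytearray()
    for chunk in BackgroundReader(stream, chunk_size, budget):
        out += chunk
    return bytes(out), budget
```

In this sketch the consuming task would only take the semaphore once it has buffered enough host data for a batch, while the budget keeps the background thread from reading arbitrarily far ahead.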

revans2 added the feature request (New feature or request), ? - Needs Triage (Need team to review and classify), and performance (A performance related task/issue) labels on Aug 16, 2024
mattahrens removed the ? - Needs Triage (Need team to review and classify) label on Sep 5, 2024
2 participants