[FEA] Better handling of reading lots of small Parquet files #333

Closed
tgravescs opened this issue Jul 8, 2020 · 2 comments · Fixed by #595
Labels
feature request New feature or request performance A performance related task/issue

Comments

@tgravescs
Collaborator

Currently the readers (Parquet, for example) perform poorly when there are many small files and much better when there is a small number of large files. We should improve the performance of reading small files.

@tgravescs tgravescs added feature request New feature or request ? - Needs Triage Need team to review and classify labels Jul 8, 2020
@tgravescs
Collaborator Author

Note that one complication is that Spark lets a query ask for the name of the file currently being read (for example, via `input_file_name()`). Perhaps we can recognize when that feature is being used.
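
To make the trade-off concrete, here is a minimal sketch of one way a reader could plan its work: pack small files into larger read batches, but fall back to one file per batch when the query needs the source file name. This is not the plugin's implementation. `FileInfo`, `plan_read_batches`, `target_batch_bytes`, and `needs_file_name` are hypothetical names used only for illustration, and the greedy packing strategy is an assumption.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class FileInfo:
    """Hypothetical description of one input file."""
    path: str
    size_bytes: int


def plan_read_batches(
    files: Iterable[FileInfo],
    target_batch_bytes: int,
    needs_file_name: bool,
) -> List[List[FileInfo]]:
    """Group files into read batches (illustrative only).

    Assumption: if the query asks for the source file name, files are
    not combined, so each row can still be attributed to a single file.
    """
    if needs_file_name:
        # One file per batch: no combining when the file name is requested.
        return [[f] for f in files]

    batches: List[List[FileInfo]] = []
    current: List[FileInfo] = []
    current_bytes = 0
    for f in files:
        # Close the current batch if adding this file would exceed the target.
        if current and current_bytes + f.size_bytes > target_batch_bytes:
            batches.append(current)
            current, current_bytes = [], 0
        current.append(f)
        current_bytes += f.size_bytes
    if current:
        batches.append(current)
    return batches


# Usage: five 10-byte files with a 25-byte target.
small_files = [FileInfo(f"part-{i}.parquet", 10) for i in range(5)]
combined = plan_read_batches(small_files, 25, needs_file_name=False)
per_file = plan_read_batches(small_files, 25, needs_file_name=True)
```

A file larger than the target still gets a batch of its own, since a batch is only closed when it already holds at least one file.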

@tgravescs tgravescs changed the title [FEA] Better handling of lots of small files [FEA] Better handling of reading lots of small files Jul 8, 2020
@sameerz sameerz added the performance A performance related task/issue label Jul 8, 2020
@tgravescs tgravescs self-assigned this Jul 27, 2020
@tgravescs tgravescs added this to the Aug 3 - Aug 14 milestone Aug 4, 2020
@sameerz sameerz removed the ? - Needs Triage Need team to review and classify label Aug 4, 2020
@tgravescs
Collaborator Author

Note: we need separate issues for each format we want to handle (Parquet, ORC, etc.). I am concentrating on Parquet first.

@tgravescs tgravescs changed the title [FEA] Better handling of reading lots of small files [FEA] Better handling of reading lots of small Parquet files Aug 20, 2020

2 participants