[FEA] Better handling of reading lots of small Parquet files #333

tgravescs · 2020-07-08T19:10:11Z

Currently the readers (Parquet, for example) perform poorly when there are many small files and much better with a small number of large files. We should improve the performance of reading small files.

tgravescs · 2020-07-08T19:10:35Z

Note that one complication is that Spark has a feature that lets a query ask for the name of the file currently being read (for example, `input_file_name()`). Perhaps we can recognize when that feature is being used.
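To picture the trade-off raised above, here is a rough sketch in plain Python. It is not the plugin's implementation, and this issue does not specify an approach. It assumes one possible strategy: pack small files into larger read batches up to a size target, and fall back to one file per batch when the query needs the per-file name. `FileInfo`, `plan_read_batches`, `target_bytes`, and `needs_file_name` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FileInfo:
    """Hypothetical description of one input file."""
    path: str
    size_bytes: int


def plan_read_batches(
    files: List[FileInfo],
    target_bytes: int,
    needs_file_name: bool,
) -> List[List[FileInfo]]:
    """Group files into batches that would be read together.

    Assumption: if the query asks for the name of the file being read,
    rows must stay attributable to a single file, so each file gets its
    own batch. Otherwise files are packed greedily, in order, until
    adding the next file would exceed target_bytes.
    """
    if needs_file_name:
        return [[f] for f in files]

    batches: List[List[FileInfo]] = []
    current: List[FileInfo] = []
    current_bytes = 0
    for f in files:
        # Close the current batch if this file would push it past the target.
        if current and current_bytes + f.size_bytes > target_bytes:
            batches.append(current)
            current, current_bytes = [], 0
        current.append(f)
        current_bytes += f.size_bytes
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    small_files = [FileInfo(f"part-{i}.parquet", 10) for i in range(5)]
    for batch in plan_read_batches(small_files, target_bytes=25, needs_file_name=False):
        print([f.path for f in batch])
```

A file larger than `target_bytes` still ends up in a batch of its own, so the sketch never drops input; it only changes how many separate reads are issued.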

tgravescs · 2020-08-11T19:07:55Z

Note: we need separate issues for Parquet, ORC, etc., one for each format we want to handle. I am concentrating on Parquet first.

tgravescs added the feature request and ? - Needs Triage labels Jul 8, 2020

tgravescs changed the title ~~[FEA] Better handling of lots of small files~~ [FEA] Better handling of reading lots of small files Jul 8, 2020

sameerz added the performance label Jul 8, 2020

tgravescs self-assigned this Jul 27, 2020

tgravescs added this to the Aug 3 - Aug 14 milestone Aug 4, 2020

sameerz removed the ? - Needs Triage label Aug 4, 2020

sameerz modified the milestones: Aug 3 - Aug 14, Aug 17 - Aug 28 Aug 18, 2020

tgravescs changed the title ~~[FEA] Better handling of reading lots of small files~~ [FEA] Better handling of reading lots of small Parquet files Aug 20, 2020

tgravescs mentioned this issue Aug 20, 2020

Parquet small file reading optimization #595

Merged

tgravescs closed this as completed in #595 Aug 26, 2020
