
[FEA] Support partitioning by columns in chunked Parquet writer #7196

Closed
chinmaychandak opened this issue Jan 22, 2021 · 7 comments · Fixed by #10000
Assignees: devavret
Labels: cuIO (cuIO issue), feature request (New feature or request), libcudf (Affects libcudf (C++/CUDA) code)

Comments

@chinmaychandak
Contributor

The chunked parquet writer does a great job of creating a single large parquet file as output instead of generating lots of smaller parquet files (extremely useful in batch ETL and streaming use cases).

But it does not support partitioning by columns, or setting an upper limit on the size of the large parquet file being written to before a new one is created. It would be absolutely great to have that support, if at all possible. I also could not find the chunked parquet writer in the API docs. Any reason for that?

Also, do we have performance metrics for the accelerated parquet writers/readers (including the chunked writer)?
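
For context, a rough sketch of how we use the chunked writer today to accumulate batches into one file (the file name and data below are made up for illustration):

```python
import cudf
from cudf.io.parquet import ParquetWriter

# Illustrative batches; in practice these come from the batch ETL /
# streaming source.
batches = [
    cudf.DataFrame({"id": [1, 2], "val": [10.0, 20.0]}),
    cudf.DataFrame({"id": [3, 4], "val": [30.0, 40.0]}),
]

# Every batch is appended to one large parquet file instead of each
# batch producing its own small file.
writer = ParquetWriter("large_output.parquet")
for batch_df in batches:
    writer.write_table(batch_df)
writer.close()
```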

chinmaychandak added the Needs Triage (Need team to review and classify) and feature request labels on Jan 22, 2021
kkraus14 added the cuIO and libcudf labels and removed the Needs Triage label on Jan 27, 2021
@github-actions

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@randerzander
Contributor

Still a desired [FEA]

@devavret
Contributor

@randerzander Do you mean you want to use the chunked writer to write one column at a time?

@randerzander
Contributor

@chinmaychandak can you say more about your use case?

@chinmaychandak
Contributor Author

We use df.to_parquet(..., partition_cols=[a, b]) in custreamz (streaming); this operation creates a bunch of small parquet files per batch, and the number of files becomes really large as more and more batches get processed.

Querying such a large number of parquet files downstream is inefficient; we need to be able to aggregate the smaller parquet files (within each partition_col directory, of course) so that downstream querying becomes much more efficient.
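
For illustration, the per-batch sink roughly looks like the sketch below (root path, partition columns, and data are hypothetical):

```python
import cudf

def write_batch(batch_df: cudf.DataFrame, root_path: str = "dataset_root/") -> None:
    # Called once per streaming batch: each call writes a new small
    # parquet file under every partition directory, so the file count
    # keeps growing as more batches are processed.
    batch_df.to_parquet(root_path, partition_cols=["a", "b"])

# Example batch with the (hypothetical) partition columns `a` and `b`.
write_batch(cudf.DataFrame({"a": [1], "b": [2], "x": [3.0]}))
```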

Without partition_cols, the chunked writer already does the trick.

Another FEA that would be really useful in the chunked writer (both with and without partition_cols) is being able to specify a size limit up to which the chunked writer aggregates/overwrites smaller parquet files; once that size is reached, it should automatically start writing to a new parquet file.

Also, is there currently no documentation on the chunked writer?

Let me know if you want me to open FEAs for the above two.

@chinmaychandak
Contributor Author

BTW, Databricks Delta supports the above features, and they're super useful! It would be awesome if we could have them here.

@github-actions

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@devavret devavret self-assigned this Jul 12, 2021
rapids-bot (bot) pushed a commit that referenced this issue on Jan 14, 2022
Chunked writer (`class ParquetWriter`) now takes an argument `partition_cols`. For each call to `write_table(df)`, the `df` is partitioned and the parts are appended to the corresponding file in each partition directory of the dataset. This can be used when partitioning is desired but one wants to avoid making many small files in each subdirectory. For example, instead of repeated calls to `write_to_dataset` like so:
```python
write_to_dataset(df1, root_path, partition_cols=['group'])
write_to_dataset(df2, root_path, partition_cols=['group'])
...
```
which will yield the following structure
```
root_dir/
  group=value1/
    <uuid1>.parquet
    <uuid2>.parquet
    ...
  group=value2/
    <uuid1>.parquet
    <uuid2>.parquet
    ...
  ...
```
One can write with
```python
pw = ParquetWriter(root_path, partition_cols=['group'])
pw.write_table(df1)
pw.write_table(df2)
pw.close()
```
to get the structure
```
root_dir/
  group=value1/
    <uuid1>.parquet
  group=value2/
    <uuid1>.parquet
  ...
```

Closes #7196
Also includes workaround fixes:
fixes #9216
fixes #7011

TODO:

- [x] Tests

Authors:
  - Devavret Makkar (https://github.com/devavret)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Ashwin Srinath (https://github.com/shwina)

URL: #10000