
[FEA] Support partitioning by columns in chunked Parquet writer #7196

Closed
chinmaychandak opened this issue Jan 22, 2021 · 7 comments · Fixed by #10000
Assignees: devavret
Labels: cuIO (cuIO issue), feature request (New feature or request), libcudf (Affects libcudf (C++/CUDA) code)

Comments

@chinmaychandak
Contributor

The chunked parquet writer does a great job of creating a single large parquet file as output instead of generating lots of smaller parquet files (extremely useful in batch ETL and streaming use cases).

But it does not support partitioning by columns, or setting an upper limit on the size of the large parquet file being written to before a new one is created. It would be absolutely great to have that support, if at all possible. I also could not find the chunked parquet writer in the API docs. Any reason for that?

Also, do we have performance metrics for the accelerated parquet writers/readers (including the chunked writer)?
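
For context, a rough sketch of how we use the chunked writer today to accumulate batches into one file (the file name and data below are made up for illustration):

```python
import cudf
from cudf.io.parquet import ParquetWriter

# Illustrative batches; in practice these come from the batch ETL /
# streaming source.
batches = [
    cudf.DataFrame({"id": [1, 2], "val": [10.0, 20.0]}),
    cudf.DataFrame({"id": [3, 4], "val": [30.0, 40.0]}),
]

# Every batch is appended to one large parquet file instead of each
# batch producing its own small file.
writer = ParquetWriter("large_output.parquet")
for batch_df in batches:
    writer.write_table(batch_df)
writer.close()
```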

chinmaychandak added the Needs Triage (Need team to review and classify) and feature request labels on Jan 22, 2021
kkraus14 added the cuIO and libcudf labels and removed the Needs Triage label on Jan 27, 2021
@github-actions

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@randerzander
Contributor

Still a desired [FEA]

@devavret
Contributor

@randerzander Do you mean you want to use the chunked writer to write one column at a time?

@randerzander
Contributor

@chinmaychandak can you say more about your use case?

@chinmaychandak
Contributor Author

We use df.to_parquet(..., partition_cols=[a, b]) in custreamz (streaming); this operation creates a bunch of small parquet files per batch, and the number of files becomes really large as more and more batches get processed.

Querying such a large number of parquet files downstream is inefficient; we need to be able to aggregate the smaller parquet files (within each partition_col directory, of course) so that downstream querying becomes much more efficient.
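
For illustration, the per-batch sink roughly looks like the sketch below (root path, partition columns, and data are hypothetical):

```python
import cudf

def write_batch(batch_df: cudf.DataFrame, root_path: str = "dataset_root/") -> None:
    # Called once per streaming batch: each call writes a new small
    # parquet file under every partition directory, so the file count
    # keeps growing as more batches are processed.
    batch_df.to_parquet(root_path, partition_cols=["a", "b"])

# Example batch with the (hypothetical) partition columns `a` and `b`.
write_batch(cudf.DataFrame({"a": [1], "b": [2], "x": [3.0]}))
```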

Without partition_cols, the chunked writer already does the trick.

Another FEA that would be really useful in the chunked writer (both with and without partition_cols) is being able to specify a size limit up to which the chunked writer aggregates/overwrites smaller parquet files; once that size is reached, it should automatically start writing to a new parquet file.

Also, is there currently no documentation on the chunked writer?

Let me know if you want me to open FEAs for the above two.

@chinmaychandak
Contributor Author

BTW, Databricks Delta supports the above features, and they're super useful! It would be awesome if we could have them here.

@github-actions

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@devavret devavret self-assigned this Jul 12, 2021
rapids-bot (bot) pushed a commit that referenced this issue on Jan 14, 2022
Chunked writer (`class ParquetWriter`) now takes an argument `partition_cols`. For each call to `write_table(df)`, the `df` is partitioned and the parts are appended to the corresponding file in each partition directory of the dataset. This can be used when partitioning is desired but one wants to avoid making many small files in each subdirectory. For example, instead of repeated calls to `write_to_dataset` like so:
```python
write_to_dataset(df1, root_path, partition_cols=['group'])
write_to_dataset(df2, root_path, partition_cols=['group'])
...
```
which will yield the following structure
```
root_dir/
  group=value1/
    <uuid1>.parquet
    <uuid2>.parquet
    ...
  group=value2/
    <uuid1>.parquet
    <uuid2>.parquet
    ...
  ...
```
One can write with
```python
pw = ParquetWriter(root_path, partition_cols=['group'])
pw.write_table(df1)
pw.write_table(df2)
pw.close()
```
to get the structure
```
root_dir/
  group=value1/
    <uuid1>.parquet
  group=value2/
    <uuid1>.parquet
  ...
```

Closes #7196
Also includes workaround fixes:
fixes #9216
fixes #7011

TODO:

- [x] Tests

Authors:
  - Devavret Makkar (https://github.com/devavret)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Ashwin Srinath (https://github.com/shwina)

URL: #10000