feat: support 'hive partitioning' aware readers #11284

Merged
merged 14 commits into from
Sep 26, 2023

Conversation

@ritchie46 (Member) commented Sep 24, 2023

closes #10980, #10276

@alexander-beedie (Collaborator) commented:

Nice! Hive partitioning support is super-useful 👍

@ritchie46 (Member, Author) replied:

> Nice! Hive partitioning support is super-useful 👍

Yes, and the savings are potentially huge!
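(Context for readers: the savings come from partition pruning — a predicate on a hive-partitioned column can skip whole files based on their paths alone, so they are never read. A toy illustration in plain Python; the helper name and data structures are hypothetical, not Polars internals:)

```python
def prune_files(files: dict[str, dict[str, str]], column: str, value: str) -> list[str]:
    """Keep only files whose hive partition value satisfies an equality
    predicate; the remaining files are never opened or read."""
    return [path for path, parts in sorted(files.items()) if parts.get(column) == value]


files = {
    "data/part_id_1=1/test_0.parquet": {"part_id_1": "1"},
    "data/part_id_1=2/test_0.parquet": {"part_id_1": "2"},
}
print(prune_files(files, "part_id_1", "1"))
# → ['data/part_id_1=1/test_0.parquet']
```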

@ritchie46 added the `highlight` label (Highlight this PR in the changelog) Sep 25, 2023
@ritchie46 changed the title from "WIP: support 'hive partitioning' aware readers" to "feat: support 'hive partitioning' aware readers" Sep 25, 2023
@github-actions bot added the `enhancement` (New feature or an improvement of an existing feature), `python` (Related to Python Polars), and `rust` (Related to Rust Polars) labels Sep 25, 2023
@ion-elgreco (Contributor) commented:

This also closes #10276.

@ritchie46 ritchie46 merged commit 27e32dc into main Sep 26, 2023
25 checks passed
@ritchie46 ritchie46 deleted the hive branch September 26, 2023 05:45
romanovacca pushed a commit to romanovacca/polars that referenced this pull request Oct 1, 2023
@uditrana commented Nov 1, 2023

I am noticing an interesting pattern/bug with this feature. I wrote a large dataset out using:

```python
df.write_parquet(
    file=DIR,
    use_pyarrow=True,
    pyarrow_options={
        "partition_cols": ["part_id_1", "part_id_2"],
        "basename_template": "test_{i}.parquet",
        "existing_data_behavior": "overwrite_or_ignore",
    },
)
```

When I read these files back in using pl.read_parquet(), I observe two different results:

```python
df1 = pl.read_parquet(DIR / "part_id_1=1" / "part_id_2=A" / "test_0.parquet")
df2 = pl.read_parquet(DIR / "part_id_1=1" / "part_id_2=A" / "*test_0.parquet")
```

In df1, part_id_1 and part_id_2 are missing from the columns, while they are present in df2. It seems that read_parquet does not look for the hive-partitioning structure when passed a single file path, whereas a glob pattern triggers it — even when, as in the second case, the pattern matches only one file.
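(Context for readers: hive partitioning encodes column values in `key=value` directory names, and a hive-aware reader reconstructs those columns from the file path. A minimal sketch of that parsing in plain Python — the function name and logic are illustrative, not Polars internals:)

```python
from pathlib import PurePosixPath


def parse_hive_partitions(path: str) -> dict[str, str]:
    """Extract key=value partition segments from a file path (illustrative)."""
    parts: dict[str, str] = {}
    for segment in PurePosixPath(path).parts:
        if "=" in segment:
            key, _, value = segment.partition("=")
            parts[key] = value
    return parts


print(parse_hive_partitions("data/part_id_1=1/part_id_2=A/test_0.parquet"))
# → {'part_id_1': '1', 'part_id_2': 'A'}
```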

@uditrana commented Nov 1, 2023

Also, slightly tangentially: could pl.read_parquet expose an option to turn off schema extension for hive partition columns? I have a use case where I run into this error while reading a dataset:

```
ComputeError: invalid hive partitions

Extending the schema with the hive partitioned columns creates duplicate fields.
```

This dataset was written so that the partition columns are kept inside each partitioned file, allowing other libraries (like Pandas) to read the files individually and concatenate them serially (since they don't support hive partitioning natively yet).
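(Context for readers: one way to resolve the duplicate-fields conflict described above is to prefer the columns already stored in the file and only add the hive-derived partition columns that are missing, instead of raising. A sketch in plain Python — the function name and behavior are hypothetical, not how Polars resolves it:)

```python
def merge_hive_columns(file_columns: list[str], hive_columns: list[str]) -> list[str]:
    """Illustrative merge: keep the file's own columns and append only the
    hive-derived partition columns that are not already present, avoiding
    the 'duplicate fields' conflict."""
    merged = list(file_columns)
    for col in hive_columns:
        if col not in merged:
            merged.append(col)
    return merged


print(merge_hive_columns(["a", "b", "part_id_1"], ["part_id_1", "part_id_2"]))
# → ['a', 'b', 'part_id_1', 'part_id_2']
```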

Linked issue (may be closed by this PR): "Use with_columns addition to a lazy frame with filter to prevent physically reading files"