Unable to read hive-style partitioned parquet file using `read_parquet` #10276

lmocsi · 2023-08-03T16:54:34Z

Checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of Polars.

Reproducible example

"""
I have a partitioned parquet file like this (filename is
my_table_name, and it is partitioned by calendar_date column, see attached):

my_table_name
 |
 +[calendar_date=2023-08-01 00%3A00%3A00]
 | |
 | + part-00000-b3b03a4e.c000.snappy.parquet
 | 
 +[calendar_date=2023-08-02 00%3A00%3A00]
 | |
 | + part-00000-f4b5a541.c000.snappy.parquet
 |
 +[calendar_date=2023-08-03 00%3A00%3A00]
   |
   + part-00000-6cf29fe7.c000.snappy.parquet

"""

import polars as pl

path = '/path/'  # placeholder base directory, matching the error message below
df = pl.read_parquet(path + 'my_table_name/*')  # on SO it was recommended that /* could be used

"""
It is giving me the error:
ComputeError: error while reading /path/my_table_name/CALENDAR_DATE=2023-08-01 00%3A00%3A00: External format error: File out of specification: underlying IO error: Invalid argument (os error 22)
"""
my_table_name.zip

Issue description

It seems that, as of now, Polars supports only two kinds of parquet layout:

all the data is in one parquet file
the data is split into separate files within one directory

However, true partitioned parquet datasets (where each partition has its own directory) do not seem to be supported. :(

Expected behavior

Able to read the partitioned parquet file, even as a lazy dataframe (just like Spark does)

Installed versions

--------Version info---------
Polars:              0.18.11
Index type:          UInt32
Platform:            Linux-4.18.0-372.51.1.el8_6.x86_64-x86_64-with-glibc2.28
Python:              3.9.13 (main, Oct 13 2022, 21:15:33) 
[GCC 11.2.0]

----Optional dependencies----
adbc_driver_sqlite:  <not installed>
cloudpickle:         2.0.0
connectorx:          <not installed>
deltalake:           <not installed>
fsspec:              2022.02.0
matplotlib:          3.7.2
numpy:               1.21.6
pandas:              2.0.3
pyarrow:             12.0.1
pydantic:            <not installed>
sqlalchemy:          1.4.27
xlsx2csv:            <not installed>
xlsxwriter:          3.1.2



cmdlineluser · 2023-08-03T17:17:22Z

There is scan_pyarrow_dataset()

import pyarrow.dataset as ds

pl.scan_pyarrow_dataset(ds.dataset("my_table_name")).collect()

# shape: (13, 2)
# ┌─────────┬─────────┐
# │ USER_ID ┆ TRX_CNT │
# │ ---     ┆ ---     │
# │ f64     ┆ f64     │
# ╞═════════╪═════════╡
# │ 1000.0  ┆ 434.0   │
# │ 1001.0  ┆ 11.0    │
# │ 1002.0  ┆ 3.0     │
# │ 1003.0  ┆ 555.0   │
# │ …       ┆ …       │
# │ 1001.0  ┆ 21.0    │
# │ 1003.0  ┆ 44.0    │
# │ 1005.0  ┆ 111.0   │
# │ 1008.0  ┆ 222.0   │
# └─────────┴─────────┘

alexander-beedie · 2023-08-03T17:57:29Z

> df = pl.read_parquet(path+ 'my_table_name/*')  # on SO it was recommended, that /* could be used

This is correct if all the files are in the same directory, but otherwise (as @cmdlineluser says) you need to use scan_pyarrow_dataset to read directory-nested (hive-style) partitioned parquet data.

(I've just committed a small update to the docs that adds a more explicit note to read_parquet and scan_parquet to help direct users to the right method).

stinodego · 2023-08-03T20:18:56Z

I think read_parquet should support this. If I am trying to read a parquet file, I should be using read_parquet. Even if it is partitioned.

Not sure how hard this is to implement, but it should be a goal, in my opinion.

lmocsi · 2023-08-03T21:36:11Z

Could pl.read_parquet() just call pl.scan_pyarrow_dataset() internally?
On the other hand, reading this parquet dataset should also include the partitioning column (here calendar_date)...
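
To illustrate what "including the partitioning column" involves, here is a minimal standard-library sketch of recovering partition values from a hive-style path. It is not how Polars or PyArrow implement this; the function name `parse_hive_partitions` is hypothetical, and the percent-decoding step is an assumption based on the `%3A` sequences in the directory names above.

```python
from pathlib import PurePosixPath
from urllib.parse import unquote


def parse_hive_partitions(file_path: str) -> dict[str, str]:
    """Collect key=value directory segments from a hive-style path (hypothetical helper)."""
    partitions = {}
    # Only the parent directories can be partition segments; skip the file name.
    for segment in PurePosixPath(file_path).parent.parts:
        key, sep, value = segment.partition("=")
        if sep and key:
            # Assumption: values are percent-encoded, e.g. ':' stored as %3A.
            partitions[key] = unquote(value)
    return partitions


example = (
    "my_table_name/calendar_date=2023-08-01 00%3A00%3A00/"
    "part-00000-b3b03a4e.c000.snappy.parquet"
)
print(parse_hive_partitions(example))
# {'calendar_date': '2023-08-01 00:00:00'}
```

A reader that supports hive partitioning would then attach each recovered value as a constant column on the rows read from that file.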

alexander-beedie · 2023-08-03T21:48:11Z

(Updated the title to clarify the issue more specifically, so we can reference it easily later).

lmocsi · 2023-08-04T10:25:17Z

Added my last comment as a separate issue: #10296

universalmind303 · 2023-09-08T17:28:29Z

looks like #4347 and #426 are duplicates. Since this is the most recent one, I'll keep this open & close out the other two.

ddutt · 2023-09-08T18:42:34Z

> I think read_parquet should support this. If I am trying to read a parquet file, I should be using read_parquet. Even if it is partitioned.
>
> Not sure how hard this is to implement, but it should be a goal, in my opinion.

scan_pyarrow_dataset performance isn't as good. I'm adding the hive directories as columns myself, and I have a lot of folders, and this method is still faster than using scan_pyarrow_dataset. Please add this support to scan_parquet and read_parquet.

stinodego · 2024-01-12T22:43:08Z

This should be fixed by #13044

If not, please comment and I can reopen this issue.

lmocsi · 2024-01-21T18:42:49Z

Hi,

In polars 0.20.5 this works:

pl.read_parquet('my_table_name/**/*.parquet')

but this:

pl.read_parquet('my_table_name')

throws the below error:

IsADirectoryError: expected a file path; 'my_table_name' is a directory

Can I not use a directory here? (I should be able to.)

Thanks,
Lmocsi
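
As a rough illustration of why the two glob patterns behave differently, the sketch below uses Python's own `pathlib` globbing on an empty stand-in directory tree. Polars' glob expansion is a separate implementation and may differ in detail; the simplified directory and file names here are hypothetical (the time component is omitted to keep the names portable).

```python
import tempfile
from pathlib import Path


def build_example_tree(root: Path) -> None:
    """Create an empty stand-in for the partitioned layout (hypothetical names)."""
    for day in ("2023-08-01", "2023-08-02", "2023-08-03"):
        part_dir = root / "my_table_name" / f"calendar_date={day}"
        part_dir.mkdir(parents=True)
        (part_dir / "part-00000.snappy.parquet").touch()


with tempfile.TemporaryDirectory() as tmp:
    build_example_tree(Path(tmp))
    table = Path(tmp) / "my_table_name"

    # 'my_table_name/*' matches the partition directories themselves.
    shallow = sorted(table.glob("*"))
    # 'my_table_name/**/*.parquet' recurses and matches the data files.
    deep = sorted(table.glob("**/*.parquet"))

    shallow_kinds = ["dir" if p.is_dir() else "file" for p in shallow]
    deep_kinds = ["dir" if p.is_dir() else "file" for p in deep]

print(shallow_kinds)  # ['dir', 'dir', 'dir']
print(deep_kinds)     # ['file', 'file', 'file']
```

This is consistent with the original error message, which names a partition directory as the path it failed to read, although the exact cause inside Polars is not confirmed here.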


lmocsi added the bug (Something isn't working) and python (Related to Python Polars) labels Aug 3, 2023

alexander-beedie mentioned this issue Aug 3, 2023

docs(python): make an explicit note in read_parquet and scan_parquet about hive-style partitioning (point to scan_pyarrow_dataset instead) #10277

Merged

stinodego added the accepted (Ready for implementation) label Aug 3, 2023

alexander-beedie changed the title ~~Unable to read partitioned parquet file~~ Unable to read hive-style partitioned parquet file using read_parquet Aug 3, 2023

This was referenced Sep 8, 2023

Support hive style partitioning of parquet file scans #4347

Closed

Partition aware parquet scanning #426

Closed

ion-elgreco mentioned this issue Sep 25, 2023

feat: support 'hive partitioning' aware readers #11284

Merged

grantmcdermott mentioned this issue Sep 28, 2023

Read partition columns of Hive dataset pola-rs/r-polars#404

Closed

stinodego closed this as completed Jan 12, 2024

stinodego added the A-io (Area: reading and writing data) label and removed the accepted (Ready for implementation) label Jan 21, 2024
