Panic: 'collect(streaming=True)' on 'scan_parquet' Fails for Hive-Partitioned Parquet Files in Azure Storage #18779

Closed · Fixed by #18766

CsekM8 opened this issue Sep 16, 2024 · 0 comments

Labels: accepted (Ready for implementation), bug (Something isn't working), needs triage (Awaiting prioritization by a maintainer), python (Related to Python Polars)

CsekM8 commented Sep 16, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

File structure:
/external-data/
    year=YYYY/
        month=MM/
            data-YYYY-MM.parquet

import polars as pl

# storage_options is a dict holding the Azure credentials (defined elsewhere in the original setup)
path = "az://external-data/*/*/*.parquet"
df_lazy = pl.scan_parquet(
    path,
    storage_options=storage_options,
    retries=15,
    hive_partitioning=True,
)

df = df_lazy.filter(pl.col("year").ge(2024)).collect(streaming=True)
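
For context, `storage_options` is not shown in the report. A minimal sketch of what it could look like for Azure Blob Storage follows; the key names are an assumption based on the object_store configuration that Polars forwards these options to, and the values are placeholders:

# Hypothetical credentials dict, not part of the original report.
storage_options = {
    "account_name": "<storage-account-name>",  # placeholder
    "account_key": "<storage-account-key>",    # placeholder
}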

Log output

Async thread count: 3
POLARS PREFETCH_SIZE: 24
RUN STREAMING PIPELINE
[parquet -> ordered_sink]
STREAMING CHUNK SIZE: 1351 rows
POLARS ROW_GROUP PREFETCH_SIZE: 128
concurrency tuner finished after adding 1 steps
thread 'polars-0' panicked at crates\polars-io\src\parquet\read\mmap.rs:54:17:
mmap_columns: column with start 59884346 must be prefetched in ColumnStore.

note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread 'polars-5' panicked at crates\polars-io\src\parquet\read\mmap.rs:54:17:
mmap_columns: column with start 4 must be prefetched in ColumnStore.

thread 'polars-6' panicked at crates\polars-io\src\parquet\read\mmap.rs:54:17:
mmap_columns: column with start 91605741 must be prefetched in ColumnStore.

thread 'polars-8' panicked at crates\polars-io\src\parquet\read\mmap.rs:54:17:
mmap_columns: column with start 29403039 must be prefetched in ColumnStore.

thread 'polars-10' panicked at crates\polars-io\src\parquet\read\mmap.rs:54:17:
mmap_columns: column with start 9444021 must be prefetched in ColumnStore.

thread 'polars-9' panicked at crates\polars-io\src\parquet\read\mmap.rs:54:17:
mmap_columns: column with start 70480370 must be prefetched in ColumnStore.

thread 'polars-11' panicked at crates\polars-io\src\parquet\read\mmap.rs:54:17:
mmap_columns: column with start 102423841 must be prefetched in ColumnStore.

thread 'polars-1' panicked at crates\polars-io\src\parquet\read\mmap.rs:54:17:
mmap_columns: column with start 39523238 must be prefetched in ColumnStore.

thread 'polars-2' panicked at crates\polars-io\src\parquet\read\mmap.rs:54:17:
mmap_columns: column with start 80970486 must be prefetched in ColumnStore.

thread 'polars-3' panicked at crates\polars-io\src\parquet\read\mmap.rs:54:17:
mmap_columns: column with start 19411300 must be prefetched in ColumnStore.

thread 'polars-7' panicked at crates\polars-io\src\parquet\read\mmap.rs:54:17:
mmap_columns: column with start 113150080 must be prefetched in ColumnStore.

thread 'polars-4' panicked at crates\polars-io\src\parquet\read\mmap.rs:54:17:
mmap_columns: column with start 49618693 must be prefetched in ColumnStore.

Traceback (most recent call last):
  File "...\tmp.py", line 9, in <module>
    df = df_lazy.filter(pl.col("year").ge(2024)).collect(streaming=True)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "...\site-packages\polars\lazyframe\frame.py", line 2032, in collect
    return wrap_df(ldf.collect(callback))
                   ^^^^^^^^^^^^^^^^^^^^^
pyo3_runtime.PanicException: mmap_columns: column with start 4 must be prefetched in ColumnStore.

Issue description

Additional info based on testing:

  • No issue occurs when the same dataset is read locally.
  • It is likely related to the streaming engine, as the panic doesn't occur when streaming is turned off and the query is run on a subset that fits into memory (see the workaround sketch below).

Regarding tested Polars versions:

  • Log output is from using 1.7.1
  • The issue doesn't occur on version 1.4.1 and below.
  • On 1.5.0 the same code raises a different panic:
    thread 'polars-3' panicked at crates\polars-parquet\src\parquet\encoding\bitpacked\decode.rs:41:49:
    called Result::unwrap() on an Err value: OutOfSpec("Bitpacking requires num_bits > 0")

The only somewhat similar issues I could find are #12635 and #13162, but unlike what was suggested there, this issue doesn't seem to be affected by the number of selected columns or threads.
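
For reference, a minimal workaround sketch based on the observation above (not part of the original report): disabling the streaming engine avoids the panic, at the cost of the filtered result having to fit in memory.

# Workaround sketch (assumption based on the testing notes above):
# the non-streaming engine does not hit the mmap_columns panic.
df = df_lazy.filter(pl.col("year").ge(2024)).collect(streaming=False)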

Expected behavior

The DataFrame is collected without a panic.

Installed versions

--------Version info---------
Polars:              1.7.1
Index type:          UInt32
Platform:            Windows-10-10.0.19045-SP0
Python:              3.11.10 | packaged by conda-forge | (main, Sep 10 2024, 10:53:25) [MSC v.1940 64 bit (AMD64)]

----Optional dependencies----
adbc_driver_manager  <not installed>
altair               <not installed>
cloudpickle          2.1.0
connectorx           <not installed>
deltalake            0.19.1
fastexcel            <not installed>
fsspec               2023.12.2
gevent               <not installed>
great_tables         <not installed>
matplotlib           3.6.3
nest_asyncio         1.6.0
numpy                1.23.5
openpyxl             3.0.7
pandas               1.5.3
pyarrow              16.1.0
pydantic             1.10.15
pyiceberg            <not installed>
sqlalchemy           1.4.52
torch                <not installed>
xlsx2csv             <not installed>
xlsxwriter           1.4.5
CsekM8 added the bug, needs triage, and python labels on Sep 16, 2024
CsekM8 changed the title from "Panick: 'collect(streaming=True)' on 'scan_parquet' Fails for Hive-Partitioned Parquet Files in Azure Storage" to "Panic: 'collect(streaming=True)' on 'scan_parquet' Fails for Hive-Partitioned Parquet Files in Azure Storage" on Sep 16, 2024
c-peters added the accepted (Ready for implementation) label on Sep 23, 2024