Panic: 'collect(streaming=True)' on 'scan_parquet' Fails for Hive-Partitioned Parquet Files in Azure Storage #18779

Closed · Fixed by #18766

CsekM8 opened this issue Sep 16, 2024 · 0 comments

Labels: accepted (Ready for implementation), bug (Something isn't working), needs triage (Awaiting prioritization by a maintainer), python (Related to Python Polars)

CsekM8 commented Sep 16, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

File structure:
/external-data/
    year=YYYY/
        month=MM/
            data-YYYY-MM.parquet

import polars as pl

# storage_options is a dict holding the Azure credentials (defined elsewhere in the original setup)
path = "az://external-data/*/*/*.parquet"
df_lazy = pl.scan_parquet(
    path,
    storage_options=storage_options,
    retries=15,
    hive_partitioning=True,
)

df = df_lazy.filter(pl.col("year").ge(2024)).collect(streaming=True)
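
For context, `storage_options` is not shown in the report. A minimal sketch of what it could look like for Azure Blob Storage follows; the key names are an assumption based on the object_store configuration that Polars forwards these options to, and the values are placeholders:

# Hypothetical credentials dict, not part of the original report.
storage_options = {
    "account_name": "<storage-account-name>",  # placeholder
    "account_key": "<storage-account-key>",    # placeholder
}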

Log output

Async thread count: 3
POLARS PREFETCH_SIZE: 24
RUN STREAMING PIPELINE
[parquet -> ordered_sink]
STREAMING CHUNK SIZE: 1351 rows
POLARS ROW_GROUP PREFETCH_SIZE: 128
concurrency tuner finished after adding 1 steps
thread 'polars-0' panicked at crates\polars-io\src\parquet\read\mmap.rs:54:17:
mmap_columns: column with start 59884346 must be prefetched in ColumnStore.

note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread 'polars-5' panicked at crates\polars-io\src\parquet\read\mmap.rs:54:17:
mmap_columns: column with start 4 must be prefetched in ColumnStore.

thread 'polars-6' panicked at crates\polars-io\src\parquet\read\mmap.rs:54:17:
mmap_columns: column with start 91605741 must be prefetched in ColumnStore.

thread 'polars-8' panicked at crates\polars-io\src\parquet\read\mmap.rs:54:17:
mmap_columns: column with start 29403039 must be prefetched in ColumnStore.

thread 'polars-10' panicked at crates\polars-io\src\parquet\read\mmap.rs:54:17:
mmap_columns: column with start 9444021 must be prefetched in ColumnStore.

thread 'polars-9' panicked at crates\polars-io\src\parquet\read\mmap.rs:54:17:
mmap_columns: column with start 70480370 must be prefetched in ColumnStore.

thread 'polars-11' panicked at crates\polars-io\src\parquet\read\mmap.rs:54:17:
mmap_columns: column with start 102423841 must be prefetched in ColumnStore.

thread 'polars-1' panicked at crates\polars-io\src\parquet\read\mmap.rs:54:17:
mmap_columns: column with start 39523238 must be prefetched in ColumnStore.

thread 'polars-2' panicked at crates\polars-io\src\parquet\read\mmap.rs:54:17:
mmap_columns: column with start 80970486 must be prefetched in ColumnStore.

thread 'polars-3' panicked at crates\polars-io\src\parquet\read\mmap.rs:54:17:
mmap_columns: column with start 19411300 must be prefetched in ColumnStore.

thread 'polars-7' panicked at crates\polars-io\src\parquet\read\mmap.rs:54:17:
mmap_columns: column with start 113150080 must be prefetched in ColumnStore.

thread 'polars-4' panicked at crates\polars-io\src\parquet\read\mmap.rs:54:17:
mmap_columns: column with start 49618693 must be prefetched in ColumnStore.

Traceback (most recent call last):
  File "...\tmp.py", line 9, in <module>
    df = df_lazy.filter(pl.col("year").ge(2024)).collect(streaming=True)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "...\site-packages\polars\lazyframe\frame.py", line 2032, in collect
    return wrap_df(ldf.collect(callback))
                   ^^^^^^^^^^^^^^^^^^^^^
pyo3_runtime.PanicException: mmap_columns: column with start 4 must be prefetched in ColumnStore.

Issue description

Additional info based on testing:

  • No issue occurs when the same dataset is read locally.
  • It is likely related to the streaming engine, as the panic doesn't occur when streaming is turned off and the query is run on a subset that fits into memory (see the workaround sketch below).

Regarding tested Polars versions:

  • Log output is from using 1.7.1
  • The issue doesn't occur on version 1.4.1 and below.
  • On 1.5.0 the same code raises a different panic:
    thread 'polars-3' panicked at crates\polars-parquet\src\parquet\encoding\bitpacked\decode.rs:41:49:
    called Result::unwrap() on an Err value: OutOfSpec("Bitpacking requires num_bits > 0")

The only somewhat similar issues I could find are #12635 and #13162, but unlike what was suggested there, this issue doesn't seem to be affected by the number of selected columns or threads.
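
For reference, a minimal workaround sketch based on the observation above (not part of the original report): disabling the streaming engine avoids the panic, at the cost of the filtered result having to fit in memory.

# Workaround sketch (assumption based on the testing notes above):
# the non-streaming engine does not hit the mmap_columns panic.
df = df_lazy.filter(pl.col("year").ge(2024)).collect(streaming=False)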

Expected behavior

The DataFrame is collected without a panic.

Installed versions

--------Version info---------
Polars:              1.7.1
Index type:          UInt32
Platform:            Windows-10-10.0.19045-SP0
Python:              3.11.10 | packaged by conda-forge | (main, Sep 10 2024, 10:53:25) [MSC v.1940 64 bit (AMD64)]

----Optional dependencies----
adbc_driver_manager  <not installed>
altair               <not installed>
cloudpickle          2.1.0
connectorx           <not installed>
deltalake            0.19.1
fastexcel            <not installed>
fsspec               2023.12.2
gevent               <not installed>
great_tables         <not installed>
matplotlib           3.6.3
nest_asyncio         1.6.0
numpy                1.23.5
openpyxl             3.0.7
pandas               1.5.3
pyarrow              16.1.0
pydantic             1.10.15
pyiceberg            <not installed>
sqlalchemy           1.4.52
torch                <not installed>
xlsx2csv             <not installed>
xlsxwriter           1.4.5
CsekM8 added the bug, needs triage, and python labels on Sep 16, 2024
CsekM8 changed the title from "Panick: 'collect(streaming=True)' on 'scan_parquet' Fails for Hive-Partitioned Parquet Files in Azure Storage" to "Panic: 'collect(streaming=True)' on 'scan_parquet' Fails for Hive-Partitioned Parquet Files in Azure Storage" on Sep 16, 2024
c-peters added the accepted (Ready for implementation) label on Sep 23, 2024