Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Throw bigidx error for Parquet row-count #18154

Merged
merged 3 commits into from
Aug 14, 2024

Conversation

coastalwhite
Copy link
Collaborator

Fixes #18149

@github-actions github-actions bot added fix Bug fix python Related to Python Polars rust Related to Rust Polars labels Aug 13, 2024
Copy link

codecov bot commented Aug 13, 2024

Codecov Report

Attention: Patch coverage is 81.90476% with 19 lines in your changes missing coverage. Please review.

Project coverage is 80.29%. Comparing base (4233d61) to head (742fad0).
Report is 18 commits behind head on main.

Files Patch % Lines
crates/polars-core/src/chunked_array/ops/extend.rs 72.72% 6 Missing ⚠️
crates/polars-io/src/parquet/read/read_impl.rs 71.42% 4 Missing ⚠️
crates/polars-error/src/lib.rs 0.00% 3 Missing ⚠️
...s/polars-core/src/series/implementations/binary.rs 50.00% 1 Missing ⚠️
...s-core/src/series/implementations/binary_offset.rs 50.00% 1 Missing ⚠️
...tes/polars-core/src/series/implementations/date.rs 50.00% 1 Missing ⚠️
...polars-core/src/series/implementations/duration.rs 50.00% 1 Missing ⚠️
...s/polars-core/src/series/implementations/floats.rs 50.00% 1 Missing ⚠️
...tes/polars-core/src/series/implementations/time.rs 50.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #18154      +/-   ##
==========================================
- Coverage   80.32%   80.29%   -0.04%     
==========================================
  Files        1496     1498       +2     
  Lines      197751   198688     +937     
  Branches     2822     2831       +9     
==========================================
+ Hits       158848   159539     +691     
- Misses      38381    38622     +241     
- Partials      522      527       +5     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

acc_df.vstack_mut(&df)?;
}

if acc_height > IdxSize::MAX as usize {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should check this here. This should fail deeper down when constructing the chunkedarray. The append operation should do the height check.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We still need some checks for the Parquet reader because of the row_offset, but I added the checks to append and extend.

The following now also throws a proper error:

import polars as pl
import numpy as np

t1 = pl.Series(np.zeros(4_000_000_000), dtype=bool)
t2 = pl.Series(np.zeros(1_000_000_000), dtype=bool)

t1.append(t2)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's better.

Why do we still need it in parquet? As processing datasets that exceed that limit is fine, we only shouldn't create morsels that are larger.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is that previous_row_count will silently overflow, which might produce wrong results. If you read a file with 5 billion times False. It will overflow there instead of actually getting to the accumulate.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And what if we use u64 here, would that work?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is quite a big change and the overflow will still happen somewhere.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright. Not worth it for now. 👍

@@ -353,7 +354,10 @@ fn rg_to_dfs_prefiltered(
});

for (_, df) in &dfs {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if the check is needed here as well if we can catch it at the source, ie chunkedarray construction.

@ritchie46 ritchie46 merged commit 9b4bd7f into pola-rs:main Aug 14, 2024
21 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
fix Bug fix python Related to Python Polars rust Related to Rust Polars
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Throw Exception Reading > UInt32 Row Datasets
2 participants