Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(rust, python): show column name if read_csv errors #7177

Merged
merged 3 commits into from
Feb 26, 2023

Conversation

MarcoGorelli
Copy link
Collaborator

@MarcoGorelli MarcoGorelli commented Feb 25, 2023

Is it OK to show the column name in the read_csv error?

It's not obvious from the error message if the column number is zero-indexed or one-indexed (at least, it wasn't to me when I encountered this, and I had hundreds of columns - in particular, it's not clear how to set the dtype of the failing column if you don't know its name)

@github-actions github-actions bot added enhancement New feature or an improvement of an existing feature python Related to Python Polars rust Related to Rust Polars labels Feb 25, 2023
@ritchie46
Copy link
Member

This touches highly optimized code. I am a bit hesitant for this. I will take a look at this later when I have got time to take a good look with performance in mind.

@MarcoGorelli
Copy link
Collaborator Author

sure, no worries - I'll do some performance tests to check the impact

@MarcoGorelli
Copy link
Collaborator Author

I tried running some performance checks here: https://www.kaggle.com/code/marcogorelli/polars-timing/notebook?scriptVersionId=120311373

Reading a 113million-row dataset (4.5 GB):

pl.read_csv('/kaggle/input/100-million-data-csv/custom_1988_2020.csv')

here:

  • 13.2 s ± 180 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

master:

  • 13.3 s ± 193 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
pl.read_csv('/kaggle/input/100-million-data-csv/custom_1988_2020.csv', dtypes={'198801': pl.Boolean}, ignore_errors=True)

here:

  • 12.5 s ± 127 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

master:

  • 12.4 s ± 71.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

@ritchie46
Copy link
Member

Right, I thoroughly looked at your changes and I agree that this should have no expected performance changes. Thanks a lot. Good error information is really valuable. 👍

@ritchie46 ritchie46 merged commit 8444f70 into pola-rs:master Feb 26, 2023
@ritchie46
Copy link
Member

As a side note. Next time when you benchmarks csv code, set rechunk=False this is a very expensive operation that doesn't relate to csv parsing at all.

@MarcoGorelli
Copy link
Collaborator Author

thanks, much appreciated!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or an improvement of an existing feature python Related to Python Polars rust Related to Rust Polars
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants