feat: Raise more informative error message for directories containing files with mixed extensions #17480
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@            Coverage Diff            @@
##             main   #17480    +/-   ##
=======================================
  Coverage   80.46%   80.47%
=======================================
  Files        1483     1483
  Lines      195138   195159    +21
  Branches     2782     2782
=======================================
+ Hits       157014   157047    +33
+ Misses      37612    37600    -12
  Partials      512      512
If you call |
We decided not to determine which files are Parquet files by checking whitelisted extensions. I also don't want to check for magic bytes on all the files in a directory, as that could require downloading all of them, and it is impossible for (compressed) CSV and JSON. The idea is that if you pass a directory, you guarantee it is a (hive-partitioned) dataset. If you want to load all files with a certain file extension, you can do so via glob patterns.
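As a rough illustration of the glob-based alternative, the sketch below uses Python's standard `glob` module to show which files an extension pattern would select from a directory tree. The helper names, directory layout, and file names are made up for the example. With Polars, the pattern string itself would be passed to a scan function instead (for example something like `pl.scan_parquet("dataset/**/*.parquet")`), which is not exercised here.

```python
import glob
import os
import tempfile


def select_by_extension(directory: str, extension: str) -> list[str]:
    """Return sorted file paths under `directory` (recursively) ending in `extension`.

    Hypothetical helper: it only mimics what an extension glob selects.
    """
    # Escape the directory part so special characters in it are taken literally.
    pattern = os.path.join(glob.escape(directory), "**", f"*{extension}")
    return sorted(p for p in glob.glob(pattern, recursive=True) if os.path.isfile(p))


def demo_select() -> list[str]:
    """Build a small mixed-extension tree and select only the .parquet files."""
    with tempfile.TemporaryDirectory() as root:
        os.makedirs(os.path.join(root, "year=2024"))
        names = (
            os.path.join("year=2024", "a.parquet"),
            os.path.join("year=2024", "b.parquet"),
            os.path.join("year=2024", "notes.csv"),
            "README.md",
        )
        for name in names:
            open(os.path.join(root, name), "w").close()
        # Report paths relative to the dataset root for readability.
        return [os.path.relpath(p, root) for p in select_by_extension(root, ".parquet")]
```

Only the two `.parquet` files are selected; the `.csv` and `.md` files in the same tree are skipped because the pattern never matches them.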
8739d4b to 841ffbf
… files with mixed extensions (pola-rs#17480)
If the user passes a single directory to a `scan_*` function, we will now check that all files underneath it have the same file extension. If this is not the case, an error is raised that shows the offending paths and recommends using glob patterns if the user still wishes to scan all files.

Combining this with #17478 will also make it so that file extensions of empty files are ignored.
Fixes #17436
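
The check described above can be sketched in plain Python. This is not the implementation in this PR: the function name, the `ignore_empty` flag, and the error wording are assumptions made for illustration. The flag models the combined behaviour with #17478, where empty files do not count towards the extension check.

```python
import tempfile
from pathlib import Path


def check_uniform_extensions(directory: str, ignore_empty: bool = True) -> list[Path]:
    """Return the files under `directory`, raising if their extensions differ.

    Hypothetical helper; `ignore_empty` skips zero-byte files entirely.
    """
    files = sorted(p for p in Path(directory).rglob("*") if p.is_file())
    if ignore_empty:
        files = [p for p in files if p.stat().st_size > 0]

    # Group files by extension so one offending path per extension can be shown.
    by_ext: dict[str, list[Path]] = {}
    for path in files:
        by_ext.setdefault(path.suffix, []).append(path)

    if len(by_ext) > 1:
        examples = ", ".join(str(paths[0]) for paths in by_ext.values())
        raise ValueError(
            f"directory contained files with mixed extensions: {examples}. "
            "Use a glob pattern to scan all files regardless of extension."
        )
    return files
```

A directory holding only non-empty `.parquet` files passes and returns the file list. Adding a non-empty `.csv` file raises the error with one example path per extension.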