
Add ignoreCorruptFiles for ORC readers [databricks] #4809

Merged: 1 commit into NVIDIA:branch-22.04 on Feb 18, 2022

Conversation

@wbo4958 (Collaborator) commented Feb 17, 2022

This PR closes #4802, #4803, and #4795.

Since the ORC/Parquet PERFILE reader and the text-based GPU readers (CSV, JSON) already run inside FileScanRDD, which implements both ignoring missing files and ignoring corrupt files, we don't need to do anything extra for them.
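
Both behaviors are driven by standard Spark SQL configs. A minimal sketch of enabling them (the path and schema here are hypothetical):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("ignore-corrupt-files-demo")
  .master("local[*]")
  .getOrCreate()

// Standard Spark SQL flags honored by FileScanRDD: corrupt files are
// skipped with a warning instead of failing the scan, and files deleted
// between query planning and execution are tolerated.
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")
spark.conf.set("spark.sql.files.ignoreMissingFiles", "true")

// Reading a directory that mixes valid and corrupt CSV files now
// returns rows from the readable files only.
val df = spark.read.schema("a INT, b INT").csv("/tmp/mixed-csv-dir")
df.show()
```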

I couldn't find a way to trigger an IOException or RuntimeException for CPU CSV/JSON parsing, since the CPU can still parse malformed files when reading CSV/JSON.

But for corrupt JSON files, cuDF does throw an exception, so I added a test for GPU JSON reading.
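
As a sketch of that failure mode (hypothetical paths, reusing the spark session from the sketch above; whether cuDF rejects these exact bytes is an assumption, but any non-JSON content is the idea):

```scala
import java.nio.file.{Files, Paths}

// Put one valid and one corrupt file in the same input directory.
Files.createDirectories(Paths.get("/tmp/json-dir"))
Files.write(Paths.get("/tmp/json-dir/good.json"), "{\"a\": 1}\n".getBytes)
Files.write(Paths.get("/tmp/json-dir/bad.json"), Array[Byte](0x00, 0x1f, 0x7f))

// Without the flag, the GPU JSON read throws on bad.json; with it,
// the corrupt file is skipped and only the valid row comes back.
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")
spark.read.schema("a INT").json("/tmp/json-dir").show()
```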

Signed-off-by: Bobby Wang <wbo4958@gmail.com>
@wbo4958 (Collaborator, Author) commented Feb 17, 2022

build

@jlowe (Member) left a comment


Doesn't GpuTextBasedPartitionReader need a try..catch block around readToTable(Boolean) to cover the case of the filesystem returning an I/O exception or runtime exception? Spark CPU covers this in a central place via FilePartitionReader.next which skips files that return I/O errors or other runtime exceptions when ignoreCorruptFiles is true.
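
For context, a simplified sketch of that central pattern (not the verbatim Spark source; the function names here are made up for illustration):

```scala
import java.io.{FileNotFoundException, IOException}

object SkipSketch {
  // Roughly what FilePartitionReader.next does: the read is wrapped in a
  // try..catch, and failures are downgraded to "skip this file" when the
  // corresponding flag is set.
  def nextWithSkip(
      readCurrentFile: () => Boolean,   // advance the current file's reader
      moveToNextFile: () => Boolean,    // open the next file, if any remain
      ignoreCorruptFiles: Boolean,
      ignoreMissingFiles: Boolean): Boolean = {
    try {
      readCurrentFile()
    } catch {
      case _: FileNotFoundException if ignoreMissingFiles =>
        moveToNextFile()
      case e @ (_: RuntimeException | _: IOException) if ignoreCorruptFiles =>
        // Log and skip the rest of the corrupt file, then keep going.
        println(s"Skipped the rest of a corrupted file: ${e.getMessage}")
        moveToNextFile()
    }
  }
}
```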

@wbo4958 (Collaborator, Author) commented Feb 18, 2022

> Doesn't GpuTextBasedPartitionReader need a try..catch block around readToTable(Boolean) to cover the case of the filesystem returning an I/O exception or runtime exception? Spark CPU covers this in a central place via FilePartitionReader.next which skips files that return I/O errors or other runtime exceptions when ignoreCorruptFiles is true.

Yeah, we don't need to do that, since Spark itself already handles this for us.

For V1, we get FileScanRDD or GpuFileScanRDD, both of which implement ignoreCorruptFiles/ignoreMissingFiles.

For V2, we get GpuDataSourceRDD, and GpuTextBasedPartitionReader is wrapped in FilePartitionReader, which implements ignoreCorruptFiles/ignoreMissingFiles.

I added test_json_read_with_corrupt_files for both the V1 and V2 code paths, which covers the scenarios above; a sketch of toggling between the two paths follows.
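
A minimal sketch of how a test can exercise both paths, using the standard spark.sql.sources.useV1SourceList config (paths hypothetical; the real test lives in the Python integration suite):

```scala
// Listing "json" in useV1SourceList keeps the V1 FileScanRDD path; an
// empty list routes the scan through the V2 DataSourceRDD path, where
// the partition reader is wrapped in FilePartitionReader.
spark.conf.set("spark.sql.sources.useV1SourceList", "json") // exercise V1
spark.read.schema("a INT").json("/tmp/json-dir").show()

spark.conf.set("spark.sql.sources.useV1SourceList", "")     // exercise V2
spark.read.schema("a INT").json("/tmp/json-dir").show()
```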

@jlowe jlowe merged commit 75b49a9 into NVIDIA:branch-22.04 Feb 18, 2022
@wbo4958 wbo4958 deleted the orc-ignore-corrupt-files branch February 19, 2022 01:15
Labels: bug