
Add ignoreCorruptFiles for ORC readers [databricks] #4809

Merged: 1 commit into NVIDIA:branch-22.04 on Feb 18, 2022

Conversation

@wbo4958 (Collaborator) commented Feb 17, 2022

This PR closes #4802, #4803, and #4795.

Since the ORC/Parquet PERFILE reader and the text-based GPU readers (CSV, JSON) already run inside FileScanRDD, which implements both ignoring missing files and ignoring corrupt files, we don't need to do anything extra for them.
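
Both behaviors are driven by standard Spark SQL configs. A minimal sketch of enabling them (the path and schema here are hypothetical):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("ignore-corrupt-files-demo")
  .master("local[*]")
  .getOrCreate()

// Standard Spark SQL flags honored by FileScanRDD: corrupt files are
// skipped with a warning instead of failing the scan, and files deleted
// between query planning and execution are tolerated.
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")
spark.conf.set("spark.sql.files.ignoreMissingFiles", "true")

// Reading a directory that mixes valid and corrupt CSV files now
// returns rows from the readable files only.
val df = spark.read.schema("a INT, b INT").csv("/tmp/mixed-csv-dir")
df.show()
```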

I couldn't find a way to trigger an IOException or RuntimeException for CPU CSV/JSON parsing, since the CPU can still parse malformed files when reading CSV/JSON.

But for corrupt JSON files, cuDF does throw an exception, so I added a test for GPU JSON reading.
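
As a sketch of that failure mode (hypothetical paths, reusing the spark session from the sketch above; whether cuDF rejects these exact bytes is an assumption, but any non-JSON content is the idea):

```scala
import java.nio.file.{Files, Paths}

// Put one valid and one corrupt file in the same input directory.
Files.createDirectories(Paths.get("/tmp/json-dir"))
Files.write(Paths.get("/tmp/json-dir/good.json"), "{\"a\": 1}\n".getBytes)
Files.write(Paths.get("/tmp/json-dir/bad.json"), Array[Byte](0x00, 0x1f, 0x7f))

// Without the flag, the GPU JSON read throws on bad.json; with it,
// the corrupt file is skipped and only the valid row comes back.
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")
spark.read.schema("a INT").json("/tmp/json-dir").show()
```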

Signed-off-by: Bobby Wang <wbo4958@gmail.com>
@wbo4958 (Collaborator, Author) commented Feb 17, 2022

build

@jlowe (Member) left a comment


Doesn't GpuTextBasedPartitionReader need a try..catch block around readToTable(Boolean) to cover the case of the filesystem returning an I/O exception or runtime exception? Spark CPU covers this in a central place via FilePartitionReader.next which skips files that return I/O errors or other runtime exceptions when ignoreCorruptFiles is true.
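
For context, a simplified sketch of that central pattern (not the verbatim Spark source; the function names here are made up for illustration):

```scala
import java.io.{FileNotFoundException, IOException}

object SkipSketch {
  // Roughly what FilePartitionReader.next does: the read is wrapped in a
  // try..catch, and failures are downgraded to "skip this file" when the
  // corresponding flag is set.
  def nextWithSkip(
      readCurrentFile: () => Boolean,   // advance the current file's reader
      moveToNextFile: () => Boolean,    // open the next file, if any remain
      ignoreCorruptFiles: Boolean,
      ignoreMissingFiles: Boolean): Boolean = {
    try {
      readCurrentFile()
    } catch {
      case _: FileNotFoundException if ignoreMissingFiles =>
        moveToNextFile()
      case e @ (_: RuntimeException | _: IOException) if ignoreCorruptFiles =>
        // Log and skip the rest of the corrupt file, then keep going.
        println(s"Skipped the rest of a corrupted file: ${e.getMessage}")
        moveToNextFile()
    }
  }
}
```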

@wbo4958 (Collaborator, Author) commented Feb 18, 2022

> Doesn't GpuTextBasedPartitionReader need a try..catch block around readToTable(Boolean) to cover the case of the filesystem returning an I/O exception or runtime exception? Spark CPU covers this in a central place via FilePartitionReader.next which skips files that return I/O errors or other runtime exceptions when ignoreCorruptFiles is true.

Yeah, we don't need to do that, since Spark itself already handles this for us.

For V1, we get FileScanRDD or GpuFileScanRDD, both of which implement ignoreCorruptFiles/ignoreMissingFiles.

For V2, we get GpuDataSourceRDD, and GpuTextBasedPartitionReader is wrapped in FilePartitionReader, which implements ignoreCorruptFiles/ignoreMissingFiles.

I added test_json_read_with_corrupt_files for both the V1 and V2 code paths, which covers the scenarios above; a sketch of toggling between the two paths follows.
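
A minimal sketch of how a test can exercise both paths, using the standard spark.sql.sources.useV1SourceList config (paths hypothetical; the real test lives in the Python integration suite):

```scala
// Listing "json" in useV1SourceList keeps the V1 FileScanRDD path; an
// empty list routes the scan through the V2 DataSourceRDD path, where
// the partition reader is wrapped in FilePartitionReader.
spark.conf.set("spark.sql.sources.useV1SourceList", "json") // exercise V1
spark.read.schema("a INT").json("/tmp/json-dir").show()

spark.conf.set("spark.sql.sources.useV1SourceList", "")     // exercise V2
spark.read.schema("a INT").json("/tmp/json-dir").show()
```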

@jlowe jlowe merged commit 75b49a9 into NVIDIA:branch-22.04 Feb 18, 2022
@wbo4958 wbo4958 deleted the orc-ignore-corrupt-files branch February 19, 2022 01:15
Labels: bug