Fix missing input bytes read metric for Parquet #1205

jlowe · 2020-11-25T21:11:58Z

DataSourceRDD blindly assumes that any input occurs on the main task thread, but the plugin's Parquet readers support reading the data using a thread pool. When DataSourceRDD tries to grab the Hadoop FileSystem statistics for the main task thread, it finds no bytes read and unconditionally smashes the input metric to zero.
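To illustrate the failure mode, here is a minimal, self-contained sketch. It does not use Spark or Hadoop: the `ThreadLocal` counter is a hypothetical stand-in for per-thread filesystem statistics, and `ThreadLocalStatsDemo`, `recordRead`, and `readOnPool` are names invented for this example. It shows that bytes recorded on a pool thread are invisible to the thread that submitted the work, unless the worker reports its own delta back.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ThreadLocalStatsDemo {
    // Assumption: a per-thread byte counter standing in for filesystem statistics.
    private static final ThreadLocal<long[]> BYTES_READ =
        ThreadLocal.withInitial(() -> new long[1]);

    static void recordRead(long n) {
        BYTES_READ.get()[0] += n;
    }

    static long bytesReadOnThisThread() {
        return BYTES_READ.get()[0];
    }

    /**
     * Performs a simulated read of n bytes on a pool thread.
     * Returns {bytes visible to the calling thread, bytes reported by the worker}.
     */
    static long[] readOnPool(long n) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(1);
        try {
            Future<Long> worker = pool.submit(() -> {
                long before = bytesReadOnThisThread();
                recordRead(n); // the "I/O" happens on the pool thread
                return bytesReadOnThisThread() - before; // delta for this thread only
            });
            long reported = worker.get();
            return new long[] { bytesReadOnThisThread(), reported };
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        long[] r = readOnPool(4096);
        System.out.println("calling thread sees " + r[0] + ", worker reported " + r[1]);
    }
}
```

When run from a thread that has not recorded any reads itself, the calling thread sees 0 bytes while the worker reports 4096, which is the mismatch described above.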

This change therefore stops using DataSourceRDD as-is and instead creates a derivation that does everything DataSourceRDD did except update the input bytes metric. It is now up to the individual scans (CSV, ORC, Parquet, etc.) to update the input bytes metric. A wrapper class for partitioned readers was added to make this easy for inputs that don't yet support multithreaded reads (CSV and ORC), and the Parquet multithreaded readers were updated to retrieve the bytes-read stats for each thread and report them back along with the work being performed.
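The wrapper idea for single-threaded inputs can be sketched roughly as follows. This is not the plugin's actual class: `Reader`, `MetricUpdatingReader`, and `readAll` are hypothetical names, the `LongSupplier` stands in for a callback that returns the bytes read so far on the current thread, and an `AtomicLong` stands in for the input bytes metric. The wrapper adds the bytes read since its last check to the metric each time `next()` is called.

```java
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

public class MetricReaderDemo {
    /** Hypothetical minimal reader interface, loosely modeled on a partition reader. */
    interface Reader<T> {
        boolean next();
        T get();
    }

    /** Wraps a single-threaded reader and adds newly read bytes to a metric. */
    static final class MetricUpdatingReader<T> implements Reader<T> {
        private final Reader<T> delegate;
        private final LongSupplier bytesReadOnThread; // assumed per-thread stats callback
        private final AtomicLong inputBytesMetric;    // stand-in for the input bytes metric
        private long lastSeen;

        MetricUpdatingReader(Reader<T> delegate, LongSupplier bytesReadOnThread,
                             AtomicLong inputBytesMetric) {
            this.delegate = delegate;
            this.bytesReadOnThread = bytesReadOnThread;
            this.inputBytesMetric = inputBytesMetric;
            this.lastSeen = bytesReadOnThread.getAsLong();
        }

        @Override
        public boolean next() {
            boolean hasNext = delegate.next();
            long now = bytesReadOnThread.getAsLong();
            inputBytesMetric.addAndGet(now - lastSeen); // only the new bytes
            lastSeen = now;
            return hasNext;
        }

        @Override
        public T get() {
            return delegate.get();
        }
    }

    /** Drains a fake reader over the given chunks and returns the final metric value. */
    static long readAll(List<byte[]> chunks) {
        AtomicLong threadBytes = new AtomicLong(); // simulated per-thread bytes read
        AtomicLong metric = new AtomicLong();
        Iterator<byte[]> it = chunks.iterator();
        Reader<byte[]> fake = new Reader<byte[]>() {
            private byte[] current;
            public boolean next() {
                if (!it.hasNext()) return false;
                current = it.next();
                threadBytes.addAndGet(current.length); // simulate the I/O accounting
                return true;
            }
            public byte[] get() { return current; }
        };
        Reader<byte[]> reader = new MetricUpdatingReader<>(fake, threadBytes::get, metric);
        while (reader.next()) {
            reader.get();
        }
        return metric.get();
    }

    public static void main(String[] args) {
        System.out.println(readAll(java.util.Arrays.asList(new byte[100], new byte[28])));
    }
}
```

Because the wrapper tracks a delta rather than overwriting the metric, it never resets a value that another code path has already accumulated.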

Signed-off-by: Jason Lowe <jlowe@nvidia.com>

jlowe · 2020-11-25T21:12:05Z

build

jlowe · 2020-11-25T23:26:48Z

build

Fix missing input bytes read metric for Parquet

10a537a

Signed-off-by: Jason Lowe <jlowe@nvidia.com>

jlowe added the SQL part of the SQL/Dataframe plugin label Nov 25, 2020

jlowe added this to the Nov 23 - Dec 4 milestone Nov 25, 2020

jlowe self-assigned this Nov 25, 2020

tgravescs approved these changes Nov 25, 2020

Merge branch 'branch-0.3' into fix-parquet-input-metrics

e380f9f

jlowe merged commit 6e034f7 into NVIDIA:branch-0.3 Nov 26, 2020

kuhushukla pushed a commit to kuhushukla/spark-rapids that referenced this pull request Nov 30, 2020

Fix missing input bytes read metric for Parquet (NVIDIA#1205)

028078a

Signed-off-by: Jason Lowe <jlowe@nvidia.com>

nartal1 pushed a commit to nartal1/spark-rapids that referenced this pull request Jun 9, 2021

Fix missing input bytes read metric for Parquet (NVIDIA#1205)

0a0b03f

Signed-off-by: Jason Lowe <jlowe@nvidia.com>

nartal1 pushed a commit to nartal1/spark-rapids that referenced this pull request Jun 9, 2021

Fix missing input bytes read metric for Parquet (NVIDIA#1205)

4c6b32a

Signed-off-by: Jason Lowe <jlowe@nvidia.com>

jlowe deleted the fix-parquet-input-metrics branch September 10, 2021 15:41

tgravescs pushed a commit to tgravescs/spark-rapids that referenced this pull request Nov 30, 2023

Update submodule cudf to 9be38d299748af3c29be29c6faa62cb3296bdf8d (NV…

35af9df

…IDIA#1205) Signed-off-by: spark-rapids automation <70000568+nvauto@users.noreply.github.com>
