Fix missing input bytes read metric for Parquet #1205

Merged: jlowe merged 2 commits into NVIDIA:branch-0.3 on Nov 26, 2020

jlowe (Member) commented Nov 25, 2020

Fixes #1199.

DataSourceRDD blindly assumes that any input occurs on the main task thread, but the plugin's Parquet readers support reading the data using a thread pool. When DataSourceRDD tries to grab the Hadoop FileSystem statistics for the main task thread, it finds no bytes read and unconditionally smashes the input metric to zero.
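
The failure mode described above can be reproduced in miniature. The following is only an illustrative Python sketch, not the plugin's Scala code: `_stats`, `record_read`, `bytes_read_on_this_thread`, `read_chunk`, and `run_task` are hypothetical names, and a `threading.local` counter stands in for per-thread file system statistics. It shows that when reads happen on pool threads, the thread that owns the task sees zero bytes read.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for per-thread file system read statistics.
_stats = threading.local()


def record_read(num_bytes):
    """Add num_bytes to the counter of the thread doing the read."""
    _stats.bytes_read = getattr(_stats, "bytes_read", 0) + num_bytes


def bytes_read_on_this_thread():
    """Return the counter of the calling thread only."""
    return getattr(_stats, "bytes_read", 0)


def read_chunk(num_bytes):
    """Pretend to read a chunk; the read is charged to the worker thread."""
    record_read(num_bytes)
    return num_bytes


def run_task(chunk_sizes):
    """Read all chunks on a thread pool, then sample stats on the task thread."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        actual = sum(pool.map(read_chunk, chunk_sizes))
    # The task thread performed no reads itself, so its counter is untouched.
    seen_by_task_thread = bytes_read_on_this_thread()
    return actual, seen_by_task_thread


actual, seen = run_task([100, 200, 300])
```

Here `actual` is the number of bytes the workers really read, while `seen` is what a caller that samples only its own thread's statistics would report.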

Therefore this change stops using DataSourceRDD as-is and instead creates a derivation that does everything DataSourceRDD did except update the input bytes metric. It is now up to the individual scans (CSV, ORC, Parquet, etc.) to update that metric. A wrapper class for partitioned readers was added so that inputs that don't yet support multithreaded reads (CSV and ORC) can update the metric easily, and the Parquet multithreaded readers were updated to retrieve the bytes-read stats for each thread and report them back along with the work performed.
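
The two approaches in this description can be sketched roughly as follows. This is an illustrative Python sketch under stated assumptions, not the plugin's Scala implementation: `InputMetrics`, `read_with_stats`, `multithreaded_scan`, and `MetricsUpdatingReader` are hypothetical names, and a `threading.local` counter again stands in for per-thread file system statistics. Each pool worker returns its own bytes-read delta together with its data, and a wrapper reader does the same bookkeeping for readers that run on the calling thread.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for per-thread file system read statistics.
_stats = threading.local()


def _thread_bytes_read():
    return getattr(_stats, "bytes_read", 0)


def _fake_read(num_bytes):
    """Pretend to read num_bytes, charging them to the current thread."""
    _stats.bytes_read = _thread_bytes_read() + num_bytes
    return b"x" * num_bytes


class InputMetrics:
    """Hypothetical task-level input metric that scans update explicitly."""

    def __init__(self):
        self.bytes_read = 0
        self._lock = threading.Lock()

    def add_bytes_read(self, n):
        with self._lock:
            self.bytes_read += n


def read_with_stats(num_bytes):
    """Runs on a worker: return the data plus this thread's bytes-read delta."""
    before = _thread_bytes_read()
    data = _fake_read(num_bytes)
    return data, _thread_bytes_read() - before


def multithreaded_scan(chunk_sizes, metrics):
    """Multithreaded reader: fold each worker's delta into the task metric."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        results = list(pool.map(read_with_stats, chunk_sizes))
    batches = []
    for data, delta in results:
        metrics.add_bytes_read(delta)
        batches.append(data)
    return batches


class MetricsUpdatingReader:
    """Wrapper for readers that read on the calling thread."""

    def __init__(self, chunk_sizes, metrics):
        self._sizes = iter(chunk_sizes)
        self._metrics = metrics

    def __iter__(self):
        return self

    def __next__(self):
        size = next(self._sizes)  # raises StopIteration when exhausted
        before = _thread_bytes_read()
        data = _fake_read(size)
        self._metrics.add_bytes_read(_thread_bytes_read() - before)
        return data
```

A short usage example: `metrics = InputMetrics(); multithreaded_scan([100, 200, 300], metrics)` leaves the total bytes read in `metrics.bytes_read`, regardless of which threads performed the reads.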

Signed-off-by: Jason Lowe <jlowe@nvidia.com>
@jlowe jlowe added the SQL part of the SQL/Dataframe plugin label Nov 25, 2020
@jlowe jlowe added this to the Nov 23 - Dec 4 milestone Nov 25, 2020
@jlowe jlowe self-assigned this Nov 25, 2020
jlowe (Member, Author) commented Nov 25, 2020

build

jlowe (Member, Author) commented Nov 25, 2020

build

@jlowe jlowe merged commit 6e034f7 into NVIDIA:branch-0.3 Nov 26, 2020
kuhushukla pushed a commit to kuhushukla/spark-rapids that referenced this pull request Nov 30, 2020
Signed-off-by: Jason Lowe <jlowe@nvidia.com>
nartal1 pushed a commit to nartal1/spark-rapids that referenced this pull request Jun 9, 2021
Signed-off-by: Jason Lowe <jlowe@nvidia.com>
nartal1 pushed a commit to nartal1/spark-rapids that referenced this pull request Jun 9, 2021
Signed-off-by: Jason Lowe <jlowe@nvidia.com>
@jlowe jlowe deleted the fix-parquet-input-metrics branch September 10, 2021 15:41
tgravescs pushed a commit to tgravescs/spark-rapids that referenced this pull request Nov 30, 2023
…IDIA#1205)

Signed-off-by: spark-rapids automation <70000568+nvauto@users.noreply.github.com>
Linked issue that merging this pull request may close:

[BUG] No data size in Input column in Stages page from Spark UI when using Parquet as file source