Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Audit] Add bucketed scan info in query plan of data source v1 [databricks] #4461

Merged

Conversation

HaoYang670
Copy link
Collaborator

Signed-off-by: remzi 13716567376yh@gmail.com
close #3952

  1. update GpuFileSourceScanExec.scala to include the change of spark apache/spark@79515e4b6c
  2. add an integration test
  3. update shim layers from spark31X

Signed-off-by: remzi <13716567376yh@gmail.com>
Signed-off-by: remzi <13716567376yh@gmail.com>
Signed-off-by: remzi <13716567376yh@gmail.com>
add shim layer

Signed-off-by: remzi <13716567376yh@gmail.com>
@HaoYang670
Copy link
Collaborator Author

build

@HaoYang670 HaoYang670 added the audit_3.3.0 Audit related tasks for 3.3.0 label Jan 5, 2022
Copy link
Collaborator

@revans2 revans2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some nits. But, we need to understand and fix the failing test. We also should run these against databricks so we are sure it works there too.

assert "Bucketed: false (disabled by query planner)" in df._sc._jvm.PythonSQLUtils.explainString(df._jdf.queryExecution(), "simple")


with_gpu_session(bucket_column_not_read)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why have 1 test for all of these instead of 4 separate tests? It does not look like you are saving any code space or anything like that.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your feedback! I will split it.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is one failing test because the Spark version is before 3.1.0. I will add a skipif mark to the test.

@allow_non_gpu(any=True)
def test_explain_bucketed_scan(spark_tmp_table_factory):
"""
https://github.com/NVIDIA/spark-rapids/issues/3952
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A text explanation would be good to be able to get a high level idea of what is happening.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the feedback! I will add comments.

Signed-off-by: remzi <13716567376yh@gmail.com>
Signed-off-by: remzi <13716567376yh@gmail.com>
@HaoYang670 HaoYang670 changed the title [Audit] [FEA][SPARK-32986][SQL] Add bucketed scan info in query plan of data source v1 [Audit] Add bucketed scan info in query plan of data source v1[databricks] Jan 6, 2022
@HaoYang670 HaoYang670 changed the title [Audit] Add bucketed scan info in query plan of data source v1[databricks] [Audit] Add bucketed scan info in query plan of data source v1 [databricks] Jan 6, 2022
@HaoYang670
Copy link
Collaborator Author

build

@HaoYang670
Copy link
Collaborator Author

Solve merge conflicts because shim layers are refactored.

@HaoYang670 HaoYang670 marked this pull request as draft January 7, 2022 01:40
Signed-off-by: remzi <13716567376yh@gmail.com>
@HaoYang670
Copy link
Collaborator Author

build

@HaoYang670 HaoYang670 marked this pull request as ready for review January 7, 2022 05:00
Copy link
Collaborator

@revans2 revans2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like some of the files might need to be updated for 2022 copyright. Other than that it looks good.

@sameerz sameerz added this to the Jan 10 - Jan 28 milestone Jan 9, 2022
@HaoYang670
Copy link
Collaborator Author

Thank you. I will update them

Signed-off-by: remzi <13716567376yh@gmail.com>
@HaoYang670 HaoYang670 closed this Jan 10, 2022
@HaoYang670 HaoYang670 reopened this Jan 10, 2022
@HaoYang670
Copy link
Collaborator Author

build

Signed-off-by: remzi <13716567376yh@gmail.com>
@HaoYang670
Copy link
Collaborator Author

build

@revans2 revans2 merged commit 9446618 into NVIDIA:branch-22.02 Jan 11, 2022
@HaoYang670 HaoYang670 deleted the issue3952_add_bucketed_scan_info branch January 12, 2022 02:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
audit_3.3.0 Audit related tasks for 3.3.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Audit] [FEA][SPARK-32986][SQL] Add bucketed scan info in query plan of data source v1
3 participants