
[BUG] Multiple scans for the same parquet data source #483

Closed
chenrui17 opened this issue Jul 31, 2020 · 7 comments
Labels
bug Something isn't working performance A performance related task/issue

Comments


chenrui17 commented Jul 31, 2020

Describe the bug
I tested TPC-DS query 2 and found two identical operators in the DAG graph, which means the same parquet table is read twice, leading to poor performance. The web UI is shown in the screenshot below:
(screenshot: web UI stage view)
The DAG graph is shown in the screenshot below:
(screenshot: DAG graph)

Steps/Code to reproduce bug
My config is attached:
tpc-ds.q2.config.txt

@chenrui17 chenrui17 added ? - Needs Triage Need team to review and classify bug Something isn't working labels Jul 31, 2020
@wjxiz1992 wjxiz1992 changed the title gpu read parquet table (GpuScan Parquet) twice? [BUG] [BUG] Multiple scans for the same parquet data source Jul 31, 2020
@sameerz sameerz added the performance A performance related task/issue label Jul 31, 2020

revans2 commented Jul 31, 2020

I just did a test on a simpler use case using TPCH data.

select * from 
 table1 t1 inner join table1 t2
 on t1.o_orderkey = t2.o_orderkey

does not read the data twice. I think this was fixed by #253.


revans2 commented Jul 31, 2020

I'm going to try to run with TPC-DS as well, but it might take a bit to get everything set up for testing.


LuciferYang commented Aug 3, 2020

@chenrui17 We'd better write a test case to verify the effect of the ReuseExchange rule in the latest version.
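One way to sanity-check exchange reuse (a minimal sketch, not the plugin's actual test suite) is to count operator occurrences in the string form of the executed plan, e.g. the output of `df.explain()`. The helper names, the heuristic, and the abbreviated plan text below are all hypothetical:

```python
def count_nodes(plan_str, node_name):
    """Count how many lines of a plan string mention the given operator name."""
    return sum(node_name in line for line in plan_str.splitlines())

def exchange_is_reused(plan_str, scan_name="GpuScan parquet"):
    """Heuristic: a self-join over one table should show a single scan plus a
    ReusedExchange node, rather than a second identical scan."""
    return (count_nodes(plan_str, scan_name) == 1
            and count_nodes(plan_str, "ReusedExchange") >= 1)

# Abbreviated, hypothetical plan text for a self-join on one parquet table:
plan = """GpuShuffledHashJoin [o_orderkey], [o_orderkey]
:- GpuShuffleExchange hashpartitioning(o_orderkey, 200)
:  +- GpuScan parquet table1[o_orderkey]
+- ReusedExchange [o_orderkey], GpuShuffleExchange
"""
print(exchange_is_reused(plan))  # prints: True
```

A real test would instead collect `ReusedExchangeExec` nodes from the executed plan, but string matching is often good enough for a quick check against a pasted `explain` output.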


revans2 commented Aug 3, 2020

The issue was around canonicalization. We started out using a copy of AttributeReference that we called GpuAttributeReference. This was so that we could maintain strict typing on expressions, to ensure that when we translated something to run on the GPU the Scala type checker would help verify that all of it was translated, no exceptions. That turned out to be a bad choice because AttributeReference is hard-coded all over the place in Spark, especially when doing canonicalization.

https://github.com/apache/spark/blob/7f5326c082081af465deb81c368306e569325b53/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala#L292-L296

Canonicalization of the SparkPlan changes it so that it can no longer be executed, but two different versions of the same plan will compare equal. This is what the ReuseExchange rule does: it walks through all exchanges in a plan, checks whether any are the same, and dedupes the ones that are.

Be aware that there are still a few places where exchanges that would be deduplicated in a CPU-only query will not be in a GPU query. This mostly comes down to what we can and cannot translate to run on the GPU, so we can hit situations like #386, or issues with the plugin not doing a columnar exchange when the downstream part of the operation is not also on the GPU.


revans2 commented Aug 4, 2020

I was able to run q4 but at a smaller scale factor (50), and I verified that the de-duplication is happening properly.

@LuciferYang and @chenrui17 is it okay if I close this as fixed in the upcoming 0.2 release?

@revans2 revans2 removed the ? - Needs Triage Need team to review and classify label Aug 4, 2020
LuciferYang (Contributor) commented:

@chenrui17 Please try it with branch-0.2 and give feedback as soon as possible, thanks!

chenrui17 (Author) commented:

@LuciferYang I tested TPC-DS query 2 with branch-0.2 in local mode; this problem has been fixed.

pxLi pushed a commit to pxLi/spark-rapids that referenced this issue May 12, 2022
tgravescs pushed a commit to tgravescs/spark-rapids that referenced this issue Nov 30, 2023
Signed-off-by: spark-rapids automation <70000568+nvauto@users.noreply.github.com>
