
[BUG] Some queries fail when cost-based optimizations are enabled #1899

Closed
andygrove opened this issue Mar 9, 2021 · 5 comments · Fixed by #1910 or #1954
Assignees
Labels
bug Something isn't working

Comments

@andygrove
Contributor

Describe the bug
With the experimental cost-based optimizer enabled, 23 of the NDS queries fail due to inconsistent joins (incompatible mix of CPU/GPU operators).

The queries that fail are q7, q9, q26, q27, q28, q30, q32, q36, q44, q59, q81, q92, q1, q6, q10, q54, q85, q94, q11, q13, q16, q23a, q35

@andygrove andygrove added bug Something isn't working ? - Needs Triage Need team to review and classify labels Mar 9, 2021
@andygrove andygrove added this to the Mar 1 - Mar 12 milestone Mar 9, 2021
@andygrove andygrove self-assigned this Mar 9, 2021
@andygrove andygrove changed the title [BUG] Some queries fail due to inconsistent joins when cost-based optimizations are enabled [BUG] Some queries fail when cost-based optimizations are enabled Mar 9, 2021
@andygrove
Contributor Author

q6 fails with the following error when running against Spark 3.1.1, but works with Spark 3.0.2 (with AQE and the RAPIDS CBO enabled in both cases):

java.util.NoSuchElementException: key not found: numPartitions
        at scala.collection.immutable.Map$EmptyMap$.apply(Map.scala:101)
        at scala.collection.immutable.Map$EmptyMap$.apply(Map.scala:99)
        at org.apache.spark.sql.execution.adaptive.CustomShuffleReaderExec.sendDriverMetrics(CustomShuffleReaderExec.scala:122)
        at org.apache.spark.sql.execution.adaptive.CustomShuffleReaderExec.shuffleRDD$lzycompute(CustomShuffleReaderExec.scala:182)
        at org.apache.spark.sql.execution.adaptive.CustomShuffleReaderExec.shuffleRDD(CustomShuffleReaderExec.scala:181)
        at org.apache.spark.sql.execution.adaptive.CustomShuffleReaderExec.doExecuteColumnar(CustomShuffleReaderExec.scala:196)

@sameerz sameerz removed the ? - Needs Triage Need team to review and classify label Mar 9, 2021
@andygrove
Contributor Author

The q6 error above was misleading; the underlying cause is a regression in Spark 3.1.1's error handling when a canonicalized plan is executed. I filed https://issues.apache.org/jira/browse/SPARK-34682 to track it.
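For context on the stack trace above: Scala's `Map.apply` throws `NoSuchElementException("key not found: ...")` when the key is absent, and a canonicalized plan carries an empty metrics map, so the `numPartitions` lookup in `sendDriverMetrics` fails. A rough Python analogue (illustrative only, not Spark code; the variable names are hypothetical):

```python
# Rough analogue: Scala's Map.apply raises on a missing key, much like
# Python's dict indexing raises KeyError. These dicts are hypothetical
# stand-ins, not Spark's actual metrics structures.
metrics = {"numPartitions": 8}   # metrics registered on the original plan
canonicalized_metrics = {}       # canonicalization yields an empty metrics map

assert metrics["numPartitions"] == 8
try:
    canonicalized_metrics["numPartitions"]
    raise AssertionError("expected a lookup failure")
except KeyError:
    # mirrors: java.util.NoSuchElementException: key not found: numPartitions
    pass
```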

@andygrove
Contributor Author

Most of these failures share a single root cause: the CBO is sometimes forcing a GPU CustomShuffleReaderExec back onto the CPU, making it incompatible with the GPU shuffle that has already executed.
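The constraint being violated can be modeled abstractly. The sketch below is a simplified, hypothetical model, not the RAPIDS plugin's actual code: a shuffle reader must run on the same processor as the shuffle exchange that produced its data, so a cost rule that demotes only the reader yields an invalid plan.

```python
# Hypothetical model of the CPU/GPU placement constraint. Class and
# function names are illustrative and do not match the plugin's classes.
from dataclasses import dataclass

@dataclass
class ShuffleExchange:
    on_gpu: bool  # whether the shuffle data was written by the GPU

@dataclass
class ShuffleReader:
    child: ShuffleExchange
    on_gpu: bool

def is_valid(reader: ShuffleReader) -> bool:
    # A reader can only consume shuffle data in the format it was written,
    # so its placement must match the exchange that produced the data.
    return reader.on_gpu == reader.child.on_gpu

def cost_optimize_buggy(reader: ShuffleReader) -> ShuffleReader:
    # Models the bug: demote the reader to CPU on cost alone, ignoring
    # that the GPU shuffle has already happened.
    return ShuffleReader(child=reader.child, on_gpu=False)

def cost_optimize_fixed(reader: ShuffleReader) -> ShuffleReader:
    # Models the fix: never move a reader off the GPU when its input
    # shuffle is a GPU shuffle, regardless of the cost estimate.
    if reader.child.on_gpu:
        return reader
    return ShuffleReader(child=reader.child, on_gpu=False)

reader = ShuffleReader(child=ShuffleExchange(on_gpu=True), on_gpu=True)
assert not is_valid(cost_optimize_buggy(reader))  # inconsistent plan
assert is_valid(cost_optimize_fixed(reader))
```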

@sameerz
Collaborator

sameerz commented Mar 21, 2021

@andygrove is this resolved with #1910 ?

@andygrove
Contributor Author

> @andygrove is this resolved with #1910 ?

@sameerz No, but it is resolved by #1954
