[SPARK-20392][SQL] Set barrier to prevent re-entering a tree #19873

viirya · 2017-12-04T07:18:31Z

What changes were proposed in this pull request?

The SQL Analyzer traverses the whole query plan even when most of it has already been analyzed. This increases the time spent on query analysis, especially for long pipelines in ML.

This patch adds a logical node called AnalysisBarrier that wraps an analyzed logical plan to prevent it from being analyzed again. The barrier is applied to the analyzed logical plan in Dataset. It does not change the output of the wrapped logical plan; it only acts as a wrapper that hides the plan from the analyzer. New operations on the Dataset are placed on top of the barrier, so only the newly created nodes are analyzed.

The analysis barrier is removed at the end of the analysis stage.
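
To illustrate the idea outside of Catalyst, here is a minimal sketch using a toy plan tree. None of these classes are Spark's: `Plan`, `Relation`, `Project`, `Barrier`, `countVisited`, and `removeBarriers` are hypothetical names chosen for the example. The only point it tries to show is that a node which reports no children hides its wrapped subtree from a tree traversal.

```scala
// Toy model for illustration only; these are NOT Spark's Catalyst classes.
object BarrierSketch {
  sealed trait Plan {
    // Rebuilds this node with `f` applied to each child it exposes.
    def mapChildren(f: Plan => Plan): Plan

    // Bottom-up rewrite: exposed children first, then this node.
    def transformUp(rule: PartialFunction[Plan, Plan]): Plan = {
      val afterChildren = mapChildren(_.transformUp(rule))
      rule.applyOrElse(afterChildren, (p: Plan) => p)
    }
  }

  final case class Relation(name: String) extends Plan {
    def mapChildren(f: Plan => Plan): Plan = this
  }

  final case class Project(column: String, child: Plan) extends Plan {
    def mapChildren(f: Plan => Plan): Plan = copy(child = f(child))
  }

  // Holds a subtree but exposes no children, so traversals treat it as a leaf.
  final case class Barrier(child: Plan) extends Plan {
    def mapChildren(f: Plan => Plan): Plan = this
  }

  // Counts how many nodes a single bottom-up pass touches.
  def countVisited(plan: Plan): Int = {
    var visited = 0
    plan.transformUp { case p => visited += 1; p }
    visited
  }

  // Strips every barrier, including nested ones, once the pass is done.
  def removeBarriers(plan: Plan): Plan =
    plan.transformUp { case Barrier(child) => removeBarriers(child) }
}
```

With `analyzed = Project("c2", Project("c1", Relation("t")))`, a pass over `Project("c3", analyzed)` touches 4 nodes, while a pass over `Project("c3", Barrier(analyzed))` touches only 2: the new `Project` and the barrier itself. Calling `removeBarriers` on the wrapped plan gives back the plain tree.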

How was this patch tested?

Added tests.

viirya · 2017-12-04T07:22:17Z

cc @cloud-fan @hvanhovell This is basically the same change as in #17770.

SparkQA · 2017-12-04T08:05:02Z

Test build #84417 has finished for PR 19873 at commit 136fd30.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class AnalysisBarrier(child: LogicalPlan) extends LeafNode

viirya · 2017-12-04T08:06:22Z

retest this please.

SparkQA · 2017-12-04T10:24:42Z

Test build #84420 has finished for PR 19873 at commit 136fd30.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class AnalysisBarrier(child: LogicalPlan) extends LeafNode

SparkQA · 2017-12-04T17:15:22Z

Test build #84430 has finished for PR 19873 at commit 9f5a0e4.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class AnalysisBarrier(child: LogicalPlan) extends LeafNode

cloud-fan · 2017-12-05T06:19:36Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala

-   *
-   * @param rule the function use to transform this nodes children
-   */
-  def resolveOperators(rule: PartialFunction[LogicalPlan, LogicalPlan]): LogicalPlan = {


can we also remove the analyzed flag in this class?

cloud-fan · 2017-12-05T06:23:50Z

sql/core/src/test/scala/org/apache/spark/sql/execution/PlannerSuite.scala

@@ -241,7 +241,7 @@ class PlannerSuite extends SharedSQLContext {
  test("collapse adjacent repartitions") {
    val doubleRepartitioned = testData.repartition(10).repartition(20).coalesce(5)
    def countRepartitions(plan: LogicalPlan): Int = plan.collect { case r: Repartition => r }.length
-    assert(countRepartitions(doubleRepartitioned.queryExecution.logical) === 3)
+    assert(countRepartitions(doubleRepartitioned.queryExecution.analyzed) === 3)


is it a necessary change?

Please see previous discussion: https://github.com/apache/spark/pull/17770/files#r118480364

cloud-fan · 2017-12-05T06:24:08Z

LGTM, also cc @gatorsmile

gatorsmile · 2017-12-05T06:39:28Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala

@@ -280,7 +280,7 @@ object TypeCoercion {
   */
  object WidenSetOperationTypes extends Rule[LogicalPlan] {

-    def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators {
+    def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
      case p if p.analyzed => p


Sorry, what do you mean why?

In which cases should we still use the analyzed flag?

gatorsmile · 2017-12-05T06:41:31Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

@@ -666,7 +667,9 @@ class Analyzer(
     * Generate a new logical plan for the right child with different expression IDs
     * for all conflicting attributes.
     */
-    private def dedupRight (left: LogicalPlan, right: LogicalPlan): LogicalPlan = {
+    private def dedupRight (left: LogicalPlan, oriRight: LogicalPlan): LogicalPlan = {


What is oriRight?

Use originalRight

gatorsmile · 2017-12-05T06:43:53Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala

@@ -470,7 +470,7 @@ case class DataSource(
      }.head
    }
    // For partitioned relation r, r.schema's column ordering can be different from the column
-    // ordering of data.logicalPlan (partition columns are all moved after data column).  This
+    // ordering of data.logicalPlan (partition columns are all moved after data column). This


Get rid of changes in this file.

gatorsmile · 2017-12-05T06:50:20Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

      case sa @ Sort(_, _, child: Aggregate) => sa

-      case s @ Sort(order, _, child) if !s.resolved && child.resolved =>
+      case s @ Sort(order, _, oriChild) if !s.resolved && oriChild.resolved =>


Use originalChild

gatorsmile · 2017-12-05T06:50:38Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

@@ -1098,7 +1103,8 @@ class Analyzer(
          case ae: AnalysisException => s
        }

-      case f @ Filter(cond, child) if !f.resolved && child.resolved =>
+      case f @ Filter(cond, oriChild) if !f.resolved && oriChild.resolved =>


Use originalChild

gatorsmile · 2017-12-05T07:09:48Z

From the PR description, I am unable to tell what changes are made in this PR. We need a better description that explains the solution proposed in this PR.

Please also explain which cases need special handling, and why.

SparkQA · 2017-12-05T12:52:54Z

Test build #84475 has finished for PR 19873 at commit bae034d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-12-05T13:05:23Z

Test build #84477 has finished for PR 19873 at commit 54182bf.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2017-12-05T20:18:43Z

@viirya Could you resolve the conflicts?

gatorsmile · 2017-12-05T20:19:52Z

...alyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala

@@ -881,3 +881,10 @@ case class Deduplicate(

  override def output: Seq[Attribute] = child.output
 }
+
+/** A logical plan for setting a barrier of analysis */
+case class AnalysisBarrier(child: LogicalPlan) extends LeafNode {


Put the PR description in the comment of this class?

gatorsmile · 2017-12-05T20:21:00Z

sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/plans/LogicalPlanSuite.scala

@@ -23,8 +23,8 @@ import org.apache.spark.sql.catalyst.plans.logical._
 import org.apache.spark.sql.types.IntegerType

 /**
- * This suite is used to test [[LogicalPlan]]'s `resolveOperators` and make sure it can correctly
- * skips sub-trees that have already been marked as analyzed.
+ * This suite is used to test [[LogicalPlan]]'s `transformUp` plus analysis barrier and make sure


Since both transformUp and transformDown work, create a test case using transformDown. Also update the comments here.
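
As a rough sketch of why both traversal directions stop at a barrier, consider the toy tree below. It is not Catalyst code: `Plan`, `Relation`, `Project`, `Barrier`, and the `upperCaseColumns` rule are hypothetical names used only for illustration. Because the barrier exposes no children, neither the bottom-up nor the top-down pass can reach the wrapped subtree.

```scala
// Toy model for illustration only; these are NOT Spark's Catalyst classes.
object TraversalSketch {
  sealed trait Plan {
    // Rebuilds this node with `f` applied to each child it exposes.
    def mapChildren(f: Plan => Plan): Plan

    // Bottom-up: rewrite exposed children, then apply the rule to this node.
    def transformUp(rule: PartialFunction[Plan, Plan]): Plan = {
      val afterChildren = mapChildren(_.transformUp(rule))
      rule.applyOrElse(afterChildren, (p: Plan) => p)
    }

    // Top-down: apply the rule to this node, then rewrite the result's children.
    def transformDown(rule: PartialFunction[Plan, Plan]): Plan = {
      val afterRule = rule.applyOrElse(this, (p: Plan) => p)
      afterRule.mapChildren(_.transformDown(rule))
    }
  }

  final case class Relation(name: String) extends Plan {
    def mapChildren(f: Plan => Plan): Plan = this
  }

  final case class Project(column: String, child: Plan) extends Plan {
    def mapChildren(f: Plan => Plan): Plan = copy(child = f(child))
  }

  // Exposes no children, so both traversals treat it as a leaf.
  final case class Barrier(child: Plan) extends Plan {
    def mapChildren(f: Plan => Plan): Plan = this
  }

  // Hypothetical rule: upper-case the column name of every Project it reaches.
  val upperCaseColumns: PartialFunction[Plan, Plan] = {
    case Project(column, child) => Project(column.toUpperCase, child)
  }
}
```

Applied to `Project("b", Barrier(Project("a", Relation("t"))))`, both `transformUp(upperCaseColumns)` and `transformDown(upperCaseColumns)` rewrite only the outer `Project`, leaving the `Project("a", ...)` under the barrier untouched.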

viirya · 2017-12-06T02:13:38Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala

-      case Kurtosis(e @ StringType()) => Kurtosis(Cast(e, DoubleType))
-    }
+    override protected def coerceTypes(plan: LogicalPlan): LogicalPlan =
+      plan transformAllExpressions {


For indentation...

SparkQA · 2017-12-06T05:17:59Z

Test build #84518 has finished for PR 19873 at commit 4775a02.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class ReqAndHandler(req: Request, handler: MemberHandler)
trait TypeCoercionRule extends Rule[LogicalPlan] with Logging

SparkQA · 2017-12-06T05:26:44Z

Test build #84520 has finished for PR 19873 at commit d2375e0.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2017-12-06T05:43:55Z

LGTM

gatorsmile · 2017-12-06T05:44:01Z

Thanks! Merged to master.

viirya · 2017-12-06T06:30:13Z

Thanks! @gatorsmile @cloud-fan

viirya force-pushed the SPARK-20392-reopen branch from 136fd30 to 9f5a0e4 Compare December 4, 2017 14:16

Add analysis barrier around analyzed plans.

9f5a0e4

cloud-fan reviewed Dec 5, 2017

gatorsmile reviewed Dec 5, 2017

viirya force-pushed the SPARK-20392-reopen branch from bae034d to 54182bf Compare December 5, 2017 10:02

Remove analyzed stuff.

54182bf

gatorsmile reviewed Dec 5, 2017

Modify comment and test cases.

b7747c4

viirya commented Dec 6, 2017

viirya added 2 commits December 6, 2017 02:15

Merge remote-tracking branch 'upstream/master' into SPARK-20392-reopen

4775a02

Less change for indentation.

d2375e0

asfgit closed this in 00d176d Dec 6, 2017

HyukjinKwon mentioned this pull request Mar 22, 2023

[SPARK-42896][SQL][PYTHON] Make mapInPandas / mapInArrow support barrier mode execution #40520

Closed

viirya deleted the SPARK-20392-reopen branch December 27, 2023 18:35
