[SPARK-42896][SQL][PYTHON] Make `mapInPandas` / `mapInArrow` support barrier mode execution #40520

WeichenXu123 · 2023-03-22T12:05:30Z

What changes were proposed in this pull request?

Make mapInPandas / mapInArrow support barrier mode execution

Why are the changes needed?

This is the preparation PR for supporting mapInPandas / mapInArrow barrier execution in spark connect mode. The feature is required by machine learning use cases.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

HyukjinKwon · 2023-03-22T13:19:23Z

sql/core/src/main/scala/org/apache/spark/sql/execution/python/MapInPandasExec.scala

@@ -28,7 +28,8 @@ import org.apache.spark.sql.execution.SparkPlan
 case class MapInPandasExec(
    func: Expression,
    output: Seq[Attribute],
-    child: SparkPlan)
+    child: SparkPlan,
+    override val isBarrier: Boolean)


It would require an implementation much complicated than this actually. The current implementation might work for now because these specific physical plans don't move around much for now but the implementation is flaky because the physical plans can change via Catalyst Optimizer (e.g., predicate pushdown) but the barrier execution mode requires that the barrier is created exactly the location where it's invoked.

change via Catalyst Optimizer (e.g., predicate pushdown)

Do you mean pushdown like this ?

Filter MapInPandas XXOp ....

to

MapInPandas Filter XXOp ....

?

I don't think we should allow pushdown in this case. MapInPandas executes arbitrary user code. Can we modify optimizer code to prevent it change the plan when it finds "is_barrier" set ?

The feature mapInPandas / mapInArrow supporting barrier mode is only for ML use case, we don't need to complicate it. We only need to support such scenario:

input_dataset.mapInPandas(..., is_barrier=True).collect()

I meant that the optimizers can might switch the physical plan order in the future, and that's why the implementation is flaky. Now this works because we don't switch the order anywhere else.

Can we modify optimizer code to prevent it change the plan when it finds "is_barrier" set ?

To have a complete implementation, we should have something like #19873 which brought a lot of side effect, and it got reverted in the end.

If this is only for internal purpose, let's probably don't expose is_barrier: bool = False to the API surface at least.

To simply the implementation, we can implement a barrierMapInPandasAndCollect instead, and define a execution plan stage like BarrierMapInPandasAndCollectExec

If this is only for internal purpose, let's probably don't expose is_barrier: bool = False to the API surface at least.

No. It should be public API. Third-party lib such as xgboost also need to use it.

WeichenXu123 · 2023-03-23T00:40:14Z

To address @HyukjinKwon 's concern about optimizer,

can we add is_barrier attribute into UnaryNode,
and if optimizer find a node marking is_barrier as True, then skip all optimizations around the node.

CC @cloud-fan @mengxr WDYT ?

Barrier mode is only used in specific ML case, i.e. in model training routine, we will only use it in one pattern:

dataset.mapInPandas(..., is_barrier=True).collect()

and we don't need complex optimization for it.

cloud-fan · 2023-03-23T00:49:16Z

hmmm why do we need to care about the optimizer? The optimizer is not sensitive to the physical execution engine, e.g. Presto, Spark, Flink have many similar SQL optimizations.

WeichenXu123 · 2023-03-23T00:50:12Z

hmmm why do we need to care about the optimizer? The optimizer is not sensitive to the physical execution engine, e.g. Preso, Spark, Flink have many similar SQL optimizations.

I am not familier with optimizer details but it is concern from @HyukjinKwon
But note this PR also changes logical plan operator like MapInPandas

HyukjinKwon · 2023-03-23T01:07:03Z

Predicate pushdown is just an example. e.g., you might want to combine adjacent MapInPandass but it would need a special handling if is_barrier flag is added.

HyukjinKwon · 2023-03-23T01:08:25Z

I am saying that real power of Catalyst optimizer is to optimize/reorder these logical plans, and I believe that's the reason why barrier execution wasn't introduced in SQL. The barrier has to be created exactly when the call is invoked, in which basically it requires something like #19873 to have a sound implementation.

However, I am fine with having this as an exception if you guys are fine with this.

zhengruifeng · 2023-03-23T02:22:40Z

Barrier mode is only used in specific ML case, i.e. in model training routine, we will only use it in one pattern:

dataset.mapInPandas(..., is_barrier=True).collect()

To simply the implementation, we can implement a barrierMapInPandasAndCollect instead, and define a execution plan stage like BarrierMapInPandasAndCollectExec

If it is the only usage case, i think it will be safe to add dedicated logical plan and physical plan for it.

WeichenXu123 · 2023-03-23T02:39:24Z

I am saying that real power of Catalyst optimizer is to optimize/reorder these logical plans, and I believe that's the reason why barrier execution wasn't introduced in SQL. The barrier has to be created exactly when the call is invoked, in which basically it requires something like #19873 to have a sound implementation.

However, I am fine with having this as an exception if you guys are fine with this.

@cloud-fan What do you think of this ?

cloud-fan · 2023-03-23T02:56:45Z

From a SQL engine's point of view, running all tasks at once or batch by batch doesn't matter. It doesn't change the semantics of the SQL operator, and the optimizer doesn't care about it. However, mapInPandas is a public API and you are free to define what's the expectation of the is_barrier parameter. To make our life easier, we can just define it as "the tasks of running the pandas function must all be launched at once", and it's not a barrier to the SQL operators. Then I think it's fine to just add a flag to the existing MapInPandas operator.

python/pyspark/sql/pandas/map_ops.py

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

HyukjinKwon

Ok lgtm

WeichenXu123 · 2023-03-27T01:42:32Z

merged to master :)

ueshin · 2023-03-28T00:17:48Z

python/pyspark/sql/pandas/map_ops.py

@@ -60,6 +60,7 @@ def mapInPandas(
        schema : :class:`pyspark.sql.types.DataType` or str
            the return type of the `func` in PySpark. The value can be either a
            :class:`pyspark.sql.types.DataType` object or a DDL-formatted type string.
+        isBarrier : Use barrier mode execution if True.


We should mark these new arguments by .. versionadded?

Yes we should. with the directive .. versionchanged.

let me just make a quick followup.

…urceProfile ### What changes were proposed in this pull request? Support stage-level scheduling for some PySpark DataFrame APIs (mapInPandas and mapInArrow). ### Why are the changes needed? The introduction of barrier mode in Spark, as seen in #40520, allows for the implementation of Spark ML cases (pure Python algorithms) using DataFrame APIs such as mapInPandas and mapInArrow, so it's necessary to enable stage-level scheduling for DataFrame APIs. ### Does this PR introduce _any_ user-facing change? Yes, This PR adds a new argument "profile" for mapInPandas and mapInArrow. ``` python def mapInPandas( self, func: "PandasMapIterFunction", schema: Union[StructType, str], barrier: bool = False, profile: Optional[ResourceProfile] = None, ) -> "DataFrame": def mapInArrow( self, func: "ArrowMapIterFunction", schema: Union[StructType, str], barrier: bool = False, profile: Optional[ResourceProfile] = None, ) -> "DataFrame": ``` How to use it? take mapInPandas as an example, ``` python from pyspark import TaskContext def func(iterator): tc = TaskContext.get() assert tc.cpus() == 3 for batch in iterator: yield batch df = spark.range(10) from pyspark.resource import TaskResourceRequests, ResourceProfileBuilder treqs = TaskResourceRequests().cpus(3) rp = ResourceProfileBuilder().require(treqs).build df.mapInPandas(func, "id long", False, rp).collect() ``` ### How was this patch tested? The newly added tests can pass, and some manual tests are needed for dynamic allocation on or off. ### Was this patch authored or co-authored using generative AI tooling? No Closes #44852 from wbo4958/df-rp. Lead-authored-by: Bobby Wang <bobwang@nvidia.com> Co-authored-by: Bobby Wang <wbo4958@gmail.com> Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

init

4de9d07

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

WeichenXu123 marked this pull request as draft March 22, 2023 12:05

github-actions bot added CORE PYTHON SQL labels Mar 22, 2023

update

f612bcb

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

WeichenXu123 changed the title ~~[SPARK-42896][SQL][PYSPARK] Make mapInPandas / mapInArrow` support barrier mode execution~~ [SPARK-42896][SQL][PYSPARK] Make mapInPandas / mapInArrow support barrier mode execution Mar 22, 2023

update

e95fd95

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

github-actions bot added the CONNECT label Mar 22, 2023

HyukjinKwon reviewed Mar 22, 2023

View reviewed changes

HyukjinKwon reviewed Mar 23, 2023

View reviewed changes

python/pyspark/sql/pandas/map_ops.py Outdated Show resolved Hide resolved

HyukjinKwon reviewed Mar 23, 2023

View reviewed changes

python/pyspark/sql/pandas/map_ops.py Outdated Show resolved Hide resolved

HyukjinKwon reviewed Mar 23, 2023

View reviewed changes

python/pyspark/sql/pandas/map_ops.py Show resolved Hide resolved

update

aa7862f

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

HyukjinKwon changed the title ~~[SPARK-42896][SQL][PYSPARK] Make mapInPandas / mapInArrow support barrier mode execution~~ [SPARK-42896][SQL][PYTHON] Make mapInPandas / mapInArrow support barrier mode execution Mar 24, 2023

WeichenXu123 added 3 commits March 24, 2023 18:17

update

9be4cad

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

Merge branch 'master' into barrier-udf

9ebb3f2

fix test

85b840b

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

WeichenXu123 marked this pull request as ready for review March 24, 2023 10:20

update format

66f13cb

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

WeichenXu123 closed this in 06bf544 Mar 27, 2023

HyukjinKwon reviewed Mar 27, 2023

View reviewed changes

ueshin reviewed Mar 28, 2023

View reviewed changes

This was referenced Jan 26, 2024

[SPARK-46812][SQL][PYTHON] Make mapInPandas / mapInArrow support ResourceProfile #44852

Closed

[FEA] support barrier mode for mapInPandas/mapInArrow NVIDIA/spark-rapids#10344

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SPARK-42896][SQL][PYTHON] Make `mapInPandas` / `mapInArrow` support barrier mode execution #40520

[SPARK-42896][SQL][PYTHON] Make `mapInPandas` / `mapInArrow` support barrier mode execution #40520

WeichenXu123 commented Mar 22, 2023 •

edited

Loading

HyukjinKwon Mar 22, 2023

WeichenXu123 Mar 22, 2023 •

edited

Loading

WeichenXu123 Mar 22, 2023

HyukjinKwon Mar 22, 2023

HyukjinKwon Mar 22, 2023

WeichenXu123 Mar 22, 2023

WeichenXu123 Mar 23, 2023

WeichenXu123 commented Mar 23, 2023 •

edited

Loading

cloud-fan commented Mar 23, 2023 •

edited

Loading

WeichenXu123 commented Mar 23, 2023 •

edited

Loading

HyukjinKwon commented Mar 23, 2023

HyukjinKwon commented Mar 23, 2023 •

edited

Loading

zhengruifeng commented Mar 23, 2023 •

edited

Loading

WeichenXu123 commented Mar 23, 2023

cloud-fan commented Mar 23, 2023

HyukjinKwon left a comment

WeichenXu123 commented Mar 27, 2023

ueshin Mar 28, 2023

HyukjinKwon Mar 28, 2023

HyukjinKwon Mar 28, 2023

HyukjinKwon Mar 28, 2023

[SPARK-42896][SQL][PYTHON] Make mapInPandas / mapInArrow support barrier mode execution #40520

[SPARK-42896][SQL][PYTHON] Make mapInPandas / mapInArrow support barrier mode execution #40520

Conversation

WeichenXu123 commented Mar 22, 2023 • edited Loading

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Choose a reason for hiding this comment

WeichenXu123 Mar 22, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

WeichenXu123 commented Mar 23, 2023 • edited Loading

cloud-fan commented Mar 23, 2023 • edited Loading

WeichenXu123 commented Mar 23, 2023 • edited Loading

HyukjinKwon commented Mar 23, 2023

HyukjinKwon commented Mar 23, 2023 • edited Loading

zhengruifeng commented Mar 23, 2023 • edited Loading

WeichenXu123 commented Mar 23, 2023

cloud-fan commented Mar 23, 2023

HyukjinKwon left a comment

Choose a reason for hiding this comment

WeichenXu123 commented Mar 27, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

[SPARK-42896][SQL][PYTHON] Make `mapInPandas` / `mapInArrow` support barrier mode execution #40520

[SPARK-42896][SQL][PYTHON] Make `mapInPandas` / `mapInArrow` support barrier mode execution #40520

WeichenXu123 commented Mar 22, 2023 •

edited

Loading

WeichenXu123 Mar 22, 2023 •

edited

Loading

WeichenXu123 commented Mar 23, 2023 •

edited

Loading

cloud-fan commented Mar 23, 2023 •

edited

Loading

WeichenXu123 commented Mar 23, 2023 •

edited

Loading

HyukjinKwon commented Mar 23, 2023 •

edited

Loading

zhengruifeng commented Mar 23, 2023 •

edited

Loading