[BUG] Excessive serialization due to member access in mapPartitions closure #1101

Closed
jlowe opened this issue Nov 11, 2020 · 1 comment
Labels
bug Something isn't working

Comments

@jlowe
Member

jlowe commented Nov 11, 2020

Describe the bug
Related to #1097: in some places the closure passed to mapPartitions references members of the enclosing class, which forces the entire enclosing instance to be serialized along with the closure. In many cases that instance is a SparkPlan, so most of the query plan objects are serialized to the task unnecessarily.

Steps/Code to reproduce bug
Examine the source for calls to mapPartitions where the closure references class constructor arguments or fields/methods of the enclosing instance.

Expected behavior
Values needed in the closure should be cached in a local val just before the mapPartitions call to avoid the need to serialize the entire outer object.
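To make the difference concrete, here is a minimal sketch that needs no Spark dependency. It assumes Scala 2.12 or later, where function literals are serializable, and uses plain Java serialization to compare what each closure drags along. `FakePlan`, `leaky`, `lean` and `serializedSize` are hypothetical names made up for this illustration; they are not spark-rapids or Spark APIs.

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

object ClosureCaptureSketch {
  // Hypothetical stand-in for a large plan node; not a real spark-rapids class.
  class FakePlan(val scale: Int) extends Serializable {
    // Bulky state standing in for the rest of the query plan (1 MiB).
    val payload: Array[Byte] = new Array[Byte](1 << 20)

    // Reads the member `scale` inside the closure, so the closure captures `this`.
    def leaky: Int => Int = x => x * scale

    // Copies the member into a local val first, so only an Int is captured.
    def lean: Int => Int = {
      val localScale = scale
      x => x * localScale
    }
  }

  // Size in bytes of an object under Java serialization.
  def serializedSize(obj: AnyRef): Int = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    try out.writeObject(obj) finally out.close()
    bytes.size()
  }

  def main(args: Array[String]): Unit = {
    val plan = new FakePlan(3)
    println(s"leaky closure: ${serializedSize(plan.leaky)} bytes")
    println(s"lean closure:  ${serializedSize(plan.lean)} bytes")
  }
}
```

The leaky closure should serialize to more than 1 MiB because it carries the whole `FakePlan`, while the lean one only carries the copied Int. The same local-val pattern applies right before a real mapPartitions call.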

@jlowe jlowe added the "bug" (Something isn't working) and "? - Needs Triage" (Need team to review and classify) labels Nov 11, 2020
@jlowe jlowe changed the title from "[BUG] Excessive serialization due to class member access in mapPartitions closure" to "[BUG] Excessive serialization due to member access in mapPartitions closure" Nov 11, 2020
@sameerz sameerz removed the "? - Needs Triage" (Need team to review and classify) label Nov 17, 2020
@jlowe
Member Author

jlowe commented Nov 17, 2020

We might be able to temporarily add something unserializable to GpuExec and run the unit tests to catch places where we are serializing the entire plan when we shouldn't be.
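A rough sketch of that idea, again without Spark and assuming Scala 2.12 or later: a base trait holds a field whose type is not serializable, so any closure that captures the node fails to serialize. `Tripwire`, `FakeExec`, `FakeProject` and `capturesPlan` are hypothetical names for illustration only; the real GpuExec is not reproduced here.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

object SerializationTripwireSketch {
  // Deliberately does not extend Serializable.
  class Tripwire

  // Hypothetical stand-in for GpuExec.
  trait FakeExec extends Serializable {
    // Not @transient, so serializing any FakeExec fails at this field.
    val tripwire = new Tripwire
  }

  class FakeProject(val offset: Int) extends FakeExec {
    // References a member, so the closure captures the whole node.
    def leaky: Int => Int = x => x + offset

    // Captures only the copied Int.
    def lean: Int => Int = {
      val localOffset = offset
      x => x + localOffset
    }
  }

  // True if Java serialization of the closure hits the tripwire,
  // i.e. the closure pulled in a FakeExec instance.
  def capturesPlan(closure: AnyRef): Boolean = {
    val out = new ObjectOutputStream(new ByteArrayOutputStream())
    try { out.writeObject(closure); false }
    catch { case _: NotSerializableException => true }
    finally out.close()
  }
}
```

With this in place, `capturesPlan(node.leaky)` reports the capture and `capturesPlan(node.lean)` does not. In the real code base the failure would instead surface as a task serialization error in whichever unit test exercises the offending closure.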

tgravescs pushed a commit to tgravescs/spark-rapids that referenced this issue Nov 30, 2023
…p ci] [bot] (NVIDIA#1101)

* Update submodule cudf to 777c1f4e2307747d33efd755629405e5a5acd4cd

Signed-off-by: spark-rapids automation <70000568+nvauto@users.noreply.github.com>

* Update submodule cudf to ac2695365c2b594c44a4aeeedff9899df53c0e90

Signed-off-by: spark-rapids automation <70000568+nvauto@users.noreply.github.com>

* Update submodule cudf to 17a0068c9753c37b30f040866a9a5b5b0bdf8076

Signed-off-by: spark-rapids automation <70000568+nvauto@users.noreply.github.com>

* Update submodule cudf to 52342789a367153bedbc804efdc0a54a7b6ed083

Signed-off-by: spark-rapids automation <70000568+nvauto@users.noreply.github.com>

* Update submodule cudf to 5df43673aeceb81004f3643605cfd5ae6f969563

Signed-off-by: spark-rapids automation <70000568+nvauto@users.noreply.github.com>

---------

Signed-off-by: spark-rapids automation <70000568+nvauto@users.noreply.github.com>
@jlowe jlowe closed this as not planned (won't fix, can't repro, duplicate, stale) Feb 14, 2024
2 participants