Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-42896][SQL][PYTHON] Make
mapInPandas
/mapInArrow
support barrier mode execution #40520[SPARK-42896][SQL][PYTHON] Make
mapInPandas
/mapInArrow
support barrier mode execution #40520Changes from 3 commits
4de9d07
f612bcb
e95fd95
aa7862f
9be4cad
9ebb3f2
85b840b
66f13cb
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would require an implementation much complicated than this actually. The current implementation might work for now because these specific physical plans don't move around much for now but the implementation is flaky because the physical plans can change via Catalyst Optimizer (e.g., predicate pushdown) but the barrier execution mode requires that the barrier is created exactly the location where it's invoked.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean pushdown like this ?
to
?
I don't think we should allow pushdown in this case.
MapInPandas
executes arbitrary user code. Can we modify optimizer code to prevent it change the plan when it finds "is_barrier" set ?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The feature
mapInPandas / mapInArrow
supporting barrier mode is only for ML use case, we don't need to complicate it. We only need to support such scenario:There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant that the optimizers can might switch the physical plan order in the future, and that's why the implementation is flaky. Now this works because we don't switch the order anywhere else.
To have a complete implementation, we should have something like #19873 which brought a lot of side effect, and it got reverted in the end.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is only for internal purpose, let's probably don't expose
is_barrier: bool = False
to the API surface at least.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To simply the implementation, we can implement a
barrierMapInPandasAndCollect
instead, and define a execution plan stage likeBarrierMapInPandasAndCollectExec
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No. It should be public API. Third-party lib such as xgboost also need to use it.