-
Notifications
You must be signed in to change notification settings - Fork 28.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-18621][PYTHON] Make sql type reprs eval-able #34320
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems reasonable to me. I suppose this is a user-facing change and always some concern over breaking something, but, seems like the intent of __repr__
can only be to be evaluatable, right?
Jenkins test this please |
Test build #144417 has finished for PR 34320 at commit
|
I believe I fixed the failed style tests. I'm not sure how to enable the workflow run; GitHub Actions is enabled on my fork. |
Jenkins retest this please |
Kubernetes integration test starting |
Test build #144419 has finished for PR 34320 at commit
|
Kubernetes integration test starting |
Kubernetes integration test status failure |
Kubernetes integration test status failure |
I fixed the related dataframe test. I'm not sure how the k8s failures are related to these changes. |
I think you can ignore that here |
Jenkins retest this please |
Test build #144427 has finished for PR 34320 at commit
|
Kubernetes integration test starting |
a58314c
to
f1f2388
Compare
f1f2388
to
6c51de6
Compare
6c51de6
to
622739f
Compare
Kubernetes integration test status failure |
I think I've got everything passing. There were a lot of doctests that needed to be updated. |
Can one of the admins verify this patch? |
Updated once more to resolve conflicts. |
In general it looks reasonable (it is common to want to reuse inferred schema for example) ‒ my only concern is that it is going to break user snapshot tests and possibly affect things like language-to-language migrations. So if it is going to be merged it has to be included as potentially breaking change in the release notes. |
Should we include a note in the |
Just wanted to check-in on the PR status here. If there is still some risk of merging I could always release this as a separate package which patches in the reprs. |
cc @HyukjinKwon |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I don't know how to evaluate the risk of breaks. I think it's reasonable. I agree that a quick note in the migration guide would be safe
python/pyspark/pandas/spark/utils.py
Outdated
ArrayType(DecimalType(30,15),false),false),false),StructField(b,StringType,true))),true),\ | ||
StructField(B,DecimalType(30,15),false))) | ||
StructType([StructField('A', | ||
StructType([StructField('a', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cab we get rid of all these white spaces? or at least two space indentation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll take another look. IIRC there was some nuance with doctests and wrapped code/whitespace that was difficult to work around, which is why it has the odd hanging indents here.
I kind of like this change .. and can't think of a case this change breaks something for now .. so I am fine with this. By right, yeah |
@crflynn mind running |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Otherwise, LGTM from me side. It follows what __repr__
is supposed to be according to the official Python docs. cc @BryanCutler @viirya @ueshin FYI
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks reasonable. It's nice to add a note into the migration guide.
Merged to master/3.3 |
### What changes were proposed in this pull request? These changes update the `__repr__` methods of type classes in `pyspark.sql.types` to print string representations which are `eval`-able. In other words, any instance of a `DataType` will produce a repr which can be passed to `eval()` to create an identical instance. Similar changes previously submitted: #25495 ### Why are the changes needed? This [bug](https://issues.apache.org/jira/browse/SPARK-18621) has been around for a while. The current implementation returns a string representation which is valid in scala rather than python. These changes fix the repr to be valid with python. The [motivation](https://docs.python.org/3/library/functions.html#repr) is "to return a string that would yield an object with the same value when passed to eval()". ### Does this PR introduce _any_ user-facing change? Example: Current implementation: ```python from pyspark.sql.types import * struct = StructType([StructField('f1', StringType(), True)]) repr(struct) # StructType(List(StructField(f1,StringType,true))) new_struct = eval(repr(struct)) # Traceback (most recent call last): # File "<input>", line 1, in <module> # File "<string>", line 1, in <module> # NameError: name 'List' is not defined struct_field = StructField('f1', StringType(), True) repr(struct_field) # StructField(f1,StringType,true) new_struct_field = eval(repr(struct_field)) # Traceback (most recent call last): # File "<input>", line 1, in <module> # File "<string>", line 1, in <module> # NameError: name 'f1' is not defined ``` With changes: ```python from pyspark.sql.types import * struct = StructType([StructField('f1', StringType(), True)]) repr(struct) # StructType([StructField('f1', StringType(), True)]) new_struct = eval(repr(struct)) struct == new_struct # True struct_field = StructField('f1', StringType(), True) repr(struct_field) # StructField('f1', StringType(), True) new_struct_field = eval(repr(struct_field)) struct_field == new_struct_field # True ``` ### How was this patch tested? The changes include a test which asserts that an instance of each type is equal to the `eval` of its `repr`, as in the above example. Closes #34320 from crflynn/sql-types-repr. Lead-authored-by: flynn <crf204@gmail.com> Co-authored-by: Flynn <crflynn@users.noreply.github.com> Signed-off-by: Sean Owen <srowen@gmail.com> (cherry picked from commit c5ebdc6) Signed-off-by: Sean Owen <srowen@gmail.com>
What changes were proposed in this pull request?
These changes update the
__repr__
methods of type classes inpyspark.sql.types
to print string representations which areeval
-able. In other words, any instance of aDataType
will produce a repr which can be passed toeval()
to create an identical instance.Similar changes previously submitted: #25495
Why are the changes needed?
This bug has been around for a while. The current implementation returns a string representation which is valid in scala rather than python. These changes fix the repr to be valid with python.
The motivation is "to return a string that would yield an object with the same value when passed to eval()".
Does this PR introduce any user-facing change?
Example:
Current implementation:
With changes:
How was this patch tested?
The changes include a test which asserts that an instance of each type is equal to the
eval
of itsrepr
, as in the above example.