Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-18621][PYTHON] Make sql type reprs eval-able #34320

Closed
wants to merge 17 commits into from

Conversation

crflynn
Copy link
Contributor

@crflynn crflynn commented Oct 19, 2021

What changes were proposed in this pull request?

These changes update the __repr__ methods of type classes in pyspark.sql.types to print string representations which are eval-able. In other words, any instance of a DataType will produce a repr which can be passed to eval() to create an identical instance.

Similar changes previously submitted: #25495

Why are the changes needed?

This bug has been around for a while. The current implementation returns a string representation which is valid in scala rather than python. These changes fix the repr to be valid with python.

The motivation is "to return a string that would yield an object with the same value when passed to eval()".

Does this PR introduce any user-facing change?

Example:

Current implementation:

from pyspark.sql.types import *

struct = StructType([StructField('f1', StringType(), True)])
repr(struct)
# StructType(List(StructField(f1,StringType,true)))
new_struct = eval(repr(struct))
# Traceback (most recent call last):
#   File "<input>", line 1, in <module>
#   File "<string>", line 1, in <module>
# NameError: name 'List' is not defined

struct_field = StructField('f1', StringType(), True)
repr(struct_field)
# StructField(f1,StringType,true)
new_struct_field = eval(repr(struct_field))
# Traceback (most recent call last):
#   File "<input>", line 1, in <module>
#   File "<string>", line 1, in <module>
# NameError: name 'f1' is not defined

With changes:

from pyspark.sql.types import *

struct = StructType([StructField('f1', StringType(), True)])
repr(struct)
# StructType([StructField('f1', StringType(), True)])
new_struct = eval(repr(struct))
struct == new_struct
# True

struct_field = StructField('f1', StringType(), True)
repr(struct_field)
# StructField('f1', StringType(), True)
new_struct_field = eval(repr(struct_field))
struct_field == new_struct_field
# True

How was this patch tested?

The changes include a test which asserts that an instance of each type is equal to the eval of its repr, as in the above example.

Copy link
Member

@srowen srowen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems reasonable to me. I suppose this is a user-facing change and always some concern over breaking something, but, seems like the intent of __repr__ can only be to be evaluatable, right?

@srowen
Copy link
Member

srowen commented Oct 19, 2021

Jenkins test this please

@SparkQA
Copy link

SparkQA commented Oct 19, 2021

Test build #144417 has finished for PR 34320 at commit 36aa8d1.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@crflynn
Copy link
Contributor Author

crflynn commented Oct 19, 2021

I believe I fixed the failed style tests. I'm not sure how to enable the workflow run; GitHub Actions is enabled on my fork.

@srowen
Copy link
Member

srowen commented Oct 19, 2021

Jenkins retest this please

@SparkQA
Copy link

SparkQA commented Oct 19, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48891/

@SparkQA
Copy link

SparkQA commented Oct 19, 2021

Test build #144419 has finished for PR 34320 at commit 4064f8d.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Oct 19, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48893/

@SparkQA
Copy link

SparkQA commented Oct 19, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48891/

@SparkQA
Copy link

SparkQA commented Oct 19, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48893/

@crflynn
Copy link
Contributor Author

crflynn commented Oct 19, 2021

I fixed the related dataframe test. I'm not sure how the k8s failures are related to these changes.

@srowen
Copy link
Member

srowen commented Oct 19, 2021

I think you can ignore that here

@srowen
Copy link
Member

srowen commented Oct 19, 2021

Jenkins retest this please

@SparkQA
Copy link

SparkQA commented Oct 19, 2021

Test build #144427 has finished for PR 34320 at commit b647db2.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Oct 19, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48900/

@SparkQA
Copy link

SparkQA commented Oct 19, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48900/

@github-actions github-actions bot added the ML label Oct 19, 2021
@crflynn
Copy link
Contributor Author

crflynn commented Oct 21, 2021

I think I've got everything passing. There were a lot of doctests that needed to be updated.

@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@crflynn
Copy link
Contributor Author

crflynn commented Dec 22, 2021

Updated once more to resolve conflicts.

@zero323
Copy link
Member

zero323 commented Dec 22, 2021

In general it looks reasonable (it is common to want to reuse inferred schema for example) ‒ my only concern is that it is going to break user snapshot tests and possibly affect things like language-to-language migrations. So if it is going to be merged it has to be included as potentially breaking change in the release notes.

@crflynn
Copy link
Contributor Author

crflynn commented Dec 29, 2021

Should we include a note in the migration_guide under pyspark docs?

@crflynn
Copy link
Contributor Author

crflynn commented Mar 9, 2022

Just wanted to check-in on the PR status here. If there is still some risk of merging I could always release this as a separate package which patches in the reprs.

@zero323
Copy link
Member

zero323 commented Mar 12, 2022

cc @HyukjinKwon

Copy link
Member

@srowen srowen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I don't know how to evaluate the risk of breaks. I think it's reasonable. I agree that a quick note in the migration guide would be safe

@HyukjinKwon HyukjinKwon changed the title [SPARK-18621][PYTHON] make sql type reprs eval-able [SPARK-18621][PYTHON] Make sql type reprs eval-able Mar 14, 2022
ArrayType(DecimalType(30,15),false),false),false),StructField(b,StringType,true))),true),\
StructField(B,DecimalType(30,15),false)))
StructType([StructField('A',
StructType([StructField('a',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cab we get rid of all these white spaces? or at least two space indentation.

Copy link
Contributor Author

@crflynn crflynn Mar 15, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll take another look. IIRC there was some nuance with doctests and wrapped code/whitespace that was difficult to work around, which is why it has the odd hanging indents here.

@HyukjinKwon
Copy link
Member

HyukjinKwon commented Mar 14, 2022

I kind of like this change .. and can't think of a case this change breaks something for now .. so I am fine with this. By right, yeah __repr__ should return something evaluable though that's not the case in many projects outside in practice up to my best knowledge.

@HyukjinKwon
Copy link
Member

@crflynn mind running dev/reformat-python script to reformat the codes?

Copy link
Member

@HyukjinKwon HyukjinKwon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Otherwise, LGTM from me side. It follows what __repr__ is supposed to be according to the official Python docs. cc @BryanCutler @viirya @ueshin FYI

Copy link
Member

@BryanCutler BryanCutler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Member

@viirya viirya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks reasonable. It's nice to add a note into the migration guide.

@srowen srowen closed this in c5ebdc6 Mar 23, 2022
@srowen
Copy link
Member

srowen commented Mar 23, 2022

Merged to master/3.3

srowen pushed a commit that referenced this pull request Mar 23, 2022
### What changes were proposed in this pull request?

These changes update the `__repr__` methods of type classes in `pyspark.sql.types` to print string representations which are `eval`-able. In other words, any instance of a `DataType` will produce a repr which can be passed to `eval()` to create an identical instance.

Similar changes previously submitted: #25495

### Why are the changes needed?

This [bug](https://issues.apache.org/jira/browse/SPARK-18621) has been around for a while. The current implementation returns a string representation which is valid in scala rather than python. These changes fix the repr to be valid with python.

The [motivation](https://docs.python.org/3/library/functions.html#repr) is "to return a string that would yield an object with the same value when passed to eval()".

### Does this PR introduce _any_ user-facing change?

Example:

Current implementation:

```python
from pyspark.sql.types import *

struct = StructType([StructField('f1', StringType(), True)])
repr(struct)
# StructType(List(StructField(f1,StringType,true)))
new_struct = eval(repr(struct))
# Traceback (most recent call last):
#   File "<input>", line 1, in <module>
#   File "<string>", line 1, in <module>
# NameError: name 'List' is not defined

struct_field = StructField('f1', StringType(), True)
repr(struct_field)
# StructField(f1,StringType,true)
new_struct_field = eval(repr(struct_field))
# Traceback (most recent call last):
#   File "<input>", line 1, in <module>
#   File "<string>", line 1, in <module>
# NameError: name 'f1' is not defined
```

With changes:

```python
from pyspark.sql.types import *

struct = StructType([StructField('f1', StringType(), True)])
repr(struct)
# StructType([StructField('f1', StringType(), True)])
new_struct = eval(repr(struct))
struct == new_struct
# True

struct_field = StructField('f1', StringType(), True)
repr(struct_field)
# StructField('f1', StringType(), True)
new_struct_field = eval(repr(struct_field))
struct_field == new_struct_field
# True
```

### How was this patch tested?

The changes include a test which asserts that an instance of each type is equal to the `eval` of its `repr`, as in the above example.

Closes #34320 from crflynn/sql-types-repr.

Lead-authored-by: flynn <crf204@gmail.com>
Co-authored-by: Flynn <crflynn@users.noreply.github.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
(cherry picked from commit c5ebdc6)
Signed-off-by: Sean Owen <srowen@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants