[SPARK-18621][PYTHON] Make sql type reprs eval-able #34320

crflynn · 2021-10-19T02:42:38Z

What changes were proposed in this pull request?

These changes update the __repr__ methods of type classes in pyspark.sql.types to print string representations which are eval-able. In other words, any instance of a DataType will produce a repr which can be passed to eval() to create an identical instance.

Similar changes previously submitted: #25495

Why are the changes needed?

This bug has been around for a while. The current implementation returns a string representation which is valid in Scala rather than Python. These changes fix the repr so that it is valid Python.

The motivation is "to return a string that would yield an object with the same value when passed to eval()".
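
As a rough illustration of what an eval-able repr involves, here is a minimal sketch. The classes below are hypothetical stand-ins that only borrow the PySpark names; they are not the real `pyspark.sql.types` classes, and the actual implementation in this PR may differ. The idea is that each `__repr__` prints a constructor call, using `%r` so that strings are quoted and nested types recurse into their own repr.

```python
class DataType:
    """Hypothetical stand-in base class; not the real pyspark.sql.types.DataType."""

    def __eq__(self, other):
        # Two instances are equal when they share a class and all attributes.
        return type(self) is type(other) and self.__dict__ == other.__dict__

    def __repr__(self):
        # Parameterless types print as a constructor call, e.g. StringType().
        return f"{type(self).__name__}()"


class StringType(DataType):
    pass


class StructField(DataType):
    def __init__(self, name, dataType, nullable=True):
        self.name = name
        self.dataType = dataType
        self.nullable = nullable

    def __repr__(self):
        # %r quotes the name and recurses into the nested type's own repr.
        return "StructField(%r, %r, %r)" % (self.name, self.dataType, self.nullable)


class StructType(DataType):
    def __init__(self, fields):
        self.fields = list(fields)

    def __repr__(self):
        # repr() of a list already yields valid Python list syntax.
        return "StructType(%r)" % (self.fields,)


struct = StructType([StructField("f1", StringType(), True)])
rebuilt = eval(repr(struct))  # only eval strings you trust
```

With these stand-ins, `rebuilt == struct` holds, which is the property the PR description asks of the real types.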

Does this PR introduce any user-facing change?

Example:

Current implementation:

from pyspark.sql.types import *

struct = StructType([StructField('f1', StringType(), True)])
repr(struct)
# StructType(List(StructField(f1,StringType,true)))
new_struct = eval(repr(struct))
# Traceback (most recent call last):
#   File "<input>", line 1, in <module>
#   File "<string>", line 1, in <module>
# NameError: name 'List' is not defined

struct_field = StructField('f1', StringType(), True)
repr(struct_field)
# StructField(f1,StringType,true)
new_struct_field = eval(repr(struct_field))
# Traceback (most recent call last):
#   File "<input>", line 1, in <module>
#   File "<string>", line 1, in <module>
# NameError: name 'f1' is not defined

With changes:

from pyspark.sql.types import *

struct = StructType([StructField('f1', StringType(), True)])
repr(struct)
# StructType([StructField('f1', StringType(), True)])
new_struct = eval(repr(struct))
struct == new_struct
# True

struct_field = StructField('f1', StringType(), True)
repr(struct_field)
# StructField('f1', StringType(), True)
new_struct_field = eval(repr(struct_field))
struct_field == new_struct_field
# True

How was this patch tested?

The changes include a test which asserts that an instance of each type is equal to the eval of its repr, as in the above example.
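
The round-trip property described above can be checked with a small helper. This is a sketch only: `roundtrips` and `NAMESPACE` are names invented for illustration, and standard-library types are used in place of PySpark types so the snippet runs without Spark installed. The test added by the PR may be structured differently.

```python
import datetime
from fractions import Fraction


def roundtrips(obj, namespace):
    """Return True if eval(repr(obj)) rebuilds an object equal to obj."""
    try:
        # Evaluate in an explicit namespace so only known names resolve.
        rebuilt = eval(repr(obj), dict(namespace))
    except Exception:
        # A repr such as "<object object at 0x...>" is not valid Python.
        return False
    return type(rebuilt) is type(obj) and rebuilt == obj


# Names that the reprs below refer to (an assumption of this sketch).
NAMESPACE = {"datetime": datetime, "Fraction": Fraction}

ok_fraction = roundtrips(Fraction(1, 3), NAMESPACE)
ok_date = roundtrips(datetime.date(2021, 10, 19), NAMESPACE)
ok_plain_object = roundtrips(object(), NAMESPACE)
```

Here the first two checks pass and the last one fails, because the default `object` repr is not an expression. As with any use of `eval`, this is only appropriate for strings produced by trusted code.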

srowen

Seems reasonable to me. I suppose this is a user-facing change, and there is always some concern over breaking something, but it seems like the intent of __repr__ can only be to be evaluable, right?

srowen · 2021-10-19T15:49:41Z

Jenkins test this please

SparkQA · 2021-10-19T15:52:24Z

Test build #144417 has finished for PR 34320 at commit 36aa8d1.

This patch fails Python style tests.
This patch merges cleanly.
This patch adds no public classes.

crflynn · 2021-10-19T16:11:12Z

I believe I fixed the failed style tests. I'm not sure how to enable the workflow run; GitHub Actions is enabled on my fork.

srowen · 2021-10-19T16:13:27Z

Jenkins retest this please

SparkQA · 2021-10-19T16:44:51Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48891/

SparkQA · 2021-10-19T16:56:04Z

Test build #144419 has finished for PR 34320 at commit 4064f8d.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-10-19T17:32:53Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48893/

SparkQA · 2021-10-19T17:33:31Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48891/

SparkQA · 2021-10-19T18:33:56Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48893/

crflynn · 2021-10-19T18:43:40Z

I fixed the related dataframe test. I'm not sure how the k8s failures are related to these changes.

srowen · 2021-10-19T18:54:16Z

I think you can ignore that here

srowen · 2021-10-19T18:54:24Z

Jenkins retest this please

SparkQA · 2021-10-19T19:14:23Z

Test build #144427 has finished for PR 34320 at commit b647db2.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-10-19T19:37:33Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48900/

SparkQA · 2021-10-19T20:36:18Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48900/

crflynn · 2021-10-21T14:59:15Z

I think I've got everything passing. There were a lot of doctests that needed to be updated.

AmplabJenkins · 2021-10-22T02:46:39Z

Can one of the admins verify this patch?

crflynn · 2021-12-22T18:53:37Z

Updated once more to resolve conflicts.

zero323 · 2021-12-22T19:51:31Z

In general it looks reasonable (it is common to want to reuse an inferred schema, for example). My only concern is that it is going to break user snapshot tests and possibly affect things like language-to-language migrations. So if it is going to be merged, it has to be included as a potentially breaking change in the release notes.
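
To make the snapshot-test concern concrete, here is a small sketch. `OldStyleField` and `NewStyleField` are hypothetical classes written for this illustration, not PySpark code. They differ only in `__repr__`, so a snapshot of the repr string changes while a snapshot of a structured form (here a JSON string) stays stable.

```python
import json


class OldStyleField:
    """Hypothetical field type with a Scala-style repr (pre-change behaviour)."""

    def __init__(self, name, type_name, nullable=True):
        self.name = name
        self.type_name = type_name
        self.nullable = nullable

    def __repr__(self):
        return "StructField(%s,%s,%s)" % (
            self.name, self.type_name, str(self.nullable).lower())

    def to_json(self):
        # Structured form that does not depend on how __repr__ is written.
        return json.dumps(
            {"name": self.name, "type": self.type_name, "nullable": self.nullable},
            sort_keys=True,
        )


class NewStyleField(OldStyleField):
    """Same data, but with a Python-style, eval-able repr."""

    def __repr__(self):
        return "StructField(%r, %r, %r)" % (self.name, self.type_name, self.nullable)


old = OldStyleField("f1", "string")
new = NewStyleField("f1", "string")

repr_snapshot_changed = repr(old) != repr(new)
json_snapshot_stable = old.to_json() == new.to_json()
```

Tests that compare schema objects directly, or a structured serialization such as the `json()` method on PySpark data types, should be less sensitive to a repr change than tests that compare repr strings.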

crflynn · 2021-12-29T18:25:18Z

Should we include a note in the migration_guide under pyspark docs?

crflynn · 2022-03-09T00:01:20Z

Just wanted to check in on the PR status here. If there is still some risk in merging, I could always release this as a separate package which patches in the reprs.

zero323 · 2022-03-12T16:56:05Z

cc @HyukjinKwon

srowen

Yeah, I don't know how to evaluate the risk of breaks. I think it's reasonable. I agree that a quick note in the migration guide would be safe.

HyukjinKwon · 2022-03-14T01:20:07Z

python/pyspark/pandas/spark/utils.py

-ArrayType(DecimalType(30,15),false),false),false),StructField(b,StringType,true))),true),\
-StructField(B,DecimalType(30,15),false)))
+    StructType([StructField('A',
+                            StructType([StructField('a',


Can we get rid of all this whitespace? Or at least use two-space indentation.

crflynn · 2022-03-15

I'll take another look. IIRC there was some nuance with doctests and wrapped code/whitespace that was difficult to work around, which is why it has the odd hanging indents here.
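
For context on the doctest and whitespace nuance, one standard-library option is the `NORMALIZE_WHITESPACE` directive, which treats any run of spaces and newlines as equivalent when comparing expected and actual output. The sketch below is illustrative only: `LONG_REPR`, `with_flag`, `without_flag`, and `run_doctest` are made-up names, and the thread does not say whether this is how the PR's doctests were eventually handled.

```python
import doctest

# A one-line repr-like string, standing in for a long nested schema repr.
LONG_REPR = (
    "StructType([StructField('A', StringType(), True), "
    "StructField('B', StringType(), True)])"
)


def with_flag():
    """
    >>> print(LONG_REPR)  # doctest: +NORMALIZE_WHITESPACE
    StructType([StructField('A', StringType(), True),
      StructField('B', StringType(), True)])
    """


def without_flag():
    """
    >>> print(LONG_REPR)
    StructType([StructField('A', StringType(), True),
      StructField('B', StringType(), True)])
    """


def run_doctest(func):
    """Run the doctest in func's docstring and return (failed, attempted)."""
    parser = doctest.DocTestParser()
    test = parser.get_doctest(
        func.__doc__, {"LONG_REPR": LONG_REPR}, func.__name__, None, 0)
    runner = doctest.DocTestRunner(verbose=False)
    # Discard the failure report; only the counts are needed here.
    results = runner.run(test, out=lambda text: None)
    return results.failed, results.attempted


flag_result = run_doctest(with_flag)
no_flag_result = run_doctest(without_flag)
```

With the directive, the wrapped expected output matches the single-line actual output; without it, the same example fails. Any indentation can then be used for the continuation lines, including the two-space indentation suggested in the review.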

HyukjinKwon · 2022-03-14T01:22:33Z

I kind of like this change, and I can't think of a case where it breaks something for now, so I am fine with this. Strictly speaking, yes, __repr__ should return something evaluable, though to the best of my knowledge that is not the case in many projects in practice.

HyukjinKwon · 2022-03-22T09:49:09Z

@crflynn mind running the dev/reformat-python script to reformat the code?

HyukjinKwon

Otherwise, LGTM from my side. It follows what __repr__ is supposed to be according to the official Python docs. cc @BryanCutler @viirya @ueshin FYI

BryanCutler

LGTM

viirya

Looks reasonable. It would be nice to add a note to the migration guide.

srowen · 2022-03-23T13:59:49Z

Merged to master/3.3

### What changes were proposed in this pull request?

These changes update the `__repr__` methods of type classes in `pyspark.sql.types` to print string representations which are `eval`-able. In other words, any instance of a `DataType` will produce a repr which can be passed to `eval()` to create an identical instance.

Similar changes previously submitted: #25495

### Why are the changes needed?

This [bug](https://issues.apache.org/jira/browse/SPARK-18621) has been around for a while. The current implementation returns a string representation which is valid in scala rather than python. These changes fix the repr to be valid with python.

The [motivation](https://docs.python.org/3/library/functions.html#repr) is "to return a string that would yield an object with the same value when passed to eval()".

### Does this PR introduce _any_ user-facing change?

Example:

Current implementation:

```python
from pyspark.sql.types import *

struct = StructType([StructField('f1', StringType(), True)])
repr(struct)
# StructType(List(StructField(f1,StringType,true)))
new_struct = eval(repr(struct))
# Traceback (most recent call last):
#   File "<input>", line 1, in <module>
#   File "<string>", line 1, in <module>
# NameError: name 'List' is not defined

struct_field = StructField('f1', StringType(), True)
repr(struct_field)
# StructField(f1,StringType,true)
new_struct_field = eval(repr(struct_field))
# Traceback (most recent call last):
#   File "<input>", line 1, in <module>
#   File "<string>", line 1, in <module>
# NameError: name 'f1' is not defined
```

With changes:

```python
from pyspark.sql.types import *

struct = StructType([StructField('f1', StringType(), True)])
repr(struct)
# StructType([StructField('f1', StringType(), True)])
new_struct = eval(repr(struct))
struct == new_struct
# True

struct_field = StructField('f1', StringType(), True)
repr(struct_field)
# StructField('f1', StringType(), True)
new_struct_field = eval(repr(struct_field))
struct_field == new_struct_field
# True
```

### How was this patch tested?

The changes include a test which asserts that an instance of each type is equal to the `eval` of its `repr`, as in the above example.

Closes #34320 from crflynn/sql-types-repr.

Lead-authored-by: flynn <crf204@gmail.com>
Co-authored-by: Flynn <crflynn@users.noreply.github.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
(cherry picked from commit c5ebdc6)
Signed-off-by: Sean Owen <srowen@gmail.com>

github-actions bot added CORE PYTHON SQL labels Oct 19, 2021

srowen reviewed Oct 19, 2021

crflynn force-pushed the sql-types-repr branch from a58314c to f1f2388 Compare October 19, 2021 19:47

crflynn added 3 commits October 19, 2021 15:49

make sql type reprs eval-able

177240a

fix indenting

d473f31

fix dataframe test

d710951

crflynn force-pushed the sql-types-repr branch from f1f2388 to 6c51de6 Compare October 19, 2021 19:49

fix doctests

622739f

crflynn force-pushed the sql-types-repr branch from 6c51de6 to 622739f Compare October 19, 2021 19:58

fix more doctests

392d751

github-actions bot added the ML label Oct 19, 2021

crflynn added 2 commits October 19, 2021 17:38

fix lint err

34e7a44

fix more doctests

b99254a

crflynn added 3 commits October 20, 2021 21:32

fix doctest output

aece8fa

fix pandas doctests

a02d86f

fix typehints docstrings

4615139

crflynn added 2 commits December 22, 2021 12:13

merge

5c18907

black fmt

58ddcd0

srowen reviewed Mar 12, 2022

HyukjinKwon changed the title ~~[SPARK-18621][PYTHON] make sql type reprs eval-able~~ [SPARK-18621][PYTHON] Make sql type reprs eval-able Mar 14, 2022

HyukjinKwon reviewed Mar 14, 2022

cleanup doctests

5cde057

HyukjinKwon approved these changes Mar 22, 2022

crflynn added 2 commits March 22, 2022 10:27

format

f6b495d

Merge branch 'master' into sql-types-repr

92677f8

BryanCutler approved these changes Mar 22, 2022

viirya approved these changes Mar 22, 2022

viirya reviewed Mar 22, 2022

add note in migration guide

9baa5b7

srowen approved these changes Mar 23, 2022

srowen closed this in c5ebdc6 Mar 23, 2022

memoryz mentioned this pull request Feb 23, 2023

fix: getTensorTypeFromSpark fails for Spark 3.3.0+ onnx/onnxmltools#607

Merged
