[PYTHON][SQL][WIP] repr(schema) and schema.toString produce runnable code #25495

Closed
wants to merge 2 commits into from

dougbateman

What changes were proposed in this pull request?

repr(schema) produces runnable Python code
schema.toString produces runnable Scala code

Why are the changes needed?

Previously, schema.toString produced Scala code that wasn't runnable because field names weren't quoted. Even worse, repr(schema) in Python produced the same non-runnable Scala code. This PR resolves both issues, so that runnable Scala and Python code are both available.

Does this PR introduce any user-facing change?

Yes, see above.

How was this patch tested?

pyspark/sql/tests/test_types.py now has test_repr()

@@ -49,7 +49,7 @@ case class StructField(
}

// override the default toString to be compatible with legacy parquet files.
- override def toString: String = s"StructField($name,$dataType,$nullable)"
+ override def toString: String = s"""StructField("$name",$dataType,$nullable)"""
Member

The Scala side doesn't have a repr contract that makes the output reconstructable, as the Python side does.
Also, I think this can't handle a " character in the middle of a field name, for instance.

Author

I tested the code to confirm this works. The triple-quoted string (""") allows the embedded " to work correctly.

True, Scala doesn't have a repr contract for runnable code. However, having toString produce runnable code here has a real use-case for users. Users can inferSchema, get the generated schema code, tweak it as needed, and then provide the schema in the future. I've needed this many times in my projects. I'll make sure I add this to the PR comment. And also open a JIRA.
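
The quoting concern raised in this thread can be illustrated with a small Python analogue. This is only a sketch, not the Scala code under review: `naive_field_repr`, `escaped_field_repr`, and `parses` are hypothetical helpers, and the field name `a"b` is made up for illustration. Wrapping a name in literal quotes yields source that no longer parses once the name itself contains a quote, whereas delegating to `repr()` escapes it.

```python
import ast


def naive_field_repr(name: str) -> str:
    # Hypothetical helper: wraps the name in double quotes without escaping.
    return f'StructField("{name}")'


def escaped_field_repr(name: str) -> str:
    # Hypothetical helper: lets repr() pick the quotes and escape as needed.
    return f"StructField({name!r})"


def parses(source: str) -> bool:
    # True if the string is a syntactically valid Python expression.
    try:
        ast.parse(source, mode="eval")
        return True
    except SyntaxError:
        return False


tricky = 'a"b'  # made-up field name containing a double quote
print(parses(naive_field_repr("f1")))      # plain names are fine either way
print(parses(naive_field_repr(tricky)))    # the unescaped quote breaks the source
print(parses(escaped_field_repr(tricky)))  # repr() keeps it parseable
```

Whether the same failure applies to the Scala change depends on how the name is emitted; the sketch only shows why escaping matters for any output that is meant to be pasted back as code.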

@HyukjinKwon
Member

Can you file a JIRA please?

@dongjoon-hyun dongjoon-hyun changed the title [WIP] repr(schema) and schema.toString produce runnable code [PYSPARK][SQL][WIP] repr(schema) and schema.toString produce runnable code Aug 19, 2019
@dongjoon-hyun dongjoon-hyun changed the title [PYSPARK][SQL][WIP] repr(schema) and schema.toString produce runnable code [PYTHON][SQL][WIP] repr(schema) and schema.toString produce runnable code Aug 19, 2019
@AmplabJenkins

Can one of the admins verify this patch?

@github-actions

We're closing this PR because it hasn't been updated in a while.
This isn't a judgement on the merit of the PR in any way. It's just
a way of keeping the PR queue manageable.

If you'd like to revive this PR, please reopen it!

@github-actions github-actions bot added the Stale label Dec 26, 2019
@github-actions github-actions bot closed this Dec 27, 2019
srowen pushed a commit that referenced this pull request Mar 23, 2022
### What changes were proposed in this pull request?

These changes update the `__repr__` methods of type classes in `pyspark.sql.types` to print string representations which are `eval`-able. In other words, any instance of a `DataType` will produce a repr which can be passed to `eval()` to create an identical instance.

Similar changes previously submitted: #25495

### Why are the changes needed?

This [bug](https://issues.apache.org/jira/browse/SPARK-18621) has been around for a while. The current implementation returns a string representation which is valid in Scala rather than Python. These changes make the repr valid Python.

The [motivation](https://docs.python.org/3/library/functions.html#repr) is "to return a string that would yield an object with the same value when passed to eval()".
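
As a minimal sketch of what an `eval`-able `__repr__` can look like, the classes below are simplified stand-ins written for illustration; they are not the `pyspark.sql.types` implementations, and their `__eq__` methods are assumptions made so the round trip can be checked. The key idea is to format every constructor argument with `!r`, so strings come out quoted and booleans print as `True`/`False`.

```python
class StringType:
    # Stand-in for a leaf data type; NOT the pyspark class.
    def __eq__(self, other):
        return isinstance(other, StringType)

    def __repr__(self):
        return "StringType()"


class StructField:
    # Stand-in for a named, typed field; NOT the pyspark class.
    def __init__(self, name, dataType, nullable=True):
        self.name = name
        self.dataType = dataType
        self.nullable = nullable

    def __eq__(self, other):
        return isinstance(other, StructField) and vars(self) == vars(other)

    def __repr__(self):
        # !r applies repr() to each part, so the name is quoted and escaped
        # and the nested type prints as a constructor call.
        return f"StructField({self.name!r}, {self.dataType!r}, {self.nullable!r})"


field = StructField("f1", StringType(), True)
print(repr(field))
# StructField('f1', StringType(), True)

# Evaluate with an explicit namespace so only the intended names are visible.
names = {"StructField": StructField, "StringType": StringType}
assert eval(repr(field), names) == field
```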

### Does this PR introduce _any_ user-facing change?

Example:

Current implementation:

```python
from pyspark.sql.types import *

struct = StructType([StructField('f1', StringType(), True)])
repr(struct)
# StructType(List(StructField(f1,StringType,true)))
new_struct = eval(repr(struct))
# Traceback (most recent call last):
#   File "<input>", line 1, in <module>
#   File "<string>", line 1, in <module>
# NameError: name 'List' is not defined

struct_field = StructField('f1', StringType(), True)
repr(struct_field)
# StructField(f1,StringType,true)
new_struct_field = eval(repr(struct_field))
# Traceback (most recent call last):
#   File "<input>", line 1, in <module>
#   File "<string>", line 1, in <module>
# NameError: name 'f1' is not defined
```

With changes:

```python
from pyspark.sql.types import *

struct = StructType([StructField('f1', StringType(), True)])
repr(struct)
# StructType([StructField('f1', StringType(), True)])
new_struct = eval(repr(struct))
struct == new_struct
# True

struct_field = StructField('f1', StringType(), True)
repr(struct_field)
# StructField('f1', StringType(), True)
new_struct_field = eval(repr(struct_field))
struct_field == new_struct_field
# True
```

### How was this patch tested?

The changes include a test which asserts that an instance of each type is equal to the `eval` of its `repr`, as in the above example.
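
The round-trip check described above can be sketched generically. The helper name `repr_round_trips` is hypothetical, and the samples are standard-library values whose reprs already follow the convention; in the actual test suite the samples would presumably be instances of the `pyspark.sql.types` classes.

```python
import datetime
from decimal import Decimal
from fractions import Fraction


def repr_round_trips(obj, namespace):
    # Evaluate repr(obj) with only the supplied names in scope,
    # then compare the rebuilt object with the original.
    rebuilt = eval(repr(obj), dict(namespace))
    return rebuilt == obj


# Names that the reprs of the samples refer to.
namespace = {"datetime": datetime, "Decimal": Decimal, "Fraction": Fraction}

samples = [
    datetime.date(2022, 3, 23),
    Decimal("1.50"),
    Fraction(1, 3),
    ["nested", {"key": (1, 2.5, None)}],
]

for sample in samples:
    assert repr_round_trips(sample, namespace), repr(sample)


class Opaque:
    # Keeps the default object repr, which is not valid Python source.
    pass


try:
    repr_round_trips(Opaque(), {})
except SyntaxError:
    print("default repr is not eval-able")
```

Passing an explicit namespace keeps the check from depending on whatever happens to be imported in the test module.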

Closes #34320 from crflynn/sql-types-repr.

Lead-authored-by: flynn <crf204@gmail.com>
Co-authored-by: Flynn <crflynn@users.noreply.github.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
srowen pushed a commit that referenced this pull request Mar 23, 2022
(cherry picked from commit c5ebdc6)
Signed-off-by: Sean Owen <srowen@gmail.com>