
[SPARK-38314][SQL] Fix of failing to read parquet files after writing the hidden file metadata in #35650

Closed

Yaohua628
Contributor

@Yaohua628 Yaohua628 commented Feb 24, 2022

What changes were proposed in this pull request?

Selecting a df that contains the hidden file metadata column _metadata and writing it into a file format like parquet or delta still keeps the internal Attribute metadata information in the written schema. When those parquet or delta files are read again, the read breaks, because Spark wrongly treats the _metadata column in the user data schema as a hidden file source metadata column.

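// prepare a file source df, then write it out together with the hidden _metadata column
df.select("*", "_metadata").write.format("parquet").save(path)

// read the written files back; this is the step that fails without the fix
spark.read.format("parquet").load(path).select("*").show()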

This PR fixes this by cleaning up any remaining metadata information of output columns.
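
To illustrate the idea of cleaning up column metadata, here is a minimal sketch at the schema level. It is not the change made in this PR. The helper name stripTopLevelMetadata and the metadata key example_internal_marker are made up for illustration and only stand in for whatever internal marker Spark attaches. The sketch clears metadata on top-level fields only and leaves nested fields untouched.

```scala
import org.apache.spark.sql.types.{Metadata, MetadataBuilder, StringType, StructField, StructType}

object StripFieldMetadata {
  // Hypothetical helper: return a copy of the schema whose top-level
  // fields carry no metadata. Names, types and nullability are kept.
  def stripTopLevelMetadata(schema: StructType): StructType =
    StructType(schema.fields.map(_.copy(metadata = Metadata.empty)))

  def main(args: Array[String]): Unit = {
    // "example_internal_marker" is an invented key used only for this demo.
    val marked = new MetadataBuilder()
      .putBoolean("example_internal_marker", true)
      .build()

    val schema = StructType(Seq(
      StructField("value", StringType),
      StructField("_metadata", StringType, nullable = true, metadata = marked)))

    val cleaned = stripTopLevelMetadata(schema)

    // Every top-level field now has empty metadata; field names are unchanged.
    assert(cleaned.fields.forall(_.metadata == Metadata.empty))
    assert(cleaned.fieldNames.sameElements(schema.fieldNames))
  }
}
```

Once the marker is gone, a reader of the written schema has no way to mistake the user column for a hidden one, which is the effect the description above asks for.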

Why are the changes needed?

Bugfix

Does this PR introduce any user-facing change?

No

How was this patch tested?

A new unit test.
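
For readers who want to check the behavior themselves, a round-trip check along the following lines might work. This is a sketch, not the unit test added in this PR. It assumes a Spark build that includes the fix and supports the _metadata column, and the object name, temp directory prefix, and sample rows are all invented for the example.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

object MetadataRoundTripCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("metadata-round-trip")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical scratch locations for the two writes.
    val base = Files.createTempDirectory("metadata-round-trip").toString
    val source = s"$base/source"
    val target = s"$base/target"

    // Write a small parquet dataset so that loading it gives a file source df.
    Seq(("a", 1), ("b", 2)).toDF("name", "age")
      .write.format("parquet").save(source)
    val df = spark.read.format("parquet").load(source)

    // Persist the hidden _metadata column next to the user columns.
    df.select("*", "_metadata").write.format("parquet").save(target)

    // With the fix, _metadata should come back as ordinary user data.
    val readBack = spark.read.format("parquet").load(target)
    assert(readBack.columns.contains("_metadata"))
    assert(readBack.select("*").collect().length == 2)

    spark.stop()
  }
}
```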

@github-actions github-actions bot added the SQL label Feb 24, 2022
@Yaohua628
Contributor Author

@cloud-fan please take a look whenever you have a chance, thanks a lot, appreciate the help!

@AmplabJenkins

Can one of the admins verify this patch?

@Yaohua628
Contributor Author

@cloud-fan hi Wenchen, are we ready to merge this fix? thanks!

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 50520fe Feb 28, 2022