
[SPARK-43123][SQL] Internal field metadata should not be leaked to catalogs #40776

Closed
wants to merge 2 commits

Conversation

cloud-fan
Contributor

What changes were proposed in this pull request?

In Spark, we have defined some internal field metadata to help with query resolution and compilation. For example, quite a few field metadata keys are related to metadata columns.

However, when we create tables, this internal field metadata can be leaked. This PR updates the CTAS/RTAS commands to remove the internal field metadata before creating tables. The CREATE/REPLACE TABLE command is fine, as users can't generate the internal field metadata via the type string.
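The cleanup step can be sketched in plain Python. This is an illustrative model only: the dict-based schema and the literal key strings below are assumptions standing in for Spark's `StructType` and its internal metadata key constants (`METADATA_COL_ATTR_KEY`, `QUALIFIED_ACCESS_ONLY`).

```python
# Hypothetical stand-ins for Spark's internal metadata keys; "__metadata_col"
# is the marker discussed in this PR, the other value is assumed.
INTERNAL_METADATA_KEYS = {"__metadata_col", "__qualified_access_only"}

def remove_internal_metadata(schema):
    """Return a copy of the schema with internal metadata keys stripped
    from every field, leaving user-provided metadata intact."""
    cleaned = []
    for field in schema:
        metadata = {k: v for k, v in field.get("metadata", {}).items()
                    if k not in INTERNAL_METADATA_KEYS}
        cleaned.append({**field, "metadata": metadata})
    return cleaned

# A schema as a query plan might produce it: one leaked internal key,
# one piece of legitimate user metadata.
schema = [
    {"name": "id", "type": "long", "metadata": {"__metadata_col": True}},
    {"name": "data", "type": "string", "metadata": {"comment": "user data"}},
]
cleaned = remove_internal_metadata(schema)
```

After cleanup, `cleaned[0]["metadata"]` is empty while the user's `comment` on `data` survives, which mirrors what the PR does before handing the schema to a catalog.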

Why are the changes needed?

To avoid potential issues, such as mistakenly treating a data column as a metadata column.

Does this PR introduce any user-facing change?

no

How was this patch tested?

new test

@github-actions github-actions bot added the SQL label Apr 13, 2023
@cloud-fan
Contributor Author

cc @gengliangwang

def removeInternalMetadata(schema: StructType): StructType = {
  StructType(schema.map { field =>
    val newMetadata = new MetadataBuilder().withMetadata(field.metadata)
      .remove(METADATA_COL_ATTR_KEY)
      .build()
    field.copy(metadata = newMetadata)
  })
}

@gengliangwang gengliangwang Apr 14, 2023


Nit: shall we define an array for all the internal metadata outside the method?
E.g.

val internalMetaData = Seq(
  METADATA_COL_ATTR_KEY,
  QUALIFIED_ACCESS_ONLY,
  ...


@gengliangwang gengliangwang left a comment


Nice catch

@cloud-fan
Contributor Author

cloud-fan commented Apr 14, 2023

thanks for the review, merging to master!

@cloud-fan cloud-fan closed this in 4c938d6 Apr 14, 2023
val AUTO_GENERATED_ALIAS = "__autoGeneratedAlias"

val INTERNAL_METADATA_KEYS = Seq(
  AUTO_GENERATED_ALIAS,
  METADATA_COL_ATTR_KEY,
  QUALIFIED_ACCESS_ONLY
)
Member


Are these metadata keys only in the top level columns?


vkorukanti pushed a commit to delta-io/delta that referenced this pull request Apr 20, 2023
[SPARK-43123](apache/spark#40776) fixed an issue where Spark could leak internal metadata, which caused Delta to store Spark's internal metadata in its table schema.

Spark's internal metadata may trigger special behaviors. For example, if a column's metadata contains `__metadata_col`, the column cannot be selected by star. If we leak `__metadata_col` into any column of a Delta table, that column can no longer be queried using `SELECT *`.
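The failure mode described above can be modeled in a few lines of plain Python. This is a hedged illustration, not Spark or Delta source: `expand_star` and the dict-based schema are hypothetical stand-ins for Spark's star expansion and `StructType`.

```python
def expand_star(schema):
    """Model of '*' expansion: any field carrying the "__metadata_col"
    marker is treated as a metadata column and excluded."""
    return [f["name"] for f in schema
            if not f.get("metadata", {}).get("__metadata_col")]

# A Delta table whose "id" data column accidentally persisted the marker.
leaked = [
    {"name": "id", "type": "long", "metadata": {"__metadata_col": True}},
    {"name": "data", "type": "string", "metadata": {}},
]
print(expand_star(leaked))  # → ['data']: the tagged "id" column vanishes
```

Once the marker is stripped from the persisted schema, `id` reappears in the star expansion, which is exactly why the extra cleanup step below is needed for already-written tables.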

Although [SPARK-43123](apache/spark#40776) fixes the issue in new Spark versions, internal metadata may already have been persisted in some Delta tables. To make these Delta tables readable again, this PR adds an extra step to clean up internal metadata before returning the table schema to Spark.

GitOrigin-RevId: 60eb4046d55e955379c98e409993b33e753c5256
3 participants