
Spark DataFrame write fails if input dataframe has columns in different order than iceberg schema #741

Open
ravichinoy opened this issue Jan 17, 2020 · 4 comments

@ravichinoy (Contributor) commented Jan 17, 2020

For this test case, https://github.com/apache/incubator-iceberg/blob/6f28abfa62838d531be4faa93273965665af933d/spark/src/test/java/org/apache/iceberg/spark/source/TestPartitionValues.java

if I replace https://github.com/apache/incubator-iceberg/blob/6f28abfa62838d531be4faa93273965665af933d/spark/src/test/java/org/apache/iceberg/spark/source/TestPartitionValues.java#L135 with

```java
df.select("data", "id").write()
```

then the test case fails with the error below:

```
java.lang.IllegalArgumentException: Cannot write incompatible dataset to table with schema:
table {
  1: id: optional int
  2: data: optional string
}
Problems:
* data is out of order, before id
```
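
For reference, a fuller sketch of the failing write, assuming the SimpleRecord(id, data) helper and the temporary `location` used by the test (the surrounding SparkSession setup is elided):

```java
import java.util.Arrays;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Table schema is (1: id, 2: data); the DataFrame below presents (data, id).
Dataset<Row> df = spark.createDataFrame(
    Arrays.asList(new SimpleRecord(1, "a"), new SimpleRecord(2, "b")),
    SimpleRecord.class);

// Selecting the columns in reverse order triggers the ordering check:
df.select("data", "id")
    .write()
    .format("iceberg")
    .mode("append")
    .save(location.toString());  // throws the IllegalArgumentException above
```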

However, if I set checkOrdering to false in CheckCompatibility (https://github.com/apache/incubator-iceberg/blob/949c6a98ac80acec10568070772082c1178eb739/api/src/main/java/org/apache/iceberg/types/CheckCompatibility.java), the write succeeds but the test then fails with a data mismatch:

```
Result rows should match expected:<[{"id"=1,"data"="a"}, {"id"=2,"data"="b"}, {"id"=3,"data"="c"}, {"id"=4,"data"="null"}]> but was:<[{"id"=1,"data"=""}, {"id"=2,"data"=""}, {"id"=3,"data"=""}, {"id"=4,"data"="�"}]>
Expected :[{"id"=1,"data"="a"}, {"id"=2,"data"="b"}, {"id"=3,"data"="c"}, {"id"=4,"data"="null"}]
Actual   :[{"id"=1,"data"=""}, {"id"=2,"data"=""}, {"id"=3,"data"=""}, {"id"=4,"data"="�"}]
```

This is because the PartitionSpec accessors are built from the Iceberg table schema. If we update the code to build the accessors from the input schema instead, the reordered-column test case passes.
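
To make the failure mode concrete, here is a plain-Java sketch (no Iceberg APIs, purely illustrative) of why a position-based accessor built from the table schema reads the wrong slot once the input columns are reordered:

```java
// Table schema order:  (id, data)  -> an accessor for "id" resolves to position 0
// Input schema order:  (data, id)  -> but the incoming rows store "data" at 0
Object[] tableOrderRow = {1, "a"};
Object[] inputOrderRow = {"a", 1};

int idPosition = 0;  // built from the table schema

System.out.println(tableOrderRow[idPosition]);  // 1   -> correct
System.out.println(inputOrderRow[idPosition]);  // "a" -> wrong column, which is
                                                //        how the empty/garbled
                                                //        values above arise
```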

This is shown in the PR below:

#745

We are trying to understand whether there is a specific reason that checkOrdering is not exposed as a parameter (so it could be set to false), and that the PartitionSpec accessors are built from the table schema instead of the input schema.

If possible, we would like checkOrdering to be exposed as a configurable parameter so that it can be turned off and write jobs do not have to use the same column ordering as the Iceberg table.
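
For example, the knob we have in mind could be passed as a write option; the option name below is hypothetical, not an existing API:

```java
df.select("data", "id")
    .write()
    .format("iceberg")
    .option("check-ordering", "false")  // hypothetical option name
    .mode("append")
    .save(location.toString());
```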

@holdenk (Contributor) commented Sep 29, 2020

Was this resolved in #745?

sunchao pushed a commit to sunchao/iceberg that referenced this issue May 9, 2023
@amitmittal5 commented

Hello, is this issue resolved? I am still getting this issue in iceberg 1.4.2 while trying to write in iceberg format to ADLS using spark-streaming.

@amitmittal5 commented Jan 3, 2024

> Hello, is this issue resolved? I am still getting this issue in iceberg 1.4.2 while trying to write in iceberg format to ADLS using spark-streaming.

It was actually resolved earlier, but I had overlooked it. To ignore the column ordering, set the Spark write config "spark.sql.iceberg.check-ordering" to "false". It is read in org.apache.iceberg.spark.source.SparkWriteBuilder.validateOrMergeWriteSchema(); if it is not provided, the default is true and the ordering is checked.
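
A minimal sketch of applying that config, assuming Spark 3.x with the Iceberg runtime on the classpath; the source table name and target location below are illustrative:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder()
    .appName("check-ordering-example")
    .getOrCreate();

// Disable Iceberg's column-order check for this session; SparkWriteBuilder
// then matches columns to the table schema by name instead of failing on order.
spark.conf().set("spark.sql.iceberg.check-ordering", "false");

// Columns deliberately reordered relative to the table schema (id, data):
Dataset<Row> df = spark.table("source_table").select("data", "id");

df.write()
    .format("iceberg")
    .mode("append")
    .save("/tmp/warehouse/t");  // illustrative table location
```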

github-actions bot commented Jul 2, 2024

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in the next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the 'not-stale' label, but commenting on the issue is preferred when possible.

github-actions bot added the stale label on Jul 2, 2024