[BUG] Iceberg multiple file readers can not read files if the file paths contain encoded URL unsafe chars #9697

firestarman · 2023-11-14T08:13:07Z

The multi-threaded reader and the coalescing reader for Iceberg fail to read a file if its path contains URL-encoded unsafe characters.
Here is a repro case.
First, create an Iceberg table in the local database using our bigDataGen.

import org.apache.spark.sql.tests.datagen._

val dbgen = new DBGen()
dbgen.addTable("download_100b", """local_create_dt string, client_name string, other long""", 1000000L)
dbgen("download_100b")("local_create_dt").setSeedRange(0, 10)
dbgen("download_100b")("client_name").setSeedRange(0, 10)
val df = dbgen("download_100b").toDF(spark).repartition(col("local_create_dt"), col("client_name"))

spark.sql("""CREATE TABLE local.download_100b (local_create_dt string, client_name string, other long) using iceberg partitioned by (local_create_dt, client_name)""")
df.writeTo("local.download_100b").append()

Then launch a GPU-enabled spark-shell and read the table with an Iceberg scan.

sql("select * from local.download_100b").show

It fails with the error below.

23/11/14 08:08:37 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0) (liangcail-ubuntu20 executor driver): java.util.concurrent.ExecutionException: java.io.FileNotFoundException: File /data/tmp/local/download_100b/data/local_create_dt=)>tkiudF4</client_name=>{%nzw`HO*/00000-12-09af04cb-4423-49ef-a925-b85c0de28c39-00001.parquet does not exist
	at java.util.concurrent.FutureTask.report(FutureTask.java:122)
	at java.util.concurrent.FutureTask.get(FutureTask.java:192)
	at com.nvidia.spark.rapids.MultiFileCoalescingPartitionReaderBase.$anonfun$readPartFiles$7(GpuMultiFileReader.scala:1135)
	at com.nvidia.spark.rapids.MultiFileCoalescingPartitionReaderBase.$anonfun$readPartFiles$7$adapted(GpuMultiFileReader.scala:1134)
	at scala.collection.Iterator.foreach(Iterator.scala:943)
	at scala.collection.Iterator.foreach$(Iterator.scala:943)

The actual file path on local disk is /data/tmp/local/download_100b/data/local_create_dt=%29%3EtkiudF4%3C/client_name=%3E%7B%25nzw%60HO*/00000-12-09af04cb-4423-49ef-a925-b85c0de28c39-00001.parquet


firestarman · 2023-11-14T08:39:20Z

After some investigation, I found that the file path from Iceberg is the same as the actual path on disk, whereas the file path from Spark is URL-encoded a second time, even though the original path was already URL-encoded by the Iceberg writer. Taking one path component as an example, the conversion from the name on disk, to the path string handed to the multi-file readers, to the path the readers finally open is:
For Spark: %29%3EtkiudF4%3C -> %2529%253EtkiudF4%253C -> %29%3EtkiudF4%3C
For Iceberg: %29%3EtkiudF4%3C -> %29%3EtkiudF4%3C -> )>tkiudF4<

So the Spark readers can read the file, but the Iceberg multi-file readers cannot, because the decoded path string no longer matches the path on disk. It seems we should not decode the file path for Iceberg reads.
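The two conversions can be reproduced outside Spark with a minimal sketch that uses only java.net.URI. It assumes the readers decode the path string the same way URI.getPath does; the object name PathRoundTrip and the encode/decode helpers are hypothetical and are not part of the plugin.

```scala
import java.net.URI

object PathRoundTrip {
  // Hypothetical helper: build a URI string from a raw path. The multi-argument
  // java.net.URI constructors always quote '%', so "%29" becomes "%2529".
  def encode(rawPath: String): String = new URI(null, null, rawPath, null).toString

  // Hypothetical helper: parse a URI string and return its decoded path.
  def decode(uriString: String): String = new URI(uriString).getPath

  def main(args: Array[String]): Unit = {
    // Shortened version of the directory name from the repro above.
    val onDisk = "/data/local_create_dt=%29%3EtkiudF4%3C"

    // Spark-like flow: encode once more, then decode. The on-disk name survives.
    println(encode(onDisk))          // /data/local_create_dt=%2529%253EtkiudF4%253C
    println(decode(encode(onDisk)))  // /data/local_create_dt=%29%3EtkiudF4%3C

    // Iceberg-like flow: the path arrives as-is and is decoded anyway,
    // producing a name that does not exist on disk.
    println(decode(onDisk))          // /data/local_create_dt=)>tkiudF4<
  }
}
```

Decoding is only safe when the string was encoded exactly once beforehand, so either the decode has to be skipped for Iceberg paths or the Iceberg path has to be encoded before it reaches the readers.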

firestarman added bug Something isn't working ? - Needs Triage Need team to review and classify labels Nov 14, 2023

firestarman changed the title ~~[BUG] Iceberg multiple file readers can not read files if the file pathes contain encoded URL unsafe chars~~ [BUG] Iceberg multiple file readers can not read files if the file paths contain encoded URL unsafe chars Nov 14, 2023

firestarman self-assigned this Nov 14, 2023

mattahrens removed the ? - Needs Triage Need team to review and classify label Nov 14, 2023

firestarman mentioned this issue Nov 15, 2023

Encode the file path from Iceberg when converting to a PartitionedFile [databricks] #9717

Merged

jlowe closed this as completed in #9717 Nov 15, 2023
