[BUG] Iceberg multiple file readers can not read files if the file paths contain encoded URL unsafe chars #9697

Closed
firestarman opened this issue Nov 14, 2023 · 1 comment · Fixed by #9717

Comments

@firestarman
Collaborator

firestarman commented Nov 14, 2023

The multi-threaded reader and the coalescing reader for Iceberg fail to read a file if its path contains URL-encoded unsafe characters.
Here is a repro case.
First, use our bigDataGen to create an Iceberg table in the local database.

import org.apache.spark.sql.tests.datagen._

val dbgen = new DBGen()
dbgen.addTable("download_100b", """local_create_dt string, client_name string, other long""", 1000000L)
dbgen("download_100b")("local_create_dt").setSeedRange(0, 10)
dbgen("download_100b")("client_name").setSeedRange(0, 10)
val df = dbgen("download_100b").toDF(spark).repartition(col("local_create_dt"), col("client_name"))

spark.sql("""CREATE TABLE local.download_100b (local_create_dt string, client_name string, other long) using iceberg partitioned by (local_create_dt, client_name)""")
df.writeTo("local.download_100b").append()

Then launch a GPU-enabled spark-shell and read the table with an Iceberg scan.

sql("select * from local.download_100b").show

It fails with the error below.

23/11/14 08:08:37 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0) (liangcail-ubuntu20 executor driver): java.util.concurrent.ExecutionException: java.io.FileNotFoundException: File /data/tmp/local/download_100b/data/local_create_dt=)>tkiudF4</client_name=>{%nzw`HO*/00000-12-09af04cb-4423-49ef-a925-b85c0de28c39-00001.parquet does not exist
	at java.util.concurrent.FutureTask.report(FutureTask.java:122)
	at java.util.concurrent.FutureTask.get(FutureTask.java:192)
	at com.nvidia.spark.rapids.MultiFileCoalescingPartitionReaderBase.$anonfun$readPartFiles$7(GpuMultiFileReader.scala:1135)
	at com.nvidia.spark.rapids.MultiFileCoalescingPartitionReaderBase.$anonfun$readPartFiles$7$adapted(GpuMultiFileReader.scala:1134)
	at scala.collection.Iterator.foreach(Iterator.scala:943)
	at scala.collection.Iterator.foreach$(Iterator.scala:943)

The actual file path on the local disk is /data/tmp/local/download_100b/data/local_create_dt=%29%3EtkiudF4%3C/client_name=%3E%7B%25nzw%60HO*/00000-12-09af04cb-4423-49ef-a925-b85c0de28c39-00001.parquet

@firestarman firestarman added bug Something isn't working ? - Needs Triage Need team to review and classify labels Nov 14, 2023
@firestarman firestarman changed the title [BUG] Iceberg multiple file readers can not read files if the file pathes contain encoded URL unsafe chars [BUG] Iceberg multiple file readers can not read files if the file paths contain encoded URL unsafe chars Nov 14, 2023
@firestarman firestarman self-assigned this Nov 14, 2023
@firestarman
Collaborator Author

firestarman commented Nov 14, 2023

After some investigation, I found that the file path from Iceberg is the same as the actual on-disk path, whereas the file path from Spark is URL-encoded a second time, even though the Iceberg writer has already URL-encoded the original. Taking one segment of the path as an example, the conversion from disk to the multi-file readers is:
For Spark: %29%3EtkiudF4%3C -> %2529%253EtkiudF4%253C -> %29%3EtkiudF4%3C
For Iceberg: %29%3EtkiudF4%3C -> %29%3EtkiudF4%3C -> )>tkiudF4<

So the Spark readers can read the file but the Iceberg multi-file readers cannot, because the path string has changed by the time it reaches them. It seems we should not decode the file path for Iceberg reads.
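
To make the two conversions above concrete, here is a minimal sketch that reproduces them with plain java.net.URI. It is only an illustration of the encode/decode asymmetry, not the plugin's or Iceberg's actual code path: the object name PathRoundTrip and its helper methods are hypothetical, and the sample path is shortened to a single partition segment taken from this report.

```scala
import java.net.URI

// Hypothetical helper, only for illustrating the behavior described in this issue.
object PathRoundTrip {
  // One partition directory as it exists on disk (already URL-encoded by the writer).
  val onDisk = "/data/local_create_dt=%29%3EtkiudF4%3C"

  // Spark-like flow: the string is escaped into a URI first, then decoded once.
  // The multi-argument URI constructor always quotes '%', so "%29" becomes "%2529".
  // Returns (escaped form, decoded form).
  def encodeThenDecode(path: String): (String, String) = {
    val uri = new URI(null, null, path, null)
    (uri.getRawPath, uri.getPath)
  }

  // Iceberg multi-reader flow as described above: the string is parsed as if it
  // were already escaped, so decoding it changes the on-disk name.
  def decodeOnly(path: String): String = new URI(path).getPath
}
```

With these helpers, encodeThenDecode(onDisk) ends up back at the on-disk string, while decodeOnly(onDisk) yields a path containing )>tkiudF4<, which does not exist on disk and matches the FileNotFoundException above.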
