The multi-threaded reader and the coalescing reader for Iceberg fail to read a file if its path contains URL-encoded unsafe characters.
Here is a repro case.
First, create an Iceberg table in the local database using our bigDataGen.
import org.apache.spark.sql.tests.datagen._
val dbgen = new DBGen()
dbgen.addTable("download_100b", """local_create_dt string, client_name string, other long""", 1000000L)
dbgen("download_100b")("local_create_dt").setSeedRange(0, 10)
dbgen("download_100b")("client_name").setSeedRange(0, 10)
val df = dbgen("download_100b").toDF(spark).repartition(col("local_create_dt"), col("client_name"))
spark.sql("""CREATE TABLE local.download_100b (local_create_dt string, client_name string, other long) using iceberg partitioned by (local_create_dt, client_name)""")
df.writeTo("local.download_100b").append()
Then launch a GPU-enabled spark-shell and read the table with an Iceberg scan.
sql("select * from local.download_100b").show
It fails with the error below.
23/11/14 08:08:37 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0) (liangcail-ubuntu20 executor driver): java.util.concurrent.ExecutionException: java.io.FileNotFoundException: File /data/tmp/local/download_100b/data/local_create_dt=)>tkiudF4</client_name=>{%nzw`HO*/00000-12-09af04cb-4423-49ef-a925-b85c0de28c39-00001.parquet does not exist
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.util.concurrent.FutureTask.get(FutureTask.java:192)
at com.nvidia.spark.rapids.MultiFileCoalescingPartitionReaderBase.$anonfun$readPartFiles$7(GpuMultiFileReader.scala:1135)
at com.nvidia.spark.rapids.MultiFileCoalescingPartitionReaderBase.$anonfun$readPartFiles$7$adapted(GpuMultiFileReader.scala:1134)
at scala.collection.Iterator.foreach(Iterator.scala:943)
at scala.collection.Iterator.foreach$(Iterator.scala:943)
The actual file path on the local disk is /data/tmp/local/download_100b/data/local_create_dt=%29%3EtkiudF4%3C/client_name=%3E%7B%25nzw%60HO*/00000-12-09af04cb-4423-49ef-a925-b85c0de28c39-00001.parquet
firestarman changed the title from "[BUG] Iceberg multiple file readers can not read files if the file pathes contain encoded URL unsafe chars" to "[BUG] Iceberg multiple file readers can not read files if the file paths contain encoded URL unsafe chars" on Nov 14, 2023.
After some investigation, I found that the file path from Iceberg is the same as the actual path on disk, but the file path from Spark is URL-encoded again, even though the original is already URL-encoded by the Iceberg writer. For one component of the file path, the conversion from disk to the multi-file readers is:
For Spark: %29%3EtkiudF4%3C -> %2529%253EtkiudF4%253C -> %29%3EtkiudF4%3C
For Iceberg: %29%3EtkiudF4%3C -> %29%3EtkiudF4%3C -> )>tkiudF4<
So the Spark readers can read the file, but the Iceberg multi-file readers cannot, because the path string they end up with no longer matches the path on disk. It seems we should not decode the file path for Iceberg reads.
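The two conversion chains above can be reproduced with plain `java.net.URI`. This is only a minimal sketch of the encode/decode mismatch, not the plugin's actual code path: the object name `PathEncodingSketch` and the helpers `encode` and `decode` are hypothetical, and the assumption is that the readers' handling behaves like a URI encode followed by a URI decode.

```scala
import java.net.URI

// Hypothetical sketch: models the path handling as "encode once, decode once".
object PathEncodingSketch {
  // The multi-argument URI constructor always quotes '%', so a name that is
  // already percent-encoded on disk gets encoded a second time.
  def encode(path: String): String = new URI(null, null, path, null).toString

  // Parsing a URI string and calling getPath undoes one level of encoding.
  def decode(uriString: String): String = new URI(uriString).getPath

  // One directory name as it exists on disk (already encoded by the Iceberg writer).
  val onDisk = "/%29%3EtkiudF4%3C"

  // Spark-style chain: encoded again, then decoded, so it round-trips.
  val sparkSeen: String = decode(encode(onDisk))

  // Iceberg-style chain: passed through unchanged, then decoded anyway.
  val icebergSeen: String = decode(onDisk)

  def main(args: Array[String]): Unit = {
    println(s"on disk      : $onDisk")
    println(s"spark chain  : ${encode(onDisk)} -> $sparkSeen")
    println(s"iceberg chain: $onDisk -> $icebergSeen")
  }
}
```

In this sketch the Spark-style chain ends with the same string that exists on disk, while the Iceberg-style chain ends with the raw characters, which matches the path in the FileNotFoundException above.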