
[BUG] loading SPARK-32639 example parquet file triggers a JVM crash #1576

Closed
gerashegalov opened this issue Jan 25, 2021 · 1 comment · Fixed by #1661
Labels
bug Something isn't working

Comments

@gerashegalov
Collaborator

Describe the bug
Replaying the scenario from #1463 (SPARK-32639) in a GPU-enabled Scala Spark or pyspark shell results in a SIGSEGV in the executor call path:

C  [cudf_io5390148209696334138.so+0x170830]  cudf::io::detail::parquet::reader::impl::decode_page_data(hostdevice_vector<cudf::io::parquet::gpu::ColumnChunkDesc>&, hostdevice_vector<cudf::io::parquet::gpu::PageInfo>&, hostdevice_vector<cudf::io::parquet::gpu::PageNestingInfo>&, unsigned long, unsigned long, rmm::cuda_stream_view)+0x4b0

Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
j  ai.rapids.cudf.Table.readParquet([Ljava/lang/String;Ljava/lang/String;JJIZ)[J+0
j  ai.rapids.cudf.Table.readParquet(Lai/rapids/cudf/ParquetOptions;Lai/rapids/cudf/HostMemoryBuffer;JJ)Lai/rapids/cudf/Table;+122
j  com.nvidia.spark.rapids.MultiFileParquetPartitionReader.$anonfun$readToTable$1(Lai/rapids/cudf/ParquetOptions;Lai/rapids/cudf/HostMemoryBuffer;JLcom/nvidia/spark/rapids/NvtxWithMetrics;)Lai/rapids/cudf/Table;+4
j  com.nvidia.spark.rapids.MultiFileParquetPartitionReader$$Lambda$2380.apply(Ljava/lang/Object;)Ljava/lang/Object;+16
j  com.nvidia.spark.rapids.Arm.withResource(Ljava/lang/AutoCloseable;Lscala/Function1;)Ljava/lang/Object;+2

Steps/Code to reproduce bug

  1. Start a (py)Spark REPL with the RAPIDS plugin and a GPU enabled

  2. Load the file attached to SPARK-32639:

spark.read.schema('value MAP<STRUCT<first:STRING, middle:STRING, last:STRING>, STRING>').parquet("/home/gshegalov/gits/spark-rapids/integration_tests/src/test/resources/SPARK-32639/000.snappy.parquet").take(1)

Expected behavior
With Spark 3.1.1 RC, the data loads fine on the CPU and should load the same on the GPU:

>>> spark.read.schema('value MAP<STRUCT<first:STRING, middle:STRING, last:STRING>, STRING>').parquet("/home/gshegalov/gits/spark-rapids/integration_tests/src/test/resources/SPARK-32639/000.snappy.parquet").take(1)
[Row(value={Row(first='John', middle='Y.', last='Doe'): 'brother'})]

Environment details (please complete the following information)

  • Environment location: Local master/shell
  • Spark configuration settings related to the issue:
	--conf spark.plugins=com.nvidia.spark.SQLPlugin \
	--conf spark.rapids.sql.enabled=true \
	--jars  ${SPARK_CUDF_JAR},${SPARK_RAPIDS_PLUGIN_JAR}
@gerashegalov added the "bug (Something isn't working)" and "? - Needs Triage (Need team to review and classify)" labels on Jan 25, 2021
gerashegalov added a commit that referenced this issue Jan 26, 2021
Add a test documenting the scenario failing on Spark (CPU) prior to Spark 3.1.0; closes #1463.

On GPU, the executor JVM still crashes (#1576). Since xfail does not handle the process crash gracefully and fails pytest, skip is used instead.

Signed-off-by: Gera Shegalov <gera@apache.org>
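
The skip-instead-of-xfail note above reflects a general limitation: pytest's xfail only catches Python-level failures, while a SIGSEGV in the executor JVM kills the whole worker process. A generic way to keep a hard crash from taking down the test runner is to exercise the crashing path in a child process. This is a standalone sketch of that pattern (not code from the repo; the function name is illustrative):

```python
import signal
import subprocess
import sys

def run_isolated(snippet: str) -> int:
    """Run a Python snippet in a child interpreter and return its exit code.

    A hard crash (e.g. SIGSEGV) in the child surfaces as a negative
    return code (-signal number on POSIX) instead of killing the caller.
    """
    return subprocess.run([sys.executable, "-c", snippet]).returncode

# A well-behaved snippet exits cleanly...
assert run_isolated("print('ok')") == 0
# ...while a segfault is reported to, not propagated into, this process.
assert run_isolated(
    "import os, signal; os.kill(os.getpid(), signal.SIGSEGV)"
) == -signal.SIGSEGV
print("crash contained")
```

Running the crashing load in a subprocess like this would let a test fail (or xfail) normally instead of aborting the whole pytest session.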
@sameerz removed the "? - Needs Triage (Need team to review and classify)" label on Jan 26, 2021
@jlowe
Member

jlowe commented Feb 3, 2021

cudf fixed the crash in rapidsai/cudf#7229. We just need to re-enable the test when the cudf jar picks up that fix.
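
Re-enabling the test amounts to turning the unconditional skip into a condition on the cudf build in use. A hypothetical, pure-Python version gate is sketched below; the function name and the `fixed_in` default are illustrative assumptions, not taken from the repo:

```python
def crash_is_fixed(cudf_version: str, fixed_in: str = "0.18") -> bool:
    """Return True if the cudf jar version carries the fix for this crash.

    'fixed_in' is a placeholder: the real threshold is whichever cudf
    release first includes rapidsai/cudf#7229.
    """
    def parse(v: str) -> tuple:
        # Compare only the leading numeric components ("0.18.0" -> (0, 18, 0)),
        # dropping any pre-release suffix after a dash.
        return tuple(int(p) for p in v.split("-")[0].split("."))
    return parse(cudf_version) >= parse(fixed_in)

# The pytest marker could then become a conditional skip, e.g.
# @pytest.mark.skipif(not crash_is_fixed(current_cudf_version), reason="#1576")
print(crash_is_fixed("0.17.0"), crash_is_fixed("0.18"))  # False True
```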

nartal1 pushed a commit to nartal1/spark-rapids that referenced this issue Jun 9, 2021
Add a test documenting the scenario failing on Spark (CPU) prior to Spark 3.1.0; closes NVIDIA#1463.

On GPU, the executor JVM still crashes (NVIDIA#1576). Since xfail does not handle the process crash gracefully and fails pytest, skip is used instead.

Signed-off-by: Gera Shegalov <gera@apache.org>
tgravescs pushed a commit to tgravescs/spark-rapids that referenced this issue Nov 30, 2023
… handling of multiple GPUs by Docker (NVIDIA#1576)

Signed-off-by: Navin Kumar <navink@nvidia.com>

3 participants