[BUG] Dataproc 2.0 test_reading_file_rewritten_with_fastparquet tests failing #9545

Closed
tgravescs opened this issue Oct 25, 2023 · 3 comments · Fixed by #9583
Labels: bug (Something isn't working), test (Only impacts tests)

@tgravescs
Collaborator

Describe the bug
In our integration tests on Dataproc 2.0, the test_reading_file_written_with_fastparquet and test_reading_file_rewritten_with_fastparquet tests are failing:

[2023-10-25T16:07:03.534Z] FAILED ../../src/main/python/fastparquet_compatibility_test.py::test_reading_file_written_with_fastparquet[Byte(not_null)][INJECT_OOM] - FileNotFoundError: [Errno 2] No such file or directory: '/tmp/pyspark_tests...
[2023-10-25T16:07:03.534Z] FAILED ../../src/main/python/fastparquet_compatibility_test.py::test_reading_file_written_with_fastparquet[Byte][INJECT_OOM] - FileNotFoundError: [Errno 2] No such file or directory: '/tmp/pyspark_tests...
[2023-10-25T16:07:03.534Z] FAILED ../../src/main/python/fastparquet_compatibility_test.py::test_reading_file_written_with_fastparquet[Short(not_null)][INJECT_OOM] - FileNotFoundError: [Errno 2] No such file or directory: '/tmp/pyspark_tests...
[2023-10-25T16:07:03.534Z] FAILED ../../src/main/python/fastparquet_compatibility_test.py::test_reading_file_written_with_fastparquet[Short][INJECT_OOM] - FileNotFoundError: [Errno 2] No such file or directory: '/tmp/pyspark_tests...
[2023-10-25T16:07:03.534Z] FAILED ../../src/main/python/fastparquet_compatibility_test.py::test_reading_file_written_with_fastparquet[Integer(not_null)][INJECT_OOM] - FileNotFoundError: [Errno 2] No such file or directory: '/tmp/pyspark_tests...
[2023-10-25T16:07:03.534Z] FAILED ../../src/main/python/fastparquet_compatibility_test.py::test_reading_file_written_with_fastparquet[Integer][INJECT_OOM] - FileNotFoundError: [Errno 2] No such file or directory: '/tmp/pyspark_tests...
[2023-10-25T16:07:03.534Z] FAILED ../../src/main/python/fastparquet_compatibility_test.py::test_reading_file_written_with_fastparquet[Long(not_null)] - FileNotFoundError: [Errno 2] No such file or directory: '/tmp/pyspark_tests...
[2023-10-25T16:07:03.534Z] FAILED ../../src/main/python/fastparquet_compatibility_test.py::test_reading_file_written_with_fastparquet[Long][INJECT_OOM] - FileNotFoundError: [Errno 2] No such file or directory: '/tmp/pyspark_tests...
[2023-10-25T16:07:03.534Z] FAILED ../../src/main/python/fastparquet_compatibility_test.py::test_reading_file_written_with_fastparquet[Float(not_null)][INJECT_OOM] - FileNotFoundError: [Errno 2] No such file or directory: '/tmp/pyspark_tests...
[2023-10-25T16:07:03.534Z] FAILED ../../src/main/python/fastparquet_compatibility_test.py::test_reading_file_written_with_fastparquet[Float] - FileNotFoundError: [Errno 2] No such file or directory: '/tmp/pyspark_tests...
[2023-10-25T16:07:03.534Z] FAILED ../../src/main/python/fastparquet_compatibility_test.py::test_reading_file_written_with_fastparquet[Double(not_null)] - FileNotFoundError: [Errno 2] No such file or directory: '/tmp/pyspark_tests...
[2023-10-25T16:07:03.534Z] FAILED ../../src/main/python/fastparquet_compatibility_test.py::test_reading_file_written_with_fastparquet[Double] - FileNotFoundError: [Errno 2] No such file or directory: '/tmp/pyspark_tests...
[2023-10-25T16:07:03.534Z] FAILED ../../src/main/python/fastparquet_compatibility_test.py::test_reading_file_written_with_fastparquet[Decimal(not_null)(18,0)][INJECT_OOM] - FileNotFoundError: [Errno 2] No such file or directory: '/tmp/pyspark_tests...
[2023-10-25T16:07:03.534Z] FAILED ../../src/main/python/fastparquet_compatibility_test.py::test_reading_file_written_with_fastparquet[Decimal(18,0)][INJECT_OOM] - FileNotFoundError: [Errno 2] No such file or directory: '/tmp/pyspark_tests...
[2023-10-25T16:07:03.534Z] FAILED ../../src/main/python/fastparquet_compatibility_test.py::test_reading_file_rewritten_with_fastparquet[Date(not_null)-int961] - FileNotFoundError: [Errno 2] No such file or directory: '/tmp/pyspark_tests...
[2023-10-25T16:07:03.534Z] FAILED ../../src/main/python/fastparquet_compatibility_test.py::test_reading_file_rewritten_with_fastparquet[Date-int96] - FileNotFoundError: [Errno 2] No such file or directory: '/tmp/pyspark_tests...
[2023-10-25T16:07:03.534Z] FAILED ../../src/main/python/fastparquet_compatibility_test.py::test_reading_file_rewritten_with_fastparquet[Timestamp(not_null)-int960][INJECT_OOM] - FileNotFoundError: [Errno 2] No such file or directory: '/tmp/pyspark_tests...
[2023-10-25T16:07:03.534Z] FAILED ../../src/main/python/fastparquet_compatibility_test.py::test_reading_file_rewritten_with_fastparquet[Timestamp-int96] - FileNotFoundError: [Errno 2] No such file or directory: '/tmp/pyspark_tests...
[2023-10-25T16:07:03.534Z] FAILED ../../src/main/python/fastparquet_compatibility_test.py::test_reading_file_rewritten_with_fastparquet[Struct(not_null)(('first', Integer(not_null)))-int96][INJECT_OOM] - FileNotFoundError: [Errno 2] No such file or directory: '/tmp/pyspark_tests...


Not sure why it has no such file, maybe there was another crash or failure that caused these?

tgravescs added the bug (Something isn't working) and ? - Needs Triage (Need team to review and classify) labels on Oct 25, 2023
mythrocks self-assigned this on Oct 25, 2023
@mythrocks
Collaborator

> Not sure why it has no such file,

Right. It's not clear to me from the logs either. I'll try to repro it manually.

@mythrocks
Collaborator

OK, I think I understand the problem. fastparquet has no notion of writing to any file system other than file:///.

The test should write to the local file system and then copy the file over to HDFS.
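
A minimal sketch of that write-locally-then-copy flow, for illustration only. It is not the code from the eventual fix. The helper names `hadoop_put_command` and `write_local_then_upload` are hypothetical, and the sketch assumes that the `hadoop` CLI is on the `PATH` and that `fastparquet` is installed alongside pandas.

```python
import os
import subprocess
import tempfile


def hadoop_put_command(local_path, remote_path):
    """Build the `hadoop fs -put` command that uploads one local file.

    `-f` overwrites the destination if it already exists. A destination
    without a scheme is resolved against the cluster's default filesystem.
    """
    return ["hadoop", "fs", "-put", "-f", local_path, remote_path]


def write_local_then_upload(df, remote_path):
    """Write `df` with fastparquet to a local temp file, then copy it out.

    Hypothetical helper: `df` is a pandas DataFrame and `remote_path` is
    the location Spark will read from.
    """
    # Imported lazily so the command helper above works without fastparquet.
    import fastparquet

    with tempfile.TemporaryDirectory() as tmp_dir:
        local_path = os.path.join(tmp_dir, "data.parquet")
        # fastparquet is given a plain local path here.
        fastparquet.write(local_path, df)
        # Copy the file into the cluster's filesystem so Spark can read it.
        subprocess.run(hadoop_put_command(local_path, remote_path), check=True)
```

Shelling out to `hadoop fs` keeps the sketch independent of any Spark session; a test suite that already holds a session could perform the copy through the Hadoop `FileSystem` API instead.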

mythrocks added a commit to mythrocks/spark-rapids that referenced this issue Oct 30, 2023
Fixes NVIDIA#9545.

This commit fixes the `fastparquet` tests to run on Spark clusters where
the `fs.default.name` does not point to the local filesystem.

Before this commit, the `fastparquet` tests assumed that the parquet files
generated for the tests were written to local filesystem, and could be read
from both `fastparquet` and Spark from the same location.  However, this fails
when run against clusters whose default filesystem is HDFS. `fastparquet` can
only read from the local filesystem.

This commit changes the tests as follows:
1. For tests where data is generated by Spark, the data is copied to local
   filesystem before it is read by `fastparquet`.
2. For tests where data is generated by `fastparquet`, the data is copied
   to the default Hadoop filesystem before reading through Spark.

Signed-off-by: MithunR <mythrocks@gmail.com>
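
Whether the extra copy step is needed depends on the scheme of the default filesystem URI (`fs.default.name` is the deprecated name of Hadoop's `fs.defaultFS` property). A small sketch of that check follows; the function names are hypothetical and this is not taken from the commit itself.

```python
from urllib.parse import urlparse


def is_local_default_fs(default_fs_uri):
    """Return True when the default filesystem URI is the local filesystem."""
    scheme = urlparse(default_fs_uri or "").scheme
    # Hadoop falls back to file:/// when no default filesystem is configured,
    # so an empty or missing value is treated as local.
    return scheme in ("", "file")


def needs_copy_for_fastparquet(default_fs_uri):
    """Return True when files must be copied between local disk and the cluster FS."""
    return not is_local_default_fs(default_fs_uri)
```

For example, `needs_copy_for_fastparquet("hdfs://namenode:8020")` is true, while `needs_copy_for_fastparquet("file:///")` is false; the host name and port are made up for illustration.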
mythrocks added a commit to mythrocks/spark-rapids that referenced this issue Oct 31, 2023
mythrocks added the test (Only impacts tests) label and removed the ? - Needs Triage (Need team to review and classify) label on Oct 31, 2023
@mythrocks
Collaborator

#9583

mythrocks added a commit that referenced this issue Oct 31, 2023