[BUG] Reading Parquet written with INT96 ts doesn't match the CPU when reading back #1007

Closed
razajafri opened this issue Oct 22, 2020 · 1 comment
Assignees
Labels
bug Something isn't working

Comments

@razajafri
Collaborator

Describe the bug
The following test, which generates a single-column DataFrame, fails when ts_write is 'INT96'. The timestamps read back on the CPU and the GPU differ.

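# Imports assumed from the spark-rapids integration test helpers (asserts.py, data_gen.py, marks.py)
import pytest
import pyspark.sql.functions as f

from asserts import assert_gpu_and_cpu_are_equal_collect
from data_gen import TimestampGen, idfn, unary_op_df
from marks import allow_non_gpu, ignore_order

@pytest.mark.parametrize('data_gen', [TimestampGen()], ids=idfn)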
@pytest.mark.parametrize('ts_write', ['INT96', 'TIMESTAMP_MICROS', 'TIMESTAMP_MILLIS'])
@pytest.mark.parametrize('enableVectorized', ['true', 'false'], ids=idfn)
@allow_non_gpu('CollectLimitExec', 'DataWritingCommandExec')
@ignore_order
def test_cache_columnar(spark_tmp_path, data_gen, enableVectorized, ts_write):
    data_path_gpu = spark_tmp_path + '/PARQUET_DATA'
    def read_parquet_cached(data_path):
        def write_read_parquet_cached(spark):
            df = unary_op_df(spark, data_gen)
            df.write.mode('overwrite').parquet(data_path)
            cached = spark.read.parquet(data_path)#.cache()
            cached.count()
            return cached.select(f.col("a"))
        return write_read_parquet_cached
    # rapids-spark doesn't support LEGACY read for parquet
    conf={'spark.sql.legacy.parquet.datetimeRebaseModeInWrite': 'CORRECTED',
          'spark.sql.legacy.parquet.datetimeRebaseModeInRead' : 'CORRECTED',
          'spark.sql.inMemoryColumnarStorage.enableVectorizedReader' : enableVectorized,
          'spark.sql.parquet.outputTimestampType': ts_write}

    assert_gpu_and_cpu_are_equal_collect(read_parquet_cached(data_path_gpu), conf)

Steps/Code to reproduce bug
Run the test above with pytest.

Expected behavior
The test above should pass for every ts_write value, including 'INT96'.

@razajafri razajafri added bug Something isn't working ? - Needs Triage Need team to review and classify labels Oct 22, 2020
@razajafri razajafri changed the title [BUG] Reading Parquet using INT96 fails [BUG] Reading Parquet using INT96 doesn't match the CPU Oct 22, 2020
@razajafri razajafri changed the title [BUG] Reading Parquet using INT96 doesn't match the CPU [BUG] Reading Parquet written with INT96 ts doesn't match the CPU when reading back Oct 22, 2020
@sameerz sameerz added P0 Must have for release and removed ? - Needs Triage Need team to review and classify labels Oct 27, 2020
@sameerz sameerz added this to the Oct 26 - Nov 6 milestone Oct 27, 2020
@jlowe
Member

jlowe commented Oct 28, 2020

This is a duplicate of #132. Changing the test to generate only timestamps from 1590 onward allows it to pass. The problem with timestamps before 1590 is documented, and #132 tracks it.
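As a rough illustration of the workaround described above, the sketch below generates random timestamps with a lower bound of 1590 using only the Python standard library. The names `MIN_SAFE_TS`, `MAX_TS`, and `gen_timestamps` are hypothetical and are not part of the test framework; the upper bound is an arbitrary choice for the example, and the actual test would instead pass an equivalent lower bound to its own data generator.

```python
import random
from datetime import datetime, timedelta, timezone

# Assumption: 1590-01-01 UTC is used as the cutoff, following the comment above.
MIN_SAFE_TS = datetime(1590, 1, 1, tzinfo=timezone.utc)
# Assumption: an arbitrary upper bound chosen only for this illustration.
MAX_TS = datetime(2100, 1, 1, tzinfo=timezone.utc)


def gen_timestamps(n, start=MIN_SAFE_TS, end=MAX_TS, seed=0):
    """Return n random timezone-aware timestamps in [start, end]."""
    rng = random.Random(seed)
    # Whole number of microseconds between the two bounds.
    span_us = (end - start) // timedelta(microseconds=1)
    return [
        start + timedelta(microseconds=rng.randint(0, span_us))
        for _ in range(n)
    ]


if __name__ == "__main__":
    sample = gen_timestamps(5)
    for ts in sample:
        print(ts.isoformat())
```

Every value produced this way falls on or after the cutoff, so a test fed with such data would not exercise the pre-1590 range that #132 tracks.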

@jlowe jlowe closed this as completed Oct 28, 2020
@jlowe jlowe removed the P0 Must have for release label Oct 28, 2020