
[BUG] GPU Parquet output for TIMESTAMP_MICROS is misinterpreted by fastparquet as nanos #8778

Closed
gerashegalov opened this issue Jul 22, 2023 · 10 comments
Labels
bug Something isn't working


Describe the bug
A GPU-written Parquet timestamp is misinterpreted as nanos by fastparquet, while output produced by identical code on the CPU is interpreted correctly.

Steps/Code to reproduce bug

import datetime
import glob
import tempfile

import fastparquet

df = spark.createDataFrame([(datetime.datetime(3023, 7, 14, 7, 38, 45, 418688),)], 'ts timestamp')
cpu_path = tempfile.mkdtemp("cpu_ts")
gpu_path = tempfile.mkdtemp("gpu_ts")
spark.conf.set('spark.sql.parquet.outputTimestampType', 'TIMESTAMP_MICROS')

On the CPU, Spark and fastparquet are consistent:

spark.conf.set('spark.rapids.sql.enabled', False)
df.write.mode('overwrite').parquet(cpu_path)
cpu_file, = glob.glob(f"{cpu_path}/*.parquet")
spark.read.parquet(cpu_path).show(truncate = False)
fastparquet.ParquetFile(cpu_file).head(1)

Out

+--------------------------+
|ts                        |
+--------------------------+
|3023-07-14 07:38:45.418688|
+--------------------------+

                          ts
0 3023-07-14 07:38:45.418688

The GPU's output appears corrupt when read by fastparquet:

spark.conf.set('spark.rapids.sql.enabled', True)
df.write.mode('overwrite').parquet(gpu_path)
gpu_file, = glob.glob(f"{gpu_path}/*.parquet")
spark.read.parquet(gpu_path).show(truncate = False)
fastparquet.ParquetFile(gpu_file).head(1)

Out

+--------------------------+
|ts                        |
+--------------------------+
|3023-07-14 07:38:45.418688|
+--------------------------+

                             ts
0 1854-06-04 08:29:37.999584768

The issue appears to be that, in the GPU case, fastparquet assumes a logical time unit of nanos:

OrderedDict([('max_ts', dtype('<M8[ns]')), ('max_big_ts', dtype('<M8[ns]'))])

because, unlike in the CPU case, the GPU output does not carry the logicalType metadata:

fastparquet.ParquetFile(cpu_file).fmd
...
 logicalType:
    BSON: null
    DATE: null
    DECIMAL: null
    ENUM: null
    INTEGER: null
    JSON: null
    LIST: null
    MAP: null
    STRING: null
    TIME: null
    TIMESTAMP:
      isAdjustedToUTC: true
      unit:
        MICROS: {}
        MILLIS: null
        NANOS: null
    UNKNOWN: null
    UUID: null

Expected behavior
spark-rapids should be interoperable with non-Spark Parquet readers, at least with those that work with upstream Spark.

Environment details (please complete the following information)

  • Environment location: any
  • Spark configuration settings related to the issue: spark.sql.parquet.outputTimestampType='TIMESTAMP_MICROS'

Additional context
Encountered while working on #8625.

@gerashegalov gerashegalov added bug Something isn't working ? - Needs Triage Need team to review and classify labels Jul 22, 2023
revans2 commented Jul 24, 2023

The LogicalType annotation was added to the format more recently, and the cuDF Parquet writer does not support it. We should be tagging the column with TIMESTAMP_MILLIS or TIMESTAMP_MICROS if it is not in nanoseconds, which is what readers otherwise assume by default.

https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#deprecated-timestamp-convertedtype

https://github.com/apache/parquet-format/blob/1603152f8991809e8ad29659dffa224b4284f31b/src/main/thrift/parquet.thrift#L106-L120
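For reference, the relevant deprecated annotations in parquet.thrift look roughly like this (paraphrased from the linked spec, with unrelated enum members elided):

```
enum ConvertedType {
  ...
  TIMESTAMP_MILLIS = 9;   // millis since epoch; the pre-LogicalType annotation
  TIMESTAMP_MICROS = 10;  // micros since epoch; what the GPU file should carry
  ...
}
```

Note that ConvertedType has no nanosecond variant; nanos can only be expressed via the newer LogicalType, so a reader seeing neither annotation has to guess a unit.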


@mattahrens mattahrens removed the ? - Needs Triage Need team to review and classify label Jul 25, 2023
revans2 commented Jul 25, 2023

@gerashegalov what version of fastparquet, numpy and pandas are you using?

When I try to read the CPU file with fastparquet I get an error:

>>> fp_file_cpu.head(1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "python3.7/site-packages/fastparquet/api.py", line 298, in head
    return self[:i+1].to_pandas(**kwargs).head(nrows)
  File "python3.7/site-packages/fastparquet/api.py", line 753, in to_pandas
    row_filter=sel, infile=infile)
  File "python3.7/site-packages/fastparquet/api.py", line 365, in read_row_group_file
    row_filter=row_filter
  File "python3.7/site-packages/fastparquet/core.py", line 609, in read_row_group
    cats, selfmade, assign=assign, row_filter=row_filter)
  File "python3.7/site-packages/fastparquet/core.py", line 583, in read_row_group_arrays
    row_filter=row_filter)
  File "python3.7/site-packages/fastparquet/core.py", line 551, in read_col
    piece[:] = convert(val, se)
  File "python3.7/site-packages/pandas/core/arrays/datetimelike.py", line 373, in __setitem__
    super().__setitem__(key, value)
  File "python3.7/site-packages/pandas/core/arrays/_mixins.py", line 182, in __setitem__
    value = self._validate_setitem_value(value)
  File "python3.7/site-packages/pandas/core/arrays/datetimelike.py", line 745, in _validate_setitem_value
    return self._unbox(value, setitem=True)
  File "python3.7/site-packages/pandas/core/arrays/datetimelike.py", line 757, in _unbox
    self._check_compatible_with(other, setitem=setitem)
  File "python3.7/site-packages/pandas/core/arrays/datetimes.py", line 505, in _check_compatible_with
    if not timezones.tz_compare(self.tz, other.tz):
AttributeError: 'numpy.ndarray' object has no attribute 'tz'

I am running with

fastparquet 0.8.1
numpy 1.21.6
pandas 1.3.5

I also get different results for the GPU file: fastparquet gives me an overflow error.

>>> fp_file.head(1)
OverflowError: value too large
Exception ignored in: 'fastparquet.cencoding.time_shift'
OverflowError: value too large
                             ts
0 1971-01-20 19:04:07.125418688

Not sure what is happening here. I can see that the footers are tagged equivalently, but fastparquet is clearly taking a different path when parsing the GPU file vs the CPU file, because the GPU one does not hit the error that the CPU one does. When I try to read them using pandas I get a very similar overflow error, but it looks like the CPU version has isAdjustedToUTC set to true, and that might be the difference between them.

GPU Error:

pyarrow.lib.ArrowInvalid: Casting from timestamp[us] to timestamp[ns] would result in out of bounds timestamp: 33246247125418688

CPU Error:

pyarrow.lib.ArrowInvalid: Casting from timestamp[us, tz=UTC] to timestamp[ns] would result in out of bounds timestamp: 33246247125418688
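A side note on why both pyarrow errors mention out-of-bounds: the quoted raw value decodes fine as microseconds but cannot fit in a signed 64-bit nanosecond count. A stdlib sketch (the 12:38:45 rendering below is the stored UTC instant; Spark displays it shifted into the session time zone):

```python
import datetime

micros = 33246247125418688  # value from the ArrowInvalid messages above
epoch = datetime.datetime(1970, 1, 1)

# In microseconds the value is representable and decodes to the stored instant
print(epoch + datetime.timedelta(microseconds=micros))
# -> 3023-07-14 12:38:45.418688

# In nanoseconds it exceeds the signed 64-bit range, so any reader that
# insists on ns resolution cannot represent it
INT64_MAX = 2**63 - 1
print(micros * 1000 > INT64_MAX)  # -> True
```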

gerashegalov commented Jul 25, 2023

@revans2 good point, I should have listed the versions in my venv:

import fastparquet, numpy, pandas

for p in [fastparquet, numpy, pandas]:
    print(f"name={p.__name__} version={p.__version__}\n")
name=fastparquet version=2023.7.0

name=numpy version=1.25.1

name=pandas version=2.0.3

added pip list to https://github.com/gerashegalov/rapids-shell/blob/557a96c450a307a206330410b335d346d3cc4170/src/jupyter/timestamp_micros.ipynb

revans2 commented Jul 26, 2023

I have reproduced the issue and gone through it several times. It appears to be a bug in how fastparquet computes a large timestamp from a V1 file. cuDF still emits Parquet files with a V1 footer. When I use Spark 3.1.1 to write the file (which also writes a V1 footer), I get the exact same result: fastparquet thinks it is from 1854.

>>> import fastparquet
>>> cpu_file = fastparquet.ParquetFile("TMP_CPU_311/part-00000-9fcbe985-36aa-4765-86a0-47a2c6cc4926-c000.snappy.parquet")
>>> cpu_file.head(1)
                             ts
0 1854-06-04 13:29:37.999584768
>>> gpu_file = fastparquet.ParquetFile("TMP_GPU/part-00000-98db0b25-66e1-48c2-91bd-ac78f2ac30ee-c000.snappy.parquet")
>>> gpu_file.head(1)
                             ts
0 1854-06-04 13:29:37.999584768
>>> newer_cpu_file = fastparquet.ParquetFile("TMP_CPU_330/part-00000-7aaa467a-aa1b-43db-8102-c604b9c04862-c000.snappy.parquet")
>>> newer_cpu_file.head(1)
                          ts
0 3023-07-14 12:38:45.418688
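The 1854 output is consistent with a 64-bit overflow while scaling micros to nanos: micros * 1000 wraps modulo 2^64, and the wrapped value, read back as a nanosecond timestamp, lands exactly on the instant fastparquet prints. A sketch of the arithmetic (the int64-wrap explanation of the fastparquet behavior is my inference from the numbers, not confirmed from their code):

```python
import datetime

micros = 33246247125418688           # stored value: 3023-07-14 12:38:45.418688 UTC
nanos = micros * 1000                # too large for a signed 64-bit integer
wrapped = nanos % 2**64              # keep only the low 64 bits of the product
if wrapped >= 2**63:                 # reinterpret them as a signed 64-bit value
    wrapped -= 2**64
print(wrapped)                       # -> -3647241022000415232

# Render the wrapped nanosecond count as a UTC datetime (microsecond precision)
secs, ns = divmod(wrapped, 10**9)
dt = datetime.datetime(1970, 1, 1) + datetime.timedelta(seconds=secs, microseconds=ns // 1000)
print(dt)                            # -> 1854-06-04 13:29:37.999584
```

which matches the 1854-06-04 13:29:37.999584768 rows above to microsecond precision.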

@gerashegalov do you want me to file an issue against fastparquet?

Just FYI CUDF is in the process of going to V2 for writes, eventually. rapidsai/cudf#13501

@gerashegalov

> do you want me to file an issue against fastparquet?

Yes, feel free to file a fastparquet issue, @revans2.

revans2 commented Jul 26, 2023

Done. dask/fastparquet#872

revans2 commented Jul 26, 2023

@sameerz should we document this, or do we just close this issue because it is a bug in fastparquet?

sameerz commented Jul 28, 2023

I am inclined to close this as it is a bug in fastparquet.

@gerashegalov

Superseded by the issue dask/fastparquet#872. Thanks @revans2 for investigating.
