
Fix TimestampGen to generate value not too close to the minimum allowed timestamp [databricks] #9736

Merged
merged 86 commits into NVIDIA:branch-23.12 from fix_9701
Nov 17, 2023

Conversation

ttnghia
Collaborator

@ttnghia ttnghia commented Nov 16, 2023

This modifies the TimestampGen class to generate values one month further from the minimum allowed value. Timestamps that fall in the month 0001-01 can cause an issue when reading from Spark into PySpark:

self = TimestampType, ts = -62135596800000000

    def fromInternal(self, ts):
        if ts is not None:
            # using int to avoid precision loss in float
>           return datetime.datetime.fromtimestamp(ts // 1000000).replace(microsecond=ts % 1000000)
E           ValueError: year 0 is out of range

Closes #9701.
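
For illustration, here is a minimal, self-contained sketch of the failure and of the adjustment described above. It only mirrors what PySpark's TimestampType.fromInternal does; the constant names and the clamp_start helper below are hypothetical, not the plugin's actual code.

    import datetime

    # 0001-01-01T00:00:00Z as microseconds from the Unix epoch: the minimum
    # timestamp Spark supports, and the value shown in the traceback above.
    MIN_TS_MICROS = -62135596800000000

    # PySpark's TimestampType.fromInternal does essentially this. For values
    # at the very bottom of the range, converting to *local* time can land
    # before year 1, which Python's datetime cannot represent.
    try:
        dt = datetime.datetime.fromtimestamp(MIN_TS_MICROS // 1000000).replace(
            microsecond=MIN_TS_MICROS % 1000000)
    except (ValueError, OverflowError, OSError) as e:
        print(f"conversion failed: {e}")  # e.g. "year 0 is out of range"

    # The adjustment, with hypothetical names: keep generated timestamps at
    # least one month past the minimum so that nothing falls in month 0001-01.
    SAFE_MIN = datetime.datetime(1, 2, 1, tzinfo=datetime.timezone.utc)

    def clamp_start(requested_start):
        # requested_start must be timezone-aware for the comparison to work.
        return max(requested_start, SAFE_MIN)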

ttnghia and others added 30 commits August 28, 2023 16:15
Signed-off-by: Nghia Truong <nghiat@nvidia.com>
# Conflicts:
#	sql-plugin/src/main/scala/com/nvidia/spark/RebaseHelper.scala
#	sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuParquetScan.scala
Signed-off-by: Nghia Truong <nghiat@nvidia.com>
Signed-off-by: Nghia Truong <nghiat@nvidia.com>
Signed-off-by: Nghia Truong <nghiat@nvidia.com>
Signed-off-by: Nghia Truong <nghiat@nvidia.com>
Signed-off-by: Nghia Truong <nghiat@nvidia.com>
Signed-off-by: Nghia Truong <nghiat@nvidia.com>
Signed-off-by: Nghia Truong <nghiat@nvidia.com>
Signed-off-by: Nghia Truong <nghiat@nvidia.com>
Signed-off-by: Nghia Truong <nghiat@nvidia.com>
…IA#9617)"

This reverts commit 401d0d8.

Signed-off-by: Nghia Truong <nghiat@nvidia.com>

# Conflicts:
#	sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuParquetScan.scala
# Conflicts:
#	sql-plugin/src/main/scala/com/nvidia/spark/RebaseHelper.scala
#	sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuParquetScan.scala
Signed-off-by: Nghia Truong <nghiat@nvidia.com>
Signed-off-by: Nghia Truong <nghiat@nvidia.com>
Signed-off-by: Nghia Truong <nghiat@nvidia.com>
Signed-off-by: Nghia Truong <nghiat@nvidia.com>
@ttnghia ttnghia added the test Only impacts tests label Nov 16, 2023
@ttnghia ttnghia self-assigned this Nov 16, 2023
@ttnghia ttnghia marked this pull request as draft November 16, 2023 00:21
@ttnghia
Collaborator Author

ttnghia commented Nov 16, 2023

build

@ttnghia ttnghia changed the title Fix TimestampGen to generate value not too close to the minimum allowed timestamp Fix TimestampGen to generate value not too close to the minimum allowed timestamp [databricks] Nov 16, 2023
@ttnghia
Collaborator Author

ttnghia commented Nov 16, 2023

build

# Conflicts:
#	integration_tests/src/main/python/parquet_test.py
#	integration_tests/src/main/python/parquet_write_test.py
@ttnghia ttnghia marked this pull request as ready for review November 16, 2023 17:07
@ttnghia ttnghia requested a review from revans2 November 16, 2023 17:07
@ttnghia
Collaborator Author

ttnghia commented Nov 16, 2023

build

@revans2
Collaborator

revans2 commented Nov 16, 2023

I'm not sure that this is 100% what we want. Would it be better to do a date_diff in Spark and convert them all to Integers? And for timestamps could we use unix_micros to get the results out as longs?

@jlowe @abellina what do you think? Is it better to restrict the range of timestamps/dates and risk that we hit some odd corner case when we copy the data back to the host? Or is it better to try and "normalize" the data into something that Python does not have issues with?
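
As a hedged sketch of that normalization (the DataFrame df and the columns d and ts are made-up names, not anything from this PR), the conversion before a collect might look like:

    from pyspark.sql import functions as F

    # Hypothetical columns: `d` is a DATE column and `ts` is a TIMESTAMP
    # column of some DataFrame `df`. Dates become day offsets from the epoch
    # (ints) and timestamps become microsecond counts (longs), so Python never
    # has to construct a datetime object on collect.
    normalized = df.select(
        F.datediff(F.col("d"), F.lit("1970-01-01").cast("date")).alias("d_days"),
        F.expr("unix_micros(ts)").alias("ts_micros"),
    )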

@abellina
Collaborator

abellina commented Nov 16, 2023

I would do both. In one case we convert to longs and keep the range; in the other we don't change the query and adjust as @ttnghia has. That way we don't add extra layers that could hide an issue. My two cents.

@revans2
Collaborator

revans2 commented Nov 16, 2023

I would do both. In one case we convert to longs and keep the range; in the other we don't change the query and adjust as @ttnghia has. That way we don't add extra layers that could hide an issue. My two cents.

I'm not sure how to do that. Converting all timestamps/dates to different types is something that we can have an annotation for and programmatically insert into the plan before we do a collect. I don't know how to coordinate that with datagen. But let's check this in as-is, and I'll file a follow-on issue to explore whether there is a better way to handle this.
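
One hypothetical shape for that annotation idea, sketched only to make it concrete (nothing here is the plugin's actual code): a helper that rewrites every timestamp/date column just before the results are collected.

    from pyspark.sql import functions as F
    from pyspark.sql.types import DateType, TimestampType

    def normalize_for_collect(df):
        # Rewrite the plan right before collect(): timestamps become
        # microsecond longs, dates become int days since the epoch, and every
        # other column is passed through unchanged.
        cols = []
        for field in df.schema.fields:
            if isinstance(field.dataType, TimestampType):
                cols.append(F.expr(f"unix_micros(`{field.name}`)").alias(field.name))
            elif isinstance(field.dataType, DateType):
                days = F.datediff(F.col(field.name), F.lit("1970-01-01").cast("date"))
                cols.append(days.alias(field.name))
            else:
                cols.append(F.col(field.name))
        return df.select(cols)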

@revans2
Collaborator

revans2 commented Nov 16, 2023

I filed #9747

@ttnghia
Collaborator Author

ttnghia commented Nov 16, 2023

build

@ttnghia ttnghia merged commit 244ceab into NVIDIA:branch-23.12 Nov 17, 2023
37 checks passed
@ttnghia ttnghia deleted the fix_9701 branch November 17, 2023 03:33
Labels
test Only impacts tests
3 participants