
Fix TimestampGen to generate value not too close to the minimum allowed timestamp [databricks] #9736

Merged
merged 86 commits into NVIDIA:branch-23.12 from fix_9701
Nov 17, 2023

Conversation

ttnghia
Collaborator

@ttnghia ttnghia commented Nov 16, 2023

This modifies the TimestampGen class to generate values one month further from the minimum allowed value. Timestamps that fall in the month 0001-01 can cause an issue when reading from Spark into PySpark:

self = TimestampType, ts = -62135596800000000

    def fromInternal(self, ts):
        if ts is not None:
            # using int to avoid precision loss in float
>           return datetime.datetime.fromtimestamp(ts // 1000000).replace(microsecond=ts % 1000000)
E           ValueError: year 0 is out of range

Closes #9701.
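
For illustration, here is a minimal, self-contained sketch of the failure and of the adjustment described above. It only mirrors what PySpark's TimestampType.fromInternal does; the constant names and the clamp_start helper below are hypothetical, not the plugin's actual code.

    import datetime

    # 0001-01-01T00:00:00Z as microseconds from the Unix epoch: the minimum
    # timestamp Spark supports, and the value shown in the traceback above.
    MIN_TS_MICROS = -62135596800000000

    # PySpark's TimestampType.fromInternal does essentially this. For values
    # at the very bottom of the range, converting to *local* time can land
    # before year 1, which Python's datetime cannot represent.
    try:
        dt = datetime.datetime.fromtimestamp(MIN_TS_MICROS // 1000000).replace(
            microsecond=MIN_TS_MICROS % 1000000)
    except (ValueError, OverflowError, OSError) as e:
        print(f"conversion failed: {e}")  # e.g. "year 0 is out of range"

    # The adjustment, with hypothetical names: keep generated timestamps at
    # least one month past the minimum so that nothing falls in month 0001-01.
    SAFE_MIN = datetime.datetime(1, 2, 1, tzinfo=datetime.timezone.utc)

    def clamp_start(requested_start):
        # requested_start must be timezone-aware for the comparison to work.
        return max(requested_start, SAFE_MIN)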

ttnghia and others added 30 commits August 28, 2023 16:15
Signed-off-by: Nghia Truong <nghiat@nvidia.com>
# Conflicts:
#	sql-plugin/src/main/scala/com/nvidia/spark/RebaseHelper.scala
#	sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuParquetScan.scala
Signed-off-by: Nghia Truong <nghiat@nvidia.com>
Signed-off-by: Nghia Truong <nghiat@nvidia.com>
Signed-off-by: Nghia Truong <nghiat@nvidia.com>
Signed-off-by: Nghia Truong <nghiat@nvidia.com>
Signed-off-by: Nghia Truong <nghiat@nvidia.com>
Signed-off-by: Nghia Truong <nghiat@nvidia.com>
Signed-off-by: Nghia Truong <nghiat@nvidia.com>
Signed-off-by: Nghia Truong <nghiat@nvidia.com>
Signed-off-by: Nghia Truong <nghiat@nvidia.com>
…IA#9617)"

This reverts commit 401d0d8.

Signed-off-by: Nghia Truong <nghiat@nvidia.com>

# Conflicts:
#	sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuParquetScan.scala
# Conflicts:
#	sql-plugin/src/main/scala/com/nvidia/spark/RebaseHelper.scala
#	sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuParquetScan.scala
Signed-off-by: Nghia Truong <nghiat@nvidia.com>
Signed-off-by: Nghia Truong <nghiat@nvidia.com>
Signed-off-by: Nghia Truong <nghiat@nvidia.com>
Signed-off-by: Nghia Truong <nghiat@nvidia.com>
@ttnghia ttnghia added the test Only impacts tests label Nov 16, 2023
@ttnghia ttnghia self-assigned this Nov 16, 2023
@ttnghia ttnghia marked this pull request as draft November 16, 2023 00:21
@ttnghia
Collaborator Author

ttnghia commented Nov 16, 2023

build

@ttnghia ttnghia changed the title Fix TimestampGen to generate value not too close to the minimum allowed timestamp Fix TimestampGen to generate value not too close to the minimum allowed timestamp [databricks] Nov 16, 2023
@ttnghia
Collaborator Author

ttnghia commented Nov 16, 2023

build

# Conflicts:
#	integration_tests/src/main/python/parquet_test.py
#	integration_tests/src/main/python/parquet_write_test.py
@ttnghia ttnghia marked this pull request as ready for review November 16, 2023 17:07
@ttnghia ttnghia requested a review from revans2 November 16, 2023 17:07
@ttnghia
Collaborator Author

ttnghia commented Nov 16, 2023

build

@revans2
Collaborator

revans2 commented Nov 16, 2023

I'm not sure that this is 100% what we want. Would it be better to do a date_diff in Spark and convert them all to Integers? And for timestamps could we use unix_micros to get the results out as longs?

@jlowe @abellina what do you think? Is it better to restrict the range of timestamps/dates and risk that we hit some odd corner case when we copy the data back to the host? Or is it better to try and "normalize" the data into something that Python does not have issues with?
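
As a hedged sketch of that normalization (the DataFrame df and the columns d and ts are made-up names, not anything from this PR), the conversion before a collect might look like:

    from pyspark.sql import functions as F

    # Hypothetical columns: `d` is a DATE column and `ts` is a TIMESTAMP
    # column of some DataFrame `df`. Dates become day offsets from the epoch
    # (ints) and timestamps become microsecond counts (longs), so Python never
    # has to construct a datetime object on collect.
    normalized = df.select(
        F.datediff(F.col("d"), F.lit("1970-01-01").cast("date")).alias("d_days"),
        F.expr("unix_micros(ts)").alias("ts_micros"),
    )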

@abellina
Collaborator

abellina commented Nov 16, 2023

I would do both. In one case we convert to longs and keep the range; in the other we don't change the query and adjust as @ttnghia has. That way we don't add extra layers that could hide an issue. My two cents.

@revans2
Collaborator

revans2 commented Nov 16, 2023

I would do both. In one case we convert to longs and keep the range; in the other we don't change the query and adjust as @ttnghia has. That way we don't add extra layers that could hide an issue. My two cents.

I'm not sure how to do that. Converting all timestamps/dates to different types is something that we can have an annotation for and programmatically insert into the plan before we do a collect. I don't know how to coordinate that with datagen. But let's check this in as-is, and I'll file a follow-on issue to explore whether there is a better way to handle this.
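
One hypothetical shape for that annotation idea, sketched only to make it concrete (nothing here is the plugin's actual code): a helper that rewrites every timestamp/date column just before the results are collected.

    from pyspark.sql import functions as F
    from pyspark.sql.types import DateType, TimestampType

    def normalize_for_collect(df):
        # Rewrite the plan right before collect(): timestamps become
        # microsecond longs, dates become int days since the epoch, and every
        # other column is passed through unchanged.
        cols = []
        for field in df.schema.fields:
            if isinstance(field.dataType, TimestampType):
                cols.append(F.expr(f"unix_micros(`{field.name}`)").alias(field.name))
            elif isinstance(field.dataType, DateType):
                days = F.datediff(F.col(field.name), F.lit("1970-01-01").cast("date"))
                cols.append(days.alias(field.name))
            else:
                cols.append(F.col(field.name))
        return df.select(cols)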

@revans2
Collaborator

revans2 commented Nov 16, 2023

I filed #9747

@ttnghia
Collaborator Author

ttnghia commented Nov 16, 2023

build

@ttnghia ttnghia merged commit 244ceab into NVIDIA:branch-23.12 Nov 17, 2023
37 checks passed
@ttnghia ttnghia deleted the fix_9701 branch November 17, 2023 03:33
Labels
test Only impacts tests
3 participants