Fix TimestampGen to generate value not too close to the minimum allowed timestamp [databricks] #9736

Conversation
I'm not sure that this is 100% what we want. Would it be better to do a date_diff in Spark and convert the dates all to integers? And for timestamps, could we convert them to longs? @jlowe @abellina, what do you think? Is it better to restrict the range of timestamps/dates and risk some odd corner case where we copy the data back to the host, or is it better to try to "normalize" the data into something that Python does not have issues with?
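A minimal PySpark sketch of the "normalize before collect" idea being discussed, assuming a hypothetical DataFrame `df` with a date column `d` and a timestamp column `ts` (the names and columns are illustrative, not from this PR):

```python
from pyspark.sql import functions as F

# Convert dates to day offsets (integers) and timestamps to epoch seconds
# (longs) before collecting, so the collected values never go through
# Python datetime conversion and its edge cases near the minimum timestamp.
normalized = df.select(
    F.datediff(F.col("d"), F.lit("1970-01-01")).alias("d_days"),
    F.col("ts").cast("long").alias("ts_epoch_secs"))
rows = normalized.collect()
```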
I would do both. In one case we convert to longs and keep the range; in the other we don't change the query and adjust as @ttnghia has. That way we don't add extra layers that could hide an issue. My two cents.
I'm not sure how to do that. Converting all timestamps/dates to different types is something we can have an annotation for and programmatically insert into the plan before we do a collect. I don't know how to coordinate that with datagen. But let's check this in as is, and I'll file a follow-on issue to explore whether there is a better way to handle this.
I filed #9747.
This modifies the TimestampGen class to generate values at least one month further from the minimum allowed value. When the generated timestamps fall in the month 0001-01, they may cause issues when reading from Spark into PySpark.

Closes #9701.
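A minimal sketch of the bounded-generation idea, assuming a simple uniform random generator with explicit start/end bounds (the actual TimestampGen in this repo differs; all names here are illustrative):

```python
import random
from datetime import datetime, timedelta, timezone

# The minimum timestamp Spark supports is 0001-01-01. Start the generator
# one month further (0001-02-01) so no generated value lands in the
# problematic month 0001-01.
SAFE_START = datetime(1, 2, 1, tzinfo=timezone.utc)
SAFE_END = datetime(9999, 12, 31, tzinfo=timezone.utc)

def gen_timestamp(rng: random.Random,
                  start: datetime = SAFE_START,
                  end: datetime = SAFE_END) -> datetime:
    """Return a uniformly random timestamp in [start, end]."""
    span_seconds = (end - start).total_seconds()
    return start + timedelta(seconds=rng.random() * span_seconds)

# Example: a deterministic draw for reproducibility.
print(gen_timestamp(random.Random(42)))
```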