[FEA] Timezone conversions for timestamps #2477

Open
jlowe opened this issue Aug 6, 2019 · 9 comments
Labels
cuIO, feature request, libcudf, Spark

Comments

@jlowe
Member

jlowe commented Aug 6, 2019

Is your feature request related to a problem? Please describe.
Apache Spark internally tracks timestamps relative to the epoch in the JVM's default timezone. cudf uses timestamps relative to the epoch in the UTC timezone (at least in the ORC loader). When Spark tries to interpret timestamp data created or loaded by cudf, the values are incorrect (from Spark's viewpoint) unless the JVM's default timezone is UTC.

Describe the solution you'd like
A timezone conversion utility function for timestamps in libcudf would help, as the timestamp column could be converted from UTC to the JVM's timezone. The utility function would take an input timestamp column, a source timezone, and a destination timezone, and would return a new timestamp column with the values adjusted by the offset between the two timezones in effect at each timestamp's instant. Note that due to effects like daylight saving time, the delta between timezones is not necessarily constant, so this utility can't be replaced by a simple scalar addition on the timestamp column data.
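
To make the non-constant offset concrete, here is a small host-side C++20 `<chrono>` sketch (not a libcudf API; it assumes a timezone-database-aware standard library and uses America/New_York purely as an example zone):

```cpp
#include <chrono>
#include <iostream>

// Why a scalar shift is insufficient: the UTC offset of a timezone changes
// across daylight saving transitions, so each timestamp needs the offset
// that was in effect at its own instant.
int main() {
    using namespace std::chrono;
    auto const* tz = locate_zone("America/New_York");
    sys_seconds winter = sys_days{January/15/2019};    // EST, UTC-05:00
    sys_seconds summer = sys_days{July/15/2019};       // EDT, UTC-04:00
    std::cout << tz->get_info(winter).offset << '\n';  // -18000s
    std::cout << tz->get_info(summer).offset << '\n';  // -14400s
}
```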

Describe alternatives you've considered
Adding an optional target-timezone parameter to the various cudf data loaders (like the ORC reader) could work as well, and could be more efficient since it avoids a separate CUDA kernel for the conversion from UTC. However, the utility method seems more flexible, covering cases where the timestamp data didn't originate from a cudf data loader.

@jlowe added the feature request, Needs Triage, and libcudf labels Aug 6, 2019
@OlivierNV
Contributor

Is this specific to ORC? ORC has the concept of a writer timezone, and we actually convert from the writer timezone to UTC when loading, so if the intent is to keep the native timezone of the ORC file, it could be as simple as a flag to bypass the existing timezone conversion.
If it's a generic target-timezone conversion, it should be relatively easy to do in the ORC loader (converting to a particular timezone isn't much different from the existing conversion to UTC). For other loaders that are natively UTC, a standalone utility method would be preferable.
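
For concreteness, a minimal sketch of the "flag to bypass the conversion" idea; the `orc_reader_options` builder and `read_orc()` are real cudf APIs, but `use_writer_timezone()` is hypothetical and shown only to illustrate the suggestion:

```cpp
#include <cudf/io/orc.hpp>
#include <string>

// Sketch only: use_writer_timezone() does not exist in cudf; it stands in
// for a hypothetical option that skips the writer-timezone -> UTC shift.
cudf::io::table_with_metadata read_keeping_writer_timezone(std::string const& path)
{
  auto opts = cudf::io::orc_reader_options::builder(cudf::io::source_info{path})
                  // .use_writer_timezone(true)  // hypothetical flag
                  .build();
  return cudf::io::read_orc(opts);
}
```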

@jlowe
Member Author

jlowe commented Aug 7, 2019

It's not specific to ORC. The intent is not to preserve the timezone of the file but rather to target a timezone specified by the caller. ORC is already doing timezone conversions, which is why I mention it here. It would be slightly more efficient to have ORC convert directly to a specified timezone rather than convert to UTC and have the caller then convert to another timezone.

I agree that a standalone utility method would be best if we're only going to have one way to rebase timestamps. That would also apply to the CSV and Parquet loaders, which don't do any timezone conversions today.

The utility would also provide a way to convert Spark timestamps back to UTC. This would probably be needed when cudf supports writing CSV, Parquet, and ORC, as I would expect those writers to assume all timestamps are UTC. Spark would need to convert timestamps from the JVM timezone to UTC before cudf writes them, unless the writers supported a parameter specifying the timestamp timezone reference.

@jlowe added the Spark label Aug 7, 2019
@OlivierNV added the cuIO label Aug 8, 2019
@OlivierNV
Contributor

Timezone conversions are super-messy, so I'd rather limit the impact of timezones (it also tends to be platform-specific since we need to parse timezone files from the OS to build conversion tables from/to UTC).
Compared to the column loader, a standalone timezone conversion should be very fast since it's fully data-parallel (internally, timezoneA->timezoneB would most likely be implemented as tsB = fromUTC(toUTC(tsA, A), B)).
If the goal is an arbitrary target timezone, I think an in-place standalone conversion would be the way to go.
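
A host-side sketch of that composition using C++20 `<chrono>` (an assumption about the shape, not the eventual libcudf implementation); `toUTC`/`fromUTC` map onto `time_zone::to_sys` and `to_local`, and because each element is converted independently the whole operation is fully data-parallel:

```cpp
#include <chrono>

using namespace std::chrono;

// toUTC: interpret a wall-clock instant in zone A as a UTC instant.
// fromUTC: express a UTC instant as wall-clock time in zone B.
// Note: to_sys() throws for ambiguous/nonexistent local times around DST
// transitions, so a real implementation needs a policy for those cases.
sys_seconds toUTC(local_seconds ts, time_zone const* zone_a) {
    return zone_a->to_sys(ts);
}

local_seconds fromUTC(sys_seconds ts, time_zone const* zone_b) {
    return zone_b->to_local(ts);
}

// tsB = fromUTC(toUTC(tsA, A), B): per-element and independent, so a GPU
// version could apply it across a whole timestamp column in one kernel.
local_seconds convert(local_seconds tsA, time_zone const* A, time_zone const* B) {
    return fromUTC(toUTC(tsA, A), B);
}
```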

@mjsamoht removed the Needs Triage label Aug 9, 2019
@jrhemstad
Contributor

Since we are using libcu++ for our timestamp types, we should be using the C++20 timezone APIs that libcu++ will provide. CC @trxcllnt.
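
For reference, the C++20 timezone API being referred to looks like this host-side in `std::chrono`; the expectation in this thread is that libcu++ would eventually expose an equivalent usable from device code (the example zone name is arbitrary):

```cpp
#include <chrono>
#include <iostream>

// zoned_time pairs a UTC instant with a time_zone, giving local wall-clock time.
int main() {
    using namespace std::chrono;
    auto utc_now = system_clock::now();                               // sys_time (UTC-based)
    zoned_time la_time{locate_zone("America/Los_Angeles"), utc_now};
    std::cout << la_time << '\n';                                     // local wall-clock time
}
```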

@trxcllnt
Contributor

trxcllnt commented Dec 5, 2019

@brycelelbach does libcu++ have plans to load the system timezone database? I recall @ogiroux saying that might not be supported yet.

@jrhemstad
Contributor

@brycelelbach promised me they would add it for us :)

@sameerz
Contributor

sameerz commented Nov 12, 2020

Any update on whether libcu++ will load the timezone database?

@brycelelbach
Contributor

Please file an RFE on github.com/NVIDIA/libcudacxx. It's not on our immediate roadmap, and slightly tricky, but we'll get to it eventually, depending on how urgently you need it.

@jrhemstad
Contributor

NVIDIA/cccl#954
