[FEA] Timezone conversions for timestamps #2477

Open
jlowe opened this issue Aug 6, 2019 · 9 comments
Labels
cuIO, feature request, libcudf, Spark

Comments

@jlowe
Member

jlowe commented Aug 6, 2019

Is your feature request related to a problem? Please describe.
Apache Spark internally tracks timestamps relative to the epoch in the JVM's default timezone. cudf uses timestamps relative to the epoch in the UTC timezone (at least in the ORC loader). When Spark tries to interpret timestamp data created or loaded by cudf, the values are incorrect (from Spark's viewpoint) unless the JVM's default timezone is UTC.

Describe the solution you'd like
A timezone conversion utility function for timestamps in libcudf would help, as the timestamp column could be converted from UTC to the JVM's timezone. The utility function would take an input timestamp column, a source timezone, and a destination timezone, and would return a new timestamp column with the values adjusted by the offset between the two timezones in effect at each timestamp's instant. Note that due to effects like daylight saving time, the delta between timezones is not necessarily constant, so this utility can't be replaced by a simple scalar addition on the timestamp column data.
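
To make the non-constant offset concrete, here is a small host-side C++20 `<chrono>` sketch (not a libcudf API; it assumes a timezone-database-aware standard library and uses America/New_York purely as an example zone):

```cpp
#include <chrono>
#include <iostream>

// Why a scalar shift is insufficient: the UTC offset of a timezone changes
// across daylight saving transitions, so each timestamp needs the offset
// that was in effect at its own instant.
int main() {
    using namespace std::chrono;
    auto const* tz = locate_zone("America/New_York");
    sys_seconds winter = sys_days{January/15/2019};    // EST, UTC-05:00
    sys_seconds summer = sys_days{July/15/2019};       // EDT, UTC-04:00
    std::cout << tz->get_info(winter).offset << '\n';  // -18000s
    std::cout << tz->get_info(summer).offset << '\n';  // -14400s
}
```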

Describe alternatives you've considered
Adding an optional target-timezone parameter to the various cudf data loaders (like the ORC reader) could work as well, and could be more efficient since it avoids a separate CUDA kernel for the conversion from UTC. However, the utility method seems more flexible, covering cases where the timestamp data didn't originate from a cudf data loader.

@jlowe added the feature request, Needs Triage, and libcudf labels Aug 6, 2019
@OlivierNV
Contributor

Is this specific to ORC? ORC has the concept of a writer timezone, and we actually convert from the writer timezone to UTC when loading, so if the intent is to keep the native timezone of the ORC file, it could be as simple as a flag to bypass the existing timezone conversion.
If it's a generic target-timezone conversion, it should be relatively easy to do in the ORC loader (converting to a particular timezone isn't much different from the existing conversion to UTC). For other loaders that are natively UTC, a standalone utility method would be preferable.
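
For concreteness, a minimal sketch of the "flag to bypass the conversion" idea; the `orc_reader_options` builder and `read_orc()` are real cudf APIs, but `use_writer_timezone()` is hypothetical and shown only to illustrate the suggestion:

```cpp
#include <cudf/io/orc.hpp>
#include <string>

// Sketch only: use_writer_timezone() does not exist in cudf; it stands in
// for a hypothetical option that skips the writer-timezone -> UTC shift.
cudf::io::table_with_metadata read_keeping_writer_timezone(std::string const& path)
{
  auto opts = cudf::io::orc_reader_options::builder(cudf::io::source_info{path})
                  // .use_writer_timezone(true)  // hypothetical flag
                  .build();
  return cudf::io::read_orc(opts);
}
```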

@jlowe
Member Author

jlowe commented Aug 7, 2019

It's not specific to ORC. The intent is not to preserve the timezone of the file but rather to target a timezone specified by the caller. ORC is already doing timezone conversions, which is why I mention it here. It would be slightly more efficient to have ORC convert directly to a specified timezone rather than convert to UTC and have the caller then convert to another timezone.

I agree that a standalone utility method would be best if we're only going to have one way to rebase timestamps. That would also apply to the CSV and Parquet loaders, which don't do any timezone conversions today.

The utility would also provide a way to convert Spark timestamps back to UTC. This would probably be needed when cudf supports writing CSV, Parquet, and ORC, as I would expect those writers to assume all timestamps are UTC. Spark would need to convert timestamps from the JVM timezone to UTC before cudf writes them, unless the writers supported a parameter specifying the timestamp timezone reference.

@jlowe added the Spark label Aug 7, 2019
@OlivierNV added the cuIO label Aug 8, 2019
@OlivierNV
Contributor

Timezone conversions are super-messy, so I'd rather limit the impact of timezones (it also tends to be platform-specific since we need to parse timezone files from the OS to build conversion tables from/to UTC).
Compared to the column loader, a standalone timezone conversion should be very fast since it's fully data-parallel (internally, timezoneA->timezoneB would most likely be implemented as tsB = fromUTC(toUTC(tsA, A), B)).
If the goal is an arbitrary target timezone, I think an in-place standalone conversion would be the way to go.
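
A host-side sketch of that composition using C++20 `<chrono>` (an assumption about the shape, not the eventual libcudf implementation); `toUTC`/`fromUTC` map onto `time_zone::to_sys` and `to_local`, and because each element is converted independently the whole operation is fully data-parallel:

```cpp
#include <chrono>

using namespace std::chrono;

// toUTC: interpret a wall-clock instant in zone A as a UTC instant.
// fromUTC: express a UTC instant as wall-clock time in zone B.
// Note: to_sys() throws for ambiguous/nonexistent local times around DST
// transitions, so a real implementation needs a policy for those cases.
sys_seconds toUTC(local_seconds ts, time_zone const* zone_a) {
    return zone_a->to_sys(ts);
}

local_seconds fromUTC(sys_seconds ts, time_zone const* zone_b) {
    return zone_b->to_local(ts);
}

// tsB = fromUTC(toUTC(tsA, A), B): per-element and independent, so a GPU
// version could apply it across a whole timestamp column in one kernel.
local_seconds convert(local_seconds tsA, time_zone const* A, time_zone const* B) {
    return fromUTC(toUTC(tsA, A), B);
}
```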

@mjsamoht removed the Needs Triage label Aug 9, 2019
@jrhemstad
Contributor

Since we are using libcu++ for our timestamp types, we should be using the C++20 timezone APIs that libcu++ will provide. CC @trxcllnt.
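
For reference, the C++20 timezone API being referred to looks like this host-side in `std::chrono`; the expectation in this thread is that libcu++ would eventually expose an equivalent usable from device code (the example zone name is arbitrary):

```cpp
#include <chrono>
#include <iostream>

// zoned_time pairs a UTC instant with a time_zone, giving local wall-clock time.
int main() {
    using namespace std::chrono;
    auto utc_now = system_clock::now();                               // sys_time (UTC-based)
    zoned_time la_time{locate_zone("America/Los_Angeles"), utc_now};
    std::cout << la_time << '\n';                                     // local wall-clock time
}
```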

@trxcllnt
Contributor

trxcllnt commented Dec 5, 2019

@brycelelbach does libcu++ have plans to load the system timezone database? I recall @ogiroux saying that might not be supported yet.

@jrhemstad
Contributor

@brycelelbach promised me they would add it for us :)

@sameerz
Contributor

sameerz commented Nov 12, 2020

Any update on whether libcu++ will load the timezone database?

@brycelelbach
Contributor

Please file an RFE on github.com/NVIDIA/libcudacxx. It's not on our immediate roadmap, and slightly tricky, but we'll get to it eventually, depending on how urgently you need it.

@jrhemstad
Contributor

NVIDIA/cccl#954
