
Default time encoding of nanoseconds is NOT good. #9154

Open
ChrisBarker-NOAA opened this issue Jun 22, 2024 · 16 comments
Labels
bug · needs mcve (https://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports) · topic-CF conventions

Comments

@ChrisBarker-NOAA

What happened?

When you call to_netcdf() on a Dataset without an encoding set for the time variable, you get: "nanoseconds since ..."

This is not good, as nanoseconds aren't supported by a lot of downstream software, e.g. cftime. (I'm not sure whether it's even CF compliant.)
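
(For reference, since a runnable example is requested below: a rough sketch of the setup being described, under the assumption -- confirmed later in this thread -- that the nanosecond default kicks in when the time variable is a chunked/dask array. Filenames and exact output are illustrative and version-dependent.)

```python
import pandas as pd
import xarray as xr

# A datetime64 *data* variable; dimension coordinates are always loaded
# into memory, so chunking a data variable is what exercises the lazy path.
ds = xr.Dataset({"time_var": ("x", pd.date_range("2000-01-01", periods=4))})
ds = ds.chunk({"x": 2})  # chunked, and no "units" encoding is set

ds.to_netcdf("out.nc")
raw = xr.open_dataset("out.nc", decode_times=False)
print(raw["time_var"].attrs["units"])  # e.g. "nanoseconds since 1970-01-01"
```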

I see that this was changed somewhere after #3942

It is an improvement over integer days, that's for sure, but not the best option.

I understand that the internal pandas datetime64 type is ns -- and this preserves any precision that may be there, but:

"practicality beats purity"

And this is not practical. Frankly, I wonder why pandas chose nanoseconds as the default, but for the common xarray (and netCDF) use cases, ns is completely unneeded precision.

I'd suggest integer milliseconds, or, frankly even seconds.

Alternatively, use, say, floating-point hours, which would buy you a lot of precision for small values -- I can't imagine anyone who is using nanoseconds and also cares about years of timespan.

What did you expect to happen?

No response

Minimal Complete Verifiable Example

No response

MVCE confirmation

  • Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • Complete example — the example is self-contained, including all data and the text of any traceback.
  • Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • New issue — a search of GitHub Issues suggests this is not a duplicate.
  • Recent environment — the issue occurs with the latest version of xarray and its dependencies.

Relevant log output

No response

Anything else we need to know?

No response

Environment

@ChrisBarker-NOAA ChrisBarker-NOAA added the bug and needs triage labels Jun 22, 2024

welcome bot commented Jun 22, 2024

Thanks for opening your first issue here at xarray! Be sure to follow the issue template!
If you have an idea for a solution, we would really welcome a Pull Request with proposed changes.
See the Contributing Guide for more.
It may take us a while to respond here, but we really value your contribution. Contributors like you help make xarray better.
Thank you!

@spencerkclark
Member

@ChrisBarker-NOAA could you provide a minimal working example? While there are select circumstances where xarray will do this, I do not think this should be the case in general.

@kmuehlbauer
Contributor

@ChrisBarker-NOAA In general, xarray tries to encode times at the coarsest possible resolution.

So if the wanted resolution is, say, days and the data itself can't be represented like that, then the next resolution that fits is calculated (e.g. hours).

This also works if no resolution is given: xarray tries to calculate the resolution in which all the data can be represented.

There might have been changes in these parts of the code, or other changes in behaviour due to dependency updates. I second @spencerkclark here: an MCVE would help to debug.
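
(A quick way to see that inference in action -- a sketch; the exact units chosen depend on the data and the xarray version:)

```python
import pandas as pd
import xarray as xr

# Day-aligned in-memory times round-trip with coarse units...
xr.Dataset(coords={"time": pd.date_range("2000-01-01", periods=3)}).to_netcdf("a.nc")
print(xr.open_dataset("a.nc", decode_times=False)["time"].attrs["units"])
# e.g. "days since 2000-01-01"

# ...while 6-hourly spacing forces a finer inferred unit.
xr.Dataset(coords={"time": pd.date_range("2000-01-01", periods=3, freq="6h")}).to_netcdf("b.nc")
print(xr.open_dataset("b.nc", decode_times=False)["time"].attrs["units"])
# e.g. "hours since 2000-01-01"
```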

@kmuehlbauer kmuehlbauer added the needs mcve (https://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports) and topic-CF conventions labels and removed the needs triage label Jun 22, 2024
@spencerkclark
Member

To add a little more context, @ChrisBarker-NOAA you are right to point out that my comment (#3942 (comment)) ignores considerations such as downstream (non-xarray or even non-Python) use-cases and (possibly) CF compliance, which I think may have been additional (perhaps more fundamental) reasons that xarray did not support encoding units finer than seconds previously (#4183 (comment)). If I were to write it again I would probably not have been as absolute.

As @kmuehlbauer notes, for NumPy arrays of times we still try to use the coarsest possible resolution that enables exact encoding with int64 values. To me this represents a reasonable compromise. For users whose dates do not need nanosecond precision, times continue to be encoded with coarser units, but for users whose dates do need it, times are encoded with finer units such that they can be exactly round-tripped to / from disk (#4045, #4684).

For chunked arrays—and maybe this is what you are running into—there was a recent change to use "nanoseconds since 1970-01-01" as the default units if no encoding is specified (#8575). The rationale there was that to infer the coarsest possible units that would enable an exact round trip would otherwise require a full compute (enabling lazy encoding was the primary premise of that PR). Even though this is an advanced use-case, I understand that this could be surprising and/or inconvenient depending on the application. Does the time data you are writing to netCDF happen to be chunked?

@ChrisBarker-NOAA
Author

ChrisBarker-NOAA commented Jun 23, 2024

@spencerkclark: yup, that's probably it.

We've been bitten by this a fair bit lately -- always(?) when loading a Dataset from some kind of remote dataset: OPeNDAP, Zarr (via kerchunked data on AMS), or MFDataset, etc.

Then the time encoding gets lost or a new time array is created for one reason or another, and we end up with the nanosecond-based-encoding.

So yes, chunked.

Another issue is that it's all too common for folks to use a float datatype in the time variable, which means you get essentially arbitrary precision. (Would you even get nanoseconds? Not sure -- I haven't done the math; see the quick check below.)
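
(The math, for what it's worth -- a quick check assuming float64 seconds since 1970:)

```python
import numpy as np

# float64 carries a 53-bit significand, so around "now" (~1.7e9 seconds
# since 1970) adjacent representable values are ~0.24 microseconds apart:
# plenty for most uses, but nowhere near nanosecond resolution.
print(np.spacing(1.7e9))  # ~2.4e-07 seconds
```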

So what to do?

  1. Looking at past issues, I see folks did want milliseconds, but has anyone asked for nanoseconds? (I know -- if it works, folks could be using them without us knowing it.) I do think that defaulting to seconds was probably too coarse. So we could simply change the default to milliseconds, and I think that would solve the problem at hand.

  2. Why do we get datetime64[ns] by default when decoding a netCDF file? I'm guessing that's inherited from pandas, and maybe isn't easy to change, but if it could be changed, we could use [us] or even [s] (or some kind of smart decoding) instead -- and then we're good to go.

-CHB

@ChrisBarker-NOAA
Author

Hmm -- if this is right (from a random unrelated issue on some other project):

"Since 2.0.0 [pandas] actually supports timestamp precisions other than nanoseconds"

Looks like it:

https://pandas.pydata.org/docs/whatsnew/v2.0.0.html#construction-with-datetime64-or-timedelta64-dtype-with-unsupported-resolution

So maybe that's the solution :-)
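
(Indeed -- a minimal check, assuming pandas >= 2.0:)

```python
import numpy as np
import pandas as pd

# pandas >= 2.0 preserves non-nanosecond datetime64 resolutions,
# where earlier versions coerced everything to datetime64[ns]:
s = pd.Series(np.array(["2000-01-01"], dtype="datetime64[s]"))
print(s.dtype)  # datetime64[s]
```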

@kmuehlbauer
Contributor

@ChrisBarker-NOAA One reason to use datetime64[ns] internally is that we can keep the int64 representation in memory (and on disk) even in the presence of NaT: np.datetime64('NaT') and the netCDF default fill value for int64 are the same -> -9223372036854775808 (see netcdf.h).

In the case of an encoded _FillValue/missing_value we can use int64 throughout, instead of having to convert to float (to represent the NaN values). The latter has led to a bunch of issues in the past, where ns precision was lost. See #7817 and #7872.
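
(Easy to verify -- NaT's underlying int64 representation is INT64_MIN, the value cited above:)

```python
import numpy as np

print(np.datetime64("NaT", "ns").astype("int64"))  # -9223372036854775808
print(np.iinfo(np.int64).min)                      # -9223372036854775808
```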

Please also see #7493 where the new pandas non-nanosecond datetimes are discussed.

I did not look deeper into this new behaviour, but it should be possible to make xarray aware of non-nanosecond datetime resolution. I fear this will be a major undertaking, though.

@ChrisBarker-NOAA
Author

One reason to use datetime64[ns] internally is that we can keep the int64 representation in memory (and on disk) even in the presence of NaT: np.datetime64('NaT') and the netCDF default fill value for int64 are the same -> -9223372036854775808

Which is a good reason to use datetime64 (and 64-bit integers in general) -- but numpy's datetime64 lets you choose which "unit" you want, so datetime64[ms] or datetime64[s] satisfies these same criteria.
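
(And indeed, the NaT sentinel is the same int64 value at every resolution -- a quick check:)

```python
import numpy as np

for unit in ("s", "ms", "us", "ns"):
    print(unit, np.datetime64("NaT", unit).astype("int64"))
# every resolution prints -9223372036854775808
```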

The latter has lead to a bunch of issues in the past, where ns precision was lost. See #7817 and #7872.

Very good reasons not to convert to float -- so yes, but I'm not suggesting that.

Frankly, numpy's datetime64 was not all that thoroughly reviewed before it was implemented, and has been trouble ever since :-( -- but that's water under the bridge.

As for nanoseconds: I suspect that was chosen by pandas originally, because "why not" -- i.e. lots of extra bits to use.

But whether xarray continues to use ns internally is another question -- has ANYONE ever asked for ns?

But a breaking change is a breaking change, which is not good.

I did not look deeper into this new behaviour, but it should be possible to make xarray aware of non-nanosecond datetime resolution. I fear this will be a major undertaking, though.

An easier option would probably be to go to datetime64[ms] instead, though somewhere out there is someone actually using nanoseconds who might complain :-(

But it may not be THAT major an undertaking -- grepping the code, there are indeed a lot of references to datetime64[ns], but when you remove the tests and docstrings, not all that many.

So: maybe a lot to change, but mostly boilerplate :-(

The first question is -- would this be a welcome change? If so, maybe I can take a look ...

Option 1) Change [ns] to [ms] as the static xarray datetime.

  • That would be easier, solve the problem at hand, and probably work for 99.9% of users.
    But:
    a) are there any users out there who actually NEED nanoseconds?
    b) it would break (or at least confuse) a lot of demos, docs, etc.

So:

Option 2) is probably the way to go -- make the resolution settable -- and then we can decide where to change the default -- e.g. I would suggest [s], or maybe [ms], as the default for netCDF decoding.

How good do we think the test coverage is?

Finally, option 3) -- just change the default behavior for encoding netCDF (or other similar formats, e.g. Zarr) -- essentially, any time you are doing a "units since datetime" encoding, use milliseconds.

That would be relatively easy, and I suspect would break very little -- and anyone who does need nanoseconds could specify that.

@spencerkclark
Member

spencerkclark commented Jun 26, 2024

@spencerkclark: yup, that's probably it.

Thanks @ChrisBarker-NOAA, that's good to know. For this issue, my first thought was something along the lines of your option (3) for chunked arrays. The idea would be to switch to using default units of "seconds since 1970-01-01" or "milliseconds since 1970-01-01" in that context and continue using a default dtype of int64. If finer precision were required to encode the times for a faithful round trip, we would raise an error, recommending explicitly specifying the units encoding. While this would lead to a slightly worse user experience for those who might want / need nanosecond encoding units in the chunked array context (see footnotes 1 and 2), it would be a better user experience for those who do not, at least in the context of downstream uses.

Indeed as @kmuehlbauer notes, we are aware of the new flexibility in pandas. I agree that it would be a nontrivial amount of work to enable that flexibility in xarray. There probably is a lot of boilerplate, but I think there are a few things we'll need to think carefully about too. It is likely best to separate that from this issue.

Footnotes

  1. We have had some users ask for nanosecond-units decoding before (#4183), and it is important for faithfully roundtripping some times (#4045 (comment)). It is admittedly rare, however.

  2. pandas-dev/pandas#7307 (comment) provides some historical context for why nanosecond precision was initially chosen in pandas.

@max-sixty
Collaborator

Should we keep this open or move the pandas impl to a new issue?

@ChrisBarker-NOAA
Author

I'd like to keep it open:

There are two (maybe separate) issues here:

  1. should xarray support pandas' new ability to specify the time unit?
  2. what should be the default "CF style" encoding?

Granted, if (1) is done, then (2) could be whatever the user specified (though there would still be a default to decide)...

But it will take a while to get (1) done -- so I still advocate for changing the default for the "CF style" time encoding.

@spencerkclark: thanks for the link to the pandas discussion -- very enlightening. A key take-away for me:

"realize that pandas's plurality use case (if we were truly omniscient) nowadays is in the analysis of data generated by business processes occurring since the advent of computers."

That was true then, and may still be true now for pandas -- but it's likely not true for xarray, and in particular not true for folks using the CF-style encoding. Those users would rather have a cftime-compliant encoding, and they don't need nanoseconds. (We know they don't, because it's not actually supported anyway!)

So I still think that xarray should change its default CF-style encoding to microseconds sooner rather than later, regardless of how (1) proceeds.

@spencerkclark
Member

spencerkclark commented Aug 29, 2024

For sure, @ChrisBarker-NOAA, I am in agreement that we should switch away from specifying nanoseconds units by default for datetime64[ns] data stored in chunked arrays. The one thing I want to make sure of when we make that switch is that we never silently truncate precision when writing chunked data to disk (i.e. we at least warn or raise an error), which is not currently the case, e.g. #9134 (comment). Nanoseconds at least have the virtue that we are never at risk of that.

Note that in the meantime xarray uses microseconds by default for chunked arrays of cftime.datetime objects, so if you open data with use_cftime=True (and the time encoding gets lost) this will already be the case regardless of the calendar of the data.

@ChrisBarker-NOAA
Author

I'm a bit confused about dask vs. chunked vs. netCDF or ???

I would assume that you can use dask arrays themselves with the native numpy datetime64? Yes?

Anyway, my use case is saving an xarray Dataset out to netCDF4 when a time variable does not have units defined -- it has been saving out as (nanoseconds since ...), which then breaks code that tries to read the resulting file (e.g. the cftime Python library).

But:

```python
if dates.dtype == "O":
    units = "microseconds since 1970-01-01"
    dtype = np.dtype("int64")
```

seems to indicate that it would use microseconds -- did that code change recently?

I'll go test now ....

The one thing I want to make sure of when we make that switch is that we never silently truncate precision when writing chunked data to disk (i.e. we at least warn or raise an error)

Agreed -- silent data loss is bad. However, that would require actually checking all the values -- which may not be available, yes?

Anyway, if it can be checked, then Error or Warning?

Warning seems right to me, though people do tend to ignore them, particularly if it's running on some service, rather than interactively.

But Error is a problem, because code that has worked fine with all the test data, etc., might then raise when passed some new data (and if time was converted (or computed) from, e.g., floats, it might have spurious sub-microseconds in it).

So: if there's going to be an Error, then it would be good to have a flag that you can set to turn it off:

allow_truncation_to_microseconds

or some such -- not sure where that flag would go though.

Thanks,
-CHB

@ChrisBarker-NOAA
Author

OK -- did some more experimenting, and I think I now get why this is about chunked data: if not chunked, xarray examines the values and chooses a reasonable encoding -- nice! Thanks to whoever wrote that code!

But I presume that with chunked arrays you can't examine all the values at once, so you can't follow that precedent.

if you open data with use_cftime=True (and the time encoding gets lost) this will already be the case regardless of the calendar of the data.

I'm trying to remember where we had this problem, because that was not our experience. I'll have to poke about some more, but I think it may have been:

  • wonky, non cf-compliant time in a file
    • in fact, part of the goal of this workflow was to fix the time :-)
    • because of this, we had to open it without use_cftime=True
  • So when written out, the nanoseconds default was used.

My proposal:

Define a flag (somewhere -- not sure where) along the lines of:
truncate_to_microseconds

When True, then microseconds would be the minimum unit used.

When False, use current behavior.

I would prefer True be the default, but if we do think we'll break folks' code, then False is OK. At least we'll have an easy fix when this comes up.

@spencerkclark
Member

Agreed -- silent data loss is bad. However, that would require actually checking all the values -- which may not be available, yes?

While we indeed cannot infer the appropriate units to choose without examining the entire variable, we should be able to determine whether the a priori chosen units allow for exact encoding of each chunk independently and warn or raise accordingly within each subtask.
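
(A rough sketch of that per-chunk check -- hypothetical helper, not xarray's actual implementation:)

```python
import numpy as np

def encode_chunk(values_ns: np.ndarray, factor: int) -> np.ndarray:
    """Encode one chunk of int64 nanosecond times into coarser units
    (e.g. factor=10**6 for milliseconds), raising if the a priori
    chosen units cannot represent this chunk exactly."""
    encoded, remainder = np.divmod(values_ns, factor)
    if np.any(remainder != 0):
        raise ValueError("chosen time units would silently truncate precision")
    return encoded

# e.g. applied per chunk: dask_array.map_blocks(encode_chunk, factor=10**6)
```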

I'm trying to remember where we had this problem, because that was not our experience. I'll have to poke about some more, but I think it may have been:

  • wonky, non cf-compliant time in a file

    • in fact, part of the goal of this workflow was to fix the time :-)
    • because of this, we had to open it without use_cftime=True
  • So when written out, the nanoseconds default was used.

I see—with this workflow I believe it would still be possible to ensure cftime.datetime objects are used—I'm assuming you are opening the data with xr.open_dataset(..., decode_times=False), fixing the time encoding units, and then running xr.decode_cf on the Dataset? In this case use_cftime=True would be passed to xr.decode_cf instead of xr.open_dataset.
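
(Something like this, in sketch form -- "wonky_times.nc" and the fixed units string are hypothetical:)

```python
import xarray as xr

ds = xr.open_dataset("wonky_times.nc", decode_times=False)
ds["time"].attrs["units"] = "seconds since 1980-01-01"  # the repair step
ds = xr.decode_cf(ds, use_cftime=True)  # times decode to cftime.datetime
```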

The other workaround of course is to load / compute all the time-like variables before saving to disk—it is not that common that time arrays in climate datasets require a lot of memory or are computationally expensive to produce, so this generally is not so bad (for many years xarray happened to do this automatically—i.e. it did not lazily encode chunked time-like arrays).

Up to you though whether these implicit workarounds are more intuitive than the explicit workaround of manually specifying the time encoding units, which is generally what I do if I know a downstream application is sensitive to the choice, even if the times are in memory :).
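
(For reference, the explicit workaround looks like this -- variable name and units illustrative:)

```python
# Pin the on-disk time units (and dtype) at write time, so the
# chunked-array default never comes into play:
ds.to_netcdf(
    "out.nc",
    encoding={"time": {"units": "seconds since 1970-01-01", "dtype": "int64"}},
)
```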

@ChrisBarker-NOAA
Author

ChrisBarker-NOAA commented Sep 3, 2024

we should be able to determine whether the a priori chosen units allow for exact encoding of each chunk independently and warn or raise accordingly within each subtask.

OK -- that should be a warning rather than an error -- it would be very bad for a workflow that has worked for years to suddenly fail due to one wonky value in a dataset :-)

Unless there's an "allow truncation to milliseconds" flag that's True by default.

fixing the time encoding units, and then running xr.decode_cf on the Dataset? In this case use_cftime=True would be passed to xr.decode_cf instead of xr.open_dataset.

Well, no -- we never did an xr.decode_cf -- and probably couldn't: what we had at that point was a time series, so no CF to decode.

Maybe there's a way to pass use_cftime=True when I save to netCDF?

But anyway, we probably could find a better workflow (though setting the encoding isn't too bad, once we know we need to do it) -- but I'd still like to see better defaults.

What this conversation has revealed to me is why this hasn't bitten a lot of people much of the time :-)

Maybe this?

Have a flag: allow_truncation_to_milliseconds (or maybe simply allow_truncation?)

If True by default:

  • if True, then all is good, nothing to be done (or maybe raise a Warning?)
  • if False, then raise a ValueError if any truncation takes place.

If False by default:

  • if True, then all is good, nothing to be done.
  • if False, then raise a warning if any truncation takes place.

We could get a bit more complicated and have it be None by default (sketched in code below), and then:

  • if None, raise a warning if any truncation takes place.
  • if True, then all is good, nothing to be done.
  • if False, then raise a ValueError if any truncation takes place.

The goals here are:

  1. Get milliseconds by default.
  2. If a user doesn't know about the flag (or it's old code) and therefore doesn't set it, previously functional code won't error out due to the values in the time series.
  3. A user who hasn't specifically opted in to the truncation will get at least a warning.
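
(The None-by-default variant, sketched -- hypothetical flag, not an existing xarray option:)

```python
import warnings

def handle_truncation(allow_truncation):
    # None: user never opted in or out -> warn but keep working
    if allow_truncation is None:
        warnings.warn("time values were truncated to milliseconds")
    # False: user explicitly forbids truncation -> hard error
    elif allow_truncation is False:
        raise ValueError("time values would be truncated to milliseconds")
    # True: user opted in -> truncate silently
```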
