Time dtype encoding defaulting to int64 when writing netcdf or zarr #3942

Open
rafa-guedes opened this issue Apr 6, 2020 · 8 comments
Labels: topic-error reporting, topic-zarr


rafa-guedes commented Apr 6, 2020

Time dtype encoding defaults to "int64" for datasets with only zero-hour times when writing to netCDF or Zarr.

This constrains the dataset's time precision to whatever the time units allow (daily precision in the example below, since the units are inferred as 'days since ...'). If we create a Zarr dataset with this default encoding and subsequently append some non-zero times onto it, we lose the hour/minute/second information from the appended values.

MCVE Code Sample

In [1]: import datetime
   ...: import xarray as xr
   ...:
   ...: ds = xr.DataArray(
   ...:     data=[0.5],
   ...:     coords={"time": [datetime.datetime(2012, 1, 1)]},
   ...:     dims=("time",),
   ...:     name="x",
   ...: ).to_dataset()

In [2]: ds                                                                                                                                                            
Out[2]: 
<xarray.Dataset>
Dimensions:  (time: 1)
Coordinates:
  * time     (time) datetime64[ns] 2012-01-01
Data variables:
    x        (time) float64 0.5

In [3]: ds.to_zarr("/tmp/x.zarr")

In [4]: ds1 = xr.open_zarr("/tmp/x.zarr")

In [5]: ds1.time.encoding                                                                                                                                             
Out[5]: 
{'chunks': (1,),
 'compressor': Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0),
 'filters': None,
 'units': 'days since 2012-01-01 00:00:00',
 'calendar': 'proleptic_gregorian',
 'dtype': dtype('int64')}

In [6]: dsnew = xr.DataArray(
   ...:     data=[1.5],
   ...:     coords={"time": [datetime.datetime(2012, 1, 1, 3, 0, 0)]},
   ...:     dims=("time",),
   ...:     name="x",
   ...: ).to_dataset()

In [7]: dsnew.to_zarr("/tmp/x.zarr", append_dim="time")                                                                                                               

In [8]: ds1 = xr.open_zarr("/tmp/x.zarr")                                                                                                                             

In [9]: ds1.time.values                                                                                                                                               
Out[9]: 
array(['2012-01-01T00:00:00.000000000', '2012-01-01T00:00:00.000000000'],
      dtype='datetime64[ns]')

Expected Output

In [9]: ds1.time.values                                                                                                                                               
Out[9]: 
array(['2012-01-01T00:00:00.000000000', '2012-01-01T03:00:00.000000000'],
      dtype='datetime64[ns]')

Problem Description

Perhaps it would be useful to default the time dtype to "float64". Another option could be to use a finer time resolution by default than the one xarray automatically infers from the dataset's times (for instance, if the units would be inferred as "days since ...", use "seconds since ..." instead).
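
To illustrate the first suggestion, here is a minimal sketch reusing ds from the MCVE above (the store path is illustrative). With a float64 on-disk dtype, sub-daily times remain representable as fractional values even under 'days since ...' units:

# Force a floating-point on-disk dtype for time, so that e.g.
# 2012-01-01T03:00 can still be stored as 0.125 days.
ds.to_zarr("/tmp/x_f64.zarr", encoding={"time": {"dtype": "float64"}})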


Versions

In [10]: xr.show_versions()                                                                                                                                            

INSTALLED VERSIONS
------------------
commit: None
python: 3.7.5 (default, Nov 20 2019, 09:21:52) 
[GCC 9.2.1 20191008]
python-bits: 64
OS: Linux
OS-release: 5.3.0-45-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_NZ.UTF-8
LOCALE: en_NZ.UTF-8
libhdf5: 1.10.4
libnetcdf: 4.6.3

xarray: 0.15.0
pandas: 1.0.1
numpy: 1.18.1
scipy: 1.4.1
netCDF4: 1.5.3
pydap: None
h5netcdf: 0.8.0
h5py: 2.10.0
Nio: None
zarr: 2.4.0
cftime: 1.1.0
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.1.3
cfgrib: None
iris: None
bottleneck: None
dask: 2.14.0
distributed: 2.12.0
matplotlib: 3.2.0
cartopy: 0.17.0
seaborn: None
numbagg: None
setuptools: 45.3.0
pip: 20.0.2
conda: None
pytest: 5.3.5
IPython: 7.13.0
sphinx: None

dcherian commented Apr 7, 2020

I have run into this problem before.

The initial choice to use int64 and 'days since ...' is perfectly valid. However, I think xarray should raise an error when trying to append times that cannot be represented by the on-disk encoding.

Note that you can always specify an encoding to make sure that you can append properly.
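
For example, a minimal sketch based on the MCVE in this issue (the seconds-based unit string is just one reasonable choice):

# Write the store with an explicit sub-daily time encoding, so that
# later appends with non-zero hours can be represented on disk.
ds.to_zarr(
    "/tmp/x.zarr",
    mode="w",
    encoding={"time": {"units": "seconds since 2012-01-01", "dtype": "int64"}},
)

# The 03:00 timestamp in dsnew now survives the append, since the
# on-disk unit is seconds rather than days.
dsnew.to_zarr("/tmp/x.zarr", append_dim="time")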

dcherian added the topic-error reporting and topic-zarr labels on Apr 7, 2020
rabernat commented Apr 7, 2020

I agree with Deepak. Xarray intelligently chooses its encoding when it writes the initial dataset, to make sure it has enough precision to resolve all times. It cannot magically know that, in the future, you plan to append data which requires greater precision. The way around this is to specify the encoding explicitly up front, as Deepak noted above.

I also agree that we should definitely be raising a warning (or even an error) in your situation.

rafa-guedes commented Apr 7, 2020

Yep, I managed to overcome this by manually setting encoding parameters. I am just wondering: would there be any downside to preferring float64 over int64 when these defaults are defined automatically? That seems to fix this issue. I guess it could result in some other precision losses due to floating-point errors, but these should be small.

JackKelly commented Nov 10, 2021

I think I've bumped into a symptom of this issue (mine is described in #5969), and #3379 may be another symptom of it.

Perhaps I'm biased (because I work with timeseries which only span a few years) but I wonder if xarray should default to encoding time as 'units': 'nanoseconds since 1970-01-01' (to be consistent with np.datetime64[ns]) unless the timeseries includes dates before the year 1677, or after the year 2262 🙂? Would that work?

If that's no good, then let's definitely add a note to the documentation to say that it might be a good idea for users to manually specify the encoding for datetimes if they wish to append to Zarrs.

@dcherian
> let's definitely add a note to the documentation to say that it might be a good idea for users to manually specify the encoding for datetimes if they wish to append to Zarrs.

👍

> However I think xarray should raise an error when trying to append times that cannot be represented by the on-disk encoding.

Adding this error message would make it obvious that this is happening. PRs are very welcome!

JackKelly commented Nov 10, 2021

Cool, I agree that an error and a documentation change is likely to be sufficient 🙂 (and I'd be keen to write a PR to help out!)

But before we commit to that path, may I ask: why not have xarray default to encoding time as 'units': 'nanoseconds since 1970-01-01', to be consistent with np.datetime64[ns]? Sorry if I've missed something obvious!

@dcherian
> why not have xarray default to encoding time as 'units': 'nanoseconds since 1970-01-01', to be consistent with np.datetime64[ns]?

It chooses the coarsest unit that still exactly represents the data, which has the benefit of allowing the maximum possible time range given the data's frequency:

for time_unit in time_units:
    if np.all(timedeltas % unit_timedelta(time_unit) == zero_timedelta):
        return time_unit

I'm not sure if this is why it was originally chosen; but that is one advantage. Perhaps @spencerkclark has some insight here.
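
For reference, here is a self-contained sketch of that inference step (simplified from xarray's internals; infer_time_units is an illustrative name, not the actual API):

import numpy as np

def infer_time_units(times):
    """Return the coarsest unit that evenly divides all time deltas."""
    times = np.asarray(times)
    deltas = times - times.min()
    zero = np.timedelta64(0, "ns")
    # Try coarse units first; fall through to finer ones.
    for name, code in [("days", "D"), ("hours", "h"),
                       ("minutes", "m"), ("seconds", "s")]:
        if np.all(deltas % np.timedelta64(1, code) == zero):
            return name
    return "nanoseconds"

times = np.array(["2012-01-01", "2012-01-02"], dtype="datetime64[ns]")
print(infer_time_units(times))  # -> "days", hence the 'days since ...' default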

spencerkclark commented Nov 11, 2021

This logic has been around in xarray for a long time (I think it dates back to #12!), so it predates me. If I had to guess though, it would have to do with the fact that back then, a form of cftime.date2num was used to encode all times, even those that started as np.datetime64 values. I think that's significant for two reasons:

  1. In the old days, date2num would only return floating-point values, even if the times could in principle be encoded with integers. For accuracy reasons it was therefore best to keep the encoded values as small as possible, to avoid roundoff error.
  2. Even if (1) had not been the case back then, date2num did not -- and still does not -- support nanosecond units, because it relies on microsecond-precision datetimes.

This of course is not true anymore. We no longer use date2num to encode np.datetime64 values, and we no longer encode dates with floating point values by default (#4045); we use integers for optimal round-tripping accuracy, and are capable of encoding dates with nanosecond units.

To be honest, currently it seems the only remaining advantage of choosing a larger time encoding unit and a proximate reference date is that it makes the raw encoded values a little more human-readable. However, encoding dates with units of "nanoseconds since 1970-01-01" is objectively optimal for np.datetime64[ns] values, as it guarantees the maximum range of possible encoded times and maximum round-trip accuracy, so it could be worth revisiting our approach in light of the fact that the current default makes appending somewhat dangerous.
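
In practice one can already opt in to that encoding explicitly today; a sketch reusing the MCVE's datasets (the store path is illustrative):

ds.to_zarr(
    "/tmp/x_ns.zarr",
    encoding={"time": {"units": "nanoseconds since 1970-01-01", "dtype": "int64"}},
)
# Appends at any offset now round-trip exactly, because the on-disk
# unit matches np.datetime64[ns] resolution.
dsnew.to_zarr("/tmp/x_ns.zarr", append_dim="time")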
