Original variable encodings are retained #471

Conversation

@derekocallaghan (Contributor):

$ pytest tests
======================================================================================================================== test session starts =========================================================================================================================
platform linux -- Python 3.9.13, pytest-7.2.0, pluggy-1.0.0
rootdir: .../pangeo-forge-recipes, configfile: setup.cfg
plugins: anyio-3.6.2, lazy-fixture-0.6.3
collected 318 items / 1 skipped                                                                                                                                                                                                                                      

tests/test_aggregation.py .....                                                                                                                                                                                                                                [  1%]
tests/test_chunk_grid.py ........                                                                                                                                                                                                                              [  4%]
tests/test_combiners.py ..                                                                                                                                                                                                                                     [  4%]
tests/test_end_to_end.py ...                                                                                                                                                                                                                                   [  5%]
tests/test_transforms.py ................                                                                                                                                                                                                                      [ 10%]
tests/test_combiners.py ..                                                                                                                                                                                                                                     [ 11%]
tests/test_end_to_end.py ...                                                                                                                                                                                                                                   [ 12%]
tests/test_transforms.py ................                                                                                                                                                                                                                      [ 17%]
tests/test_combiners.py ..                                                                                                                                                                                                                                     [ 17%]
tests/test_end_to_end.py ...                                                                                                                                                                                                                                   [ 18%]
tests/test_transforms.py ................                                                                                                                                                                                                                      [ 23%]
tests/test_combiners.py ......                                                                                                                                                                                                                                 [ 25%]
tests/test_locking.py ......                                                                                                                                                                                                                                   [ 27%]
tests/test_openers.py ........................                                                                                                                                                                                                                 [ 35%]
tests/test_transforms.py ................                                                                                                                                                                                                                      [ 40%]
tests/test_openers.py ..........................................................................                                                                                                                                                               [ 63%]
tests/test_patterns.py ........ss.............ss..............                                                                                                                                                                                                 [ 75%]
tests/test_pipelines.py ........                                                                                                                                                                                                                               [ 78%]
tests/test_rechunking.py .....................................                                                                                                                                                                                                 [ 89%]
tests/test_storage.py .......                                                                                                                                                                                                                                  [ 92%]
tests/test_transforms.py ......................                                                                                                                                                                                                                [ 99%]
tests/test_utils.py ..                                                                                                                                                                                                                                         [ 99%]
tests/test_writers.py .                                                                                                                                                                                                                                        [100%]

========================================================================================================================== warnings summary ==========================================================================================================================

<EXCLUDED WARNINGS>

====================================================================================================== 314 passed, 5 skipped, 28 warnings in 190.78s (0:03:10) =======================================================================================================

@derekocallaghan changed the title from "Original variable encodings are retained (#465)" to "Original variable encodings are retained" on Jan 12, 2023
@rabernat self-requested a review on January 16, 2023 19:51
@rabernat (Contributor) left a comment:

This looks great. A few thoughts below.

d = ds.to_dict(data=False)
# Remove redundant encoding options
for v in ds.variables:
    for option in ["_FillValue", "source"]:
@rabernat (Contributor):

Could you explain the rationale for special casing these two options?

@derekocallaghan (Contributor, Author):

I excluded these as they were causing certain test failures where expected schemas were compared with actual ones, e.g. any combiner tests using has_correct_schema():

https://github.com/pangeo-forge/pangeo-forge-recipes/blob/beam-refactor/tests/test_combiners.py#L98-L102

def has_correct_schema(expected_schema):
    def _check_results(actual):
        assert len(actual) == 1
        schema = actual[0]
        assert schema == expected_schema

The source will be unique to each original source data product, and the _FillValue appeared to be added automatically (I can't recall the specific issue with the latter though).

@derekocallaghan (Contributor, Author):

I checked the latter again, and when _FillValue is retained, it's being set to nan in expected_schema (as generated by the original ds.to_dict(data=False, encoding=True)), but only for the lat and lon coords. However, the actual schema doesn't contain _FillValue for lat/lon, and the assert fails.

        # TODO: should be okay to remove _FillValue?
        if option in ds[v].encoding:
            del ds[v].encoding[option]
d = ds.to_dict(data=False, encoding=True)
@rabernat (Contributor):

I think that when I first started working on this, this option didn't even exist yet! See pydata/xarray#6634

Nice when things come together. 😄

Comment on lines +173 to +174
# Can combine encoding using the same approach as attrs
encoding = _combine_attrs(v1[vname]["encoding"], v2[vname]["encoding"])
@rabernat (Contributor):

Brilliant!

Comment on lines 217 to 218
# TODO: previous comment regarding encoding should no longer
# be relevant now that variable encoding will be used if available
@rabernat (Contributor):

Delete the irrelevant comment please. 🙏

@derekocallaghan (Contributor, Author):

Will do

Comment on lines 34 to 35
# Confirm original time units have been preserved
assert ds.time.encoding["units"] == dst.time.encoding["units"]
@rabernat (Contributor):

Would this test fail without the changes in aggregation.py?

@derekocallaghan (Contributor, Author), Jan 17, 2023:

I checked and it would fail for the wrong reason (KeyError) in that situation, I've fixed it and will commit (also in test_writers)

Comment on lines 143 to 144
# Zarr retains the original "days since %Y:%m%d" and removes " %H:%M:%S"
assert " ".join(ds.time.encoding["units"].split(" ")[0:-1]) == ds_target.time.encoding["units"]
@rabernat (Contributor):

Zarr has nothing to do with it. Xarray and cftime are managing this logic. I think it drops the time if the time is 0:0:0.

Wouldn't it be easier to fix this at the source (i.e. strip the time in line 39 of data_generation.py)? That way you can just do

assert ds.time.encoding == ds_target.time.encoding

@derekocallaghan (Contributor, Author):

I originally stripped it in data_generation.py as you suggest, but I was sure that this was causing failures only in test_writer.py. I've restored this and all tests are passing, so I'm not sure what went wrong previously. Commit is on the way.

@rabernat (Contributor) commented Jan 17, 2023:

I've been thinking through this more, and I'm worried that there is an important edge case that is not covered by this. Currently, if encoding['units'] is not the same across all of the datasets, it will simply be dropped, following the logic in _combine_attrs:

def _combine_attrs(a1: dict, a2: dict) -> dict:
    if not a1:
        return a2
    # for now, only keep attrs that are the same in both datasets
    common_attrs = set(a1) & set(a2)
    new_attrs = {}
    for key in common_attrs:
        if a1[key] == a2[key]:
            new_attrs[key] = a1[key]
    return new_attrs

So when pangeo forge goes to create the target dataset, it will not have any information about the time resolution and needed time encoding, and we will be back in the situation described by this comment:

# we pick zeros as the safest value to initialize empty data with
# will only be used for dimension coordinates
# WARNING: there are lots of edge cases around time!
# Xarray will pick a time encoding for the dataset (e.g. "days since 1970-01-01")
# and this may not be compatible with the actual values in the time coordinate
# (which we don't know yet)
data = dsa.zeros(shape=shape, chunks=chunks, dtype=dtype)

This PR and the use case that motivated it assume that the input files will all have the same encoding. But what if they have different encoding?

For example, we could have input files representing hourly data with time encoded like this

0 hours since 2000-01-01 00:00:00
0 hours since 2000-01-01 01:00:00
0 hours since 2000-01-01 02:00:00

(yes I have seen this in real datasets)

since the time encoding is not uniform, units would just be dropped. And Xarray would probably pick days since ... as the encoding when initializing the target dataset. This would probably screw up the time coordinate.

However, xarray would handle the data fine in an open_mfdataset situation, because it would first decode each time to an actual datetime type. Upon saving, it would determine the encoding using this function.

If we stick to our current approach of not reading and combining the actual coordinates, we will have to recreate some of this logic within aggregate.py. Specifically, we will want to determine the minimum frequency (days, hours, s, etc.) and reference date for the time encoding by intelligently combining the encoding of each dataset.
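
As a rough sketch of what that combination logic might look like (a hypothetical helper, not part of this PR; it assumes CF-style `"<resolution> since <reference>"` units strings with ISO-ordered reference dates):

```python
# Hypothetical sketch: choose a target time encoding by taking the
# finest resolution and the earliest reference date across inputs.
_RESOLUTION_ORDER = ["days", "hours", "minutes", "seconds"]  # coarse -> fine

def combine_time_units(units_list):
    parsed = []
    for u in units_list:
        res, _, ref = u.partition(" since ")
        parsed.append((res.strip(), ref.strip()))
    # the finest resolution is the one furthest along _RESOLUTION_ORDER
    finest = max((res for res, _ in parsed), key=_RESOLUTION_ORDER.index)
    # ISO-style reference strings compare correctly as plain strings
    earliest = min(ref for _, ref in parsed)
    return f"{finest} since {earliest}"
```

For the hourly example above, `combine_time_units(["hours since 2000-01-01 00:00:00", "hours since 2000-01-01 01:00:00"])` yields `"hours since 2000-01-01 00:00:00"`, which is a safe target encoding for both inputs.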

That could be a heavy lift. So perhaps, for the interim, we would just want to raise an error if the time encoding is different for different datasets?
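
The interim check could be as simple as the following sketch (hypothetical helper name and signature, operating on the per-dataset encoding dicts):

```python
# Hypothetical interim guard: fail loudly if the time encodings of the
# input datasets disagree, rather than silently dropping "units".
def check_consistent_time_encoding(encodings: list) -> None:
    units = {e.get("units") for e in encodings}
    if len(units) > 1:
        raise ValueError(
            f"Inconsistent time encodings across inputs: {sorted(map(str, units))}; "
            "cannot safely determine a target time encoding."
        )
```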

More test cases would be useful for exposing edge cases.

@derekocallaghan (Contributor, Author) commented Jan 17, 2023:

So when pangeo forge goes to create the target dataset, it will not have any information about the time resolution and needed time encoding, and we will be back in the situation described by this comment:

# we pick zeros as the safest value to initialize empty data with
# will only be used for dimension coordinates
# WARNING: there are lots of edge cases around time!
# Xarray will pick a time encoding for the dataset (e.g. "days since 1970-01-01")
# and this may not be compatible with the actual values in the time coordinate
# (which we don't know yet)
data = dsa.zeros(shape=shape, chunks=chunks, dtype=dtype)

Based on this, I should restore some of the original warning comment text.

This PR and the use case that motivated it assume that the input files will all have the same encoding. But what if they have different encoding?

Yep, I believe the CCMP files all have the same encoding (units).

However, xarray would handle the data fine in an open_mfdataset situation, because it would first decode each time to an actual datetime type. Upon saving, it would determine the encoding using this function.

When I originally created the Zarr store before the Pangeo Forge recipe, I used open_mfdataset(), which worked and generated a suitable time encoding. (I was thinking of suggesting this just before I read your comment here).

If we stick to our current approach of not reading and combining the actual coordinates, we will have to recreate some of this logic within aggregate.py. Specifically, we will want to determine the minimum frequency (days, hours, s, etc.) and reference date for the time encoding by intelligently combining the encoding of each dataset.

I think replicating or reusing the functionality in xarray.coding.times.infer_datetime_units() may be something worth looking at, possibly as a custom transform that precedes DetermineSchema in StoreToZarr.expand(), or instead in OpenWithinXarray.expand(). I need to do something not too far removed for the ASCAT recipes, where I'm hoping to use a custom transform to reorder the collection based on a modified time, prior to StoreToZarr. I'm not sure yet whether this is possible, but I can look into it.

That could be a heavy lift. So perhaps, for the interim, we would just want to raise an error if the time encoding is different for different datasets?

That can definitely be done.

@rabernat (Contributor):

Actually, a nice middle ground would be to make it really easy to specify the desired time encoding. Like, if I know a priori that my data are daily resolution, I could just put the time encoding directly in the recipe. I'm wondering what would be the right way to specify this... 🤔

@derekocallaghan (Contributor, Author):

Actually, a nice middle ground would be to make it really easy to specify the desired time encoding. Like, if I know a priori that my data are daily resolution, I could just put the time encoding directly in the recipe. I'm wondering what would be the right way to specify this... 🤔

Would this only work when all input files had the same time encoding? E.g. I guess we'd be back to the current issue in situations like your example above that has multiple units:

0 hours since 2000-01-01 00:00:00
0 hours since 2000-01-01 01:00:00
0 hours since 2000-01-01 02:00:00

@rabernat (Contributor):

Would this only work when all input files had the same time encoding?

No, because the data are decoded before they are written, and then re-encoded to match the target encoding.

def _store_data(vname: str, var: xr.Variable, index: Index, zgroup: zarr.Group) -> None:
    zarr_array = zgroup[vname]
    # get encoding for variable from zarr attributes
    var_coded = var.copy()  # copy needed for test suite to avoid modifying inputs in-place
    var_coded.encoding.update(zarr_array.attrs)
    var_coded.attrs = {}
    var = xr.backends.zarr.encode_zarr_variable(var_coded)
    data = np.asarray(var.data)
    region = _region_for(var, index)
    zarr_array[region] = data

@derekocallaghan (Contributor, Author):

Would this only work when all input files had the same time encoding?

No, because the data are decoded before they are written, and then re-encoded to match the target encoding.

I understand it now, i.e. the specified target encoding would override any potential variation in input file encodings.

@rabernat (Contributor):

I opened #480 to track the idea of specifying encoding in the recipe. We can address that in a follow-up PR. For now this is a big step forward.
