Datatree alignment docs #9501
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,7 +1,7 @@ | ||
.. _hierarchical-data: | ||
|
||
Hierarchical data | ||
============================== | ||
================= | ||
|
||
.. ipython:: python | ||
:suppress: | ||
|
@@ -15,6 +15,8 @@ Hierarchical data | |
|
||
%xmode minimal | ||
|
||
.. _why: | ||
|
||
Why Hierarchical Data? | ||
---------------------- | ||
|
||
|
@@ -644,3 +646,163 @@ We could use this feature to quickly calculate the electrical power in our signa | |
|
||
power = currents * voltages | ||
power | ||
|
||
.. _alignment-and-coordinate-inheritance: | ||
|
||
Alignment and Coordinate Inheritance | ||
------------------------------------ | ||
|
||
.. _data-alignment: | ||
|
||
Data Alignment | ||
~~~~~~~~~~~~~~ | ||
Comment on lines +657 to +658:
TODO: add comment about
this is the only note I have on open_groups, it probably deserves more. https://github.com/pydata/xarray/blob/main/doc/getting-started-guide/quick-overview.rst?plain=1#L284 |
||
|
||
The data in different datatree nodes are not totally independent. In particular, dimensions (and indexes) in child nodes must be aligned (LINK HERE) with those in their parent nodes. | ||
Review comment: This is where I would like to link to some generic documentation on what alignment is, but it doesn't really exist, see #9500. |
||
|
||
.. note:: | ||
If you were a previous user of the prototype `xarray-contrib/datatree <https://github.com/xarray-contrib/datatree>`_ package, this is different from what you're used to! | ||
In that package, nodes were completely unrelated to one another. The data model is now slightly stricter. | ||
This allows us to provide features like :ref:`coordinate-inheritance`. See the migration guide for more details on the differences (LINK). | ||
|
||
To demonstrate, let's first generate some example datasets which are not aligned with one another: | ||
|
||
.. ipython:: python | ||
|
||
# (drop the attributes just to make the printed representation shorter) | ||
ds = xr.tutorial.open_dataset("air_temperature").drop_attrs() | ||
|
||
ds_daily = ds.resample(time="D").mean("time") | ||
ds_weekly = ds.resample(time="W").mean("time") | ||
ds_monthly = ds.resample(time="ME").mean("time") | ||
|
||
These datasets have different lengths along the ``time`` dimension, and are therefore not aligned along that dimension. | ||
|
||
.. ipython:: python | ||
|
||
ds_daily.sizes | ||
ds_weekly.sizes | ||
ds_monthly.sizes | ||
|
||
We cannot store these variables on a single :py:class:`~xarray.Dataset` object, because they do not align exactly: | ||
Review comment: I guess it would be more correct to say that we cannot store them unchanged. |
||
|
||
.. ipython:: python | ||
:okexcept: | ||
|
||
xr.align(ds_daily, ds_weekly, ds_monthly, join="exact") | ||
|
||
But we :ref:`previously said <why>` that multi-resolution data is a good use case for :py:class:`~xarray.DataTree`, so surely we should be able to store these in a single :py:class:`~xarray.DataTree`? | ||
If we first try to create a :py:class:`~xarray.DataTree` with these different-length time dimensions present in both parents and children, we will still get an alignment error: | ||
|
||
.. ipython:: python | ||
:okexcept: | ||
|
||
xr.DataTree.from_dict({"daily": ds_daily, "daily/weekly": ds_weekly}) | ||
|
||
(TODO: Looks like this error message could be improved by including information about which sizes are not equal.) | ||
|
||
This is because DataTree checks that data in child nodes align exactly with their parents. | ||
|
||
.. note:: | ||
This requirement of aligned dimensions is similar to netCDF's concept of `inherited dimensions <https://www.unidata.ucar.edu/software/netcdf/workshops/2007/groups-types/Introduction.html>`_, as in netCDF-4 files dimensions are `visible to all child groups <https://docs.unidata.ucar.edu/netcdf-c/current/groups.html>`_. | ||
|
||
This alignment check is performed up through the tree, all the way to the root, and is therefore equivalent to requiring that this :py:func:`~xarray.align` command succeeds: | ||
|
||
.. code:: python | ||
|
||
xr.align(child.dataset, *(parent.dataset for parent in child.parents), join="exact") | ||
|
||
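The check described above can be sketched with plain datasets. This is only an illustration: ``aligns_exactly`` is a hypothetical helper (not part of xarray), and the small synthetic datasets stand in for a parent node and two candidate children. It relies on :py:func:`~xarray.align` raising a ``ValueError`` (or a subclass of it) when ``join="exact"`` cannot be satisfied.

```python
import numpy as np
import xarray as xr


def aligns_exactly(child_ds, *ancestor_datasets):
    """Return True if child_ds aligns exactly with every ancestor dataset.

    Hypothetical helper for illustration only; not part of xarray.
    """
    try:
        xr.align(child_ds, *ancestor_datasets, join="exact")
    except ValueError:
        # join="exact" refuses to reindex, so any mismatch raises
        return False
    return True


# Synthetic stand-ins for a parent node and two candidate child nodes
parent = xr.Dataset(coords={"time": np.arange(6)})
same_time = xr.Dataset(
    {"a": ("time", np.ones(6))}, coords={"time": np.arange(6)}
)
coarser_time = xr.Dataset(
    {"a": ("time", np.ones(3))}, coords={"time": np.arange(0, 6, 2)}
)

print(aligns_exactly(same_time, parent))
print(aligns_exactly(coarser_time, parent))
```

A dataset that fails this check against a prospective parent would have to be placed elsewhere in the tree, for example as a sibling rather than a child.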
To represent our unalignable data in a single :py:class:`~xarray.DataTree`, we must instead place all variables which are a function of these different-length dimensions into nodes that are not direct descendants of one another, e.g. organize them as siblings. | ||
|
||
.. ipython:: python | ||
|
||
dt = xr.DataTree.from_dict( | ||
{"daily": ds_daily, "weekly": ds_weekly, "monthly": ds_monthly} | ||
) | ||
dt | ||
|
||
Now we have a valid :py:class:`~xarray.DataTree` structure which contains all the data at each different time frequency, stored in a separate group. | ||
|
||
This is a useful way to organise our data because we can still operate on all the groups at once. | ||
For example, we can extract all three timeseries at a specific lat-lon location: | ||
|
||
.. ipython:: python | ||
|
||
dt.sel(lat=75, lon=300) | ||
|
||
or compute the standard deviation of each timeseries to find out how it varies with sampling frequency: | ||
|
||
.. ipython:: python | ||
|
||
dt.std(dim="time") | ||
|
||
.. _coordinate-inheritance: | ||
|
||
Coordinate Inheritance | ||
~~~~~~~~~~~~~~~~~~~~~~ | ||
|
||
Notice that in the trees we constructed above (LINK OR DISPLAY AGAIN?) there is some redundancy - the ``lat`` and ``lon`` variables appear in each sibling group, but are identical across the groups. | ||
|
||
We can use "Coordinate Inheritance" to define them only once in a parent group and remove this redundancy, whilst still being able to access those coordinate variables from the child groups. | ||
|
||
.. note:: | ||
This is also a new feature relative to the prototype `xarray-contrib/datatree <https://github.com/xarray-contrib/datatree>`_ package. | ||
|
||
Let's instead place only the time-dependent variables in the child groups, and put the non-time-dependent ``lat`` and ``lon`` variables in the parent (root) group: | ||
|
||
.. ipython:: python | ||
|
||
dt = xr.DataTree.from_dict( | ||
{ | ||
"/": ds.drop_dims("time"), | ||
"daily": ds_daily.drop_vars(["lat", "lon"]), | ||
"weekly": ds_weekly.drop_vars(["lat", "lon"]), | ||
"monthly": ds_monthly.drop_vars(["lat", "lon"]), | ||
} | ||
) | ||
dt | ||
|
||
This is preferred to the previous representation because it now makes it clear that all of these datasets share common spatial grid coordinates. | ||
Defining the common coordinates just once also ensures that the spatial coordinates for each group cannot become out of sync with one another during operations. | ||
|
||
We can still access the coordinates defined in the parent groups from any of the child groups as if they were actually present on the child groups: | ||
|
||
.. ipython:: python | ||
|
||
dt.daily.coords | ||
dt["daily/lat"] | ||
|
||
(TODO: the repr of ``dt.coords`` should display which coordinates are inherited) | ||
|
||
As we can still access them, we say that the ``lat`` and ``lon`` coordinates in the child groups have been "inherited" from their common parent group. | ||
|
||
If we print just one of the child nodes, it will still display inherited coordinates, but explicitly mark them as such: | ||
|
||
.. ipython:: python | ||
|
||
print(dt["/daily"]) | ||
|
||
This helps to differentiate which variables are defined on the datatree node that you are currently looking at, and which were defined somewhere above it. | ||
|
||
We can also still perform all the same operations on the whole tree: | ||
|
||
.. ipython:: python | ||
:okexcept: | ||
|
||
dt.sel(lat=[75], lon=[300]) | ||
|
||
dt.std(dim="time") | ||
|
||
(TODO: The second one fails due to https://github.com/pydata/xarray/issues/8949) | ||
|
||
.. _overriding-inherited-coordinates: | ||
|
||
Overriding Inherited Coordinates | ||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
|
||
We can override inherited coordinates with newly-defined ones, as long as those newly-defined coordinates also align with the parent nodes. | ||
|
||
EXAMPLE OF THIS? WOULD IT MAKE MORE SENSE TO USE DIFFERENT DATA TO DEMONSTRATE THIS? | ||
|
||
EXAMPLE OF INHERITING FROM A GRANDPARENT? | ||
|
||
EXPLAIN DEDUPLICATION? |
Review comment: I want to use this link but I can't seem to get it to work properly