Datatree alignment docs #9501

Open
wants to merge 23 commits into
base: main
Changes from 8 commits
ae71437
remove too-long underline
TomNicholas Sep 15, 2024
928767a
draft section on data alignment
TomNicholas Sep 15, 2024
1adb945
fixes
TomNicholas Sep 15, 2024
ae1bcfd
draft section on coordinate inheritance
TomNicholas Sep 15, 2024
f025371
various improvements
TomNicholas Sep 15, 2024
7549ee9
more improvements
TomNicholas Sep 15, 2024
b631697
link from other page
TomNicholas Sep 15, 2024
02bf96b
align call include all 3 datasets
TomNicholas Sep 15, 2024
152d74a
link back to use cases
TomNicholas Sep 15, 2024
57b7f06
clarification
TomNicholas Sep 15, 2024
d3ac1a7
small improvements
TomNicholas Sep 15, 2024
adf7579
Merge branch 'main' into datatree_alignment_docs
TomNicholas Sep 23, 2024
d73dd8a
remove TODO after #9532
TomNicholas Sep 23, 2024
d779e22
add todo about #9475
TomNicholas Sep 23, 2024
3c9ad55
correct xr.align example call
TomNicholas Sep 23, 2024
5a4309a
Merge branch 'datatree_alignment_docs' of https://github.com/TomNicho…
TomNicholas Sep 23, 2024
4cee745
add links to netCDF4 documentation
TomNicholas Sep 23, 2024
4c030d8
Consistent voice
TomNicholas Sep 23, 2024
09385fd
Merge branch 'main' into datatree_alignment_docs
TomNicholas Sep 26, 2024
35ab311
Merge branch 'main' into datatree_alignment_docs
TomNicholas Oct 6, 2024
6db4a0b
keep indexes in lat lon selection to dodge #9475
TomNicholas Oct 6, 2024
22f2726
Merge branch 'datatree_alignment_docs' of https://github.com/TomNicho…
TomNicholas Oct 6, 2024
e879dbb
unpack generator properly
TomNicholas Oct 6, 2024
1 change: 1 addition & 0 deletions doc/user-guide/data-structures.rst
@@ -800,6 +800,7 @@ included by default unless you exclude them with the ``inherited`` flag:

    dt2["/weather/temperature"].to_dataset(inherited=False)

For more examples and further discussion see LINK
Member Author @TomNicholas, Sep 15, 2024:

I want to use this link but I can't seem to get it to work properly

:ref:`Alignment and Coordinate Inheritance <userguide.hierarchical-data.alignment-and-coordinate-inheritance>


.. _coordinates:

163 changes: 162 additions & 1 deletion doc/user-guide/hierarchical-data.rst
@@ -1,7 +1,7 @@
.. _hierarchical-data:

Hierarchical data
-==============================
+=================

.. ipython:: python
    :suppress:

@@ -644,3 +644,164 @@ We could use this feature to quickly calculate the electrical power in our signa

    power = currents * voltages
    power

.. _alignment-and-coordinate-inheritance:

Alignment and Coordinate Inheritance
------------------------------------

.. _data-alignment:

Data Alignment
~~~~~~~~~~~~~~
Comment on lines +657 to +658
Member Author:

TODO: add comment about open_groups being useful if your data doesn't align

Contributor:

this is the only note I have on open_groups, it probably deserves more. https://github.com/pydata/xarray/blob/main/doc/getting-started-guide/quick-overview.rst?plain=1#L284


The data in different datatree nodes are not totally independent. In particular, dimensions (and indexes) in child nodes must be aligned (LINK HERE) with those in their parent nodes.
Member Author:

This is where I would like to link to some generic documentation on what alignment is, but it doesn't really exist, see #9500.


.. note::
    If you previously used the prototype `xarray-contrib/datatree <https://github.com/xarray-contrib/datatree>`_ package, this is different from what you're used to!
    In that package's data model, nodes were completely unrelated to one another. The data model is now slightly stricter.
    This allows us to provide features like :ref:`coordinate-inheritance`. See the migration guide for more details on the differences (LINK).

To demonstrate, let's first generate some example datasets which are not aligned with one another:

.. ipython:: python

    # (drop the attributes just to make the printed representation shorter)
    ds = xr.tutorial.open_dataset("air_temperature").drop_attrs()

    ds_daily = ds.resample(time="D").mean("time")
    ds_weekly = ds.resample(time="W").mean("time")
    ds_monthly = ds.resample(time="ME").mean("time")

These datasets have different lengths along the ``time`` dimension, and are therefore not aligned along that dimension.

.. ipython:: python

    ds_daily.sizes
    ds_weekly.sizes
    ds_monthly.sizes

We cannot store these variables unchanged on a single :py:class:`~xarray.Dataset` object, because their ``time`` dimensions do not align exactly:
Member Author:

I guess it would be more correct to say that we cannot store them unchanged.


.. ipython:: python
    :okexcept:

    xr.align(ds_daily, ds_weekly, ds_monthly, join="exact")
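The same failure can be reproduced without downloading the tutorial data. The following is only a minimal sketch: the ``demo`` dataset, its grid values and the ``exact_join_failed`` flag are made up for illustration and are not part of the documentation being reviewed.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in for the tutorial dataset: 60 days of values on a 2 x 3 grid.
time = pd.date_range("2013-01-01", periods=60, freq="D")
values = np.random.default_rng(0).normal(size=(60, 2, 3))
demo = xr.Dataset(
    {"air": (("time", "lat", "lon"), values)},
    coords={"time": time, "lat": [70.0, 75.0], "lon": [290.0, 300.0, 310.0]},
)

demo_daily = demo.resample(time="D").mean()
demo_weekly = demo.resample(time="W").mean()

# join="exact" refuses to reindex, so differing time indexes raise an error.
try:
    xr.align(demo_daily, demo_weekly, join="exact")
    exact_join_failed = False
except ValueError as err:
    exact_join_failed = True
    print(f"join='exact' failed: {err}")

# Resampling leaves the spatial grid untouched, so the time-independent
# parts of the two datasets still align exactly.
grids = xr.align(
    demo_daily.drop_dims("time"), demo_weekly.drop_dims("time"), join="exact"
)
```

Catching ``ValueError`` is a deliberately broad choice here, so that the sketch does not depend on which specific alignment exception class a given xarray version raises.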

But we previously said that multi-resolution data is a good use case for :py:class:`~xarray.DataTree`, so surely we should be able to store these in a single :py:class:`~xarray.DataTree`?
If we first try to create a :py:class:`~xarray.DataTree` with these different-length time dimensions present in both parents and children, we will still get an alignment error:

.. ipython:: python
    :okexcept:

    xr.DataTree.from_dict({"daily": ds_daily, "daily/weekly": ds_weekly})

(TODO: Looks like this error message could be improved by including information about which sizes are not equal.)

This is because DataTree checks that data in child nodes align exactly with their parents.

.. note::
    This requirement of aligned dimensions is similar to netCDF's concept of inherited dimensions (LINK TO NETCDF DOCUMENTATION?).
TomNicholas marked this conversation as resolved.

This alignment check is performed up through the tree, all the way to the root, and is therefore equivalent to requiring that this :py:func:`~xarray.align` call succeeds:

.. code::

    xr.align(child.dataset, *(parent.dataset for parent in child.parents), join="exact")
TomNicholas marked this conversation as resolved.

To represent our unalignable data in a single :py:class:`~xarray.DataTree`, we must instead place all variables which are a function of these different-length dimensions into nodes that are not parents of one another, i.e. organize them as siblings.

.. ipython:: python

    dt = xr.DataTree.from_dict(
        {"daily": ds_daily, "weekly": ds_weekly, "monthly": ds_monthly}
    )
    dt

Now we have a valid :py:class:`~xarray.DataTree` structure which contains the data at different time frequencies.

This is a useful way to organise our data because we can still operate on all the groups at once.
For example we can extract all three timeseries at a specific lat-lon location:

.. ipython:: python

    dt.sel(lat=75, lon=300)

or compute the standard deviation of each timeseries to find out how it varies with sampling frequency:

.. ipython:: python

    dt.std(dim="time")

.. _coordinate-inheritance:

Coordinate Inheritance
~~~~~~~~~~~~~~~~~~~~~~

Notice that in the trees we constructed above (LINK OR DISPLAY AGAIN?) there is some redundancy: the ``lat`` and ``lon`` variables appear in each sibling group, but are identical across the groups.
We can use "Coordinate Inheritance" to define them only once in a parent group and remove this redundancy, whilst still being able to access those coordinate variables from the child groups.

.. note::
    This is also a new feature relative to the prototype `xarray-contrib/datatree <https://github.com/xarray-contrib/datatree>`_ package.

Let's instead place only the time-dependent variables in the child groups, and put the non-time-dependent ``lat`` and ``lon`` variables in the parent (root) group:

.. ipython:: python

    dt = xr.DataTree.from_dict(
        {
            "/": ds.drop_dims("time"),
            "daily": ds_daily.drop_vars(["lat", "lon"]),
            "weekly": ds_weekly.drop_vars(["lat", "lon"]),
            "monthly": ds_monthly.drop_vars(["lat", "lon"]),
        }
    )
    dt

(TODO: They are being displayed in child groups still, see https://github.com/pydata/xarray/issues/9499)

This is preferable to the previous representation because it makes clear that all of these datasets share common spatial grid coordinates.
Defining the common coordinates just once also ensures that the spatial coordinates for each group cannot become out of sync with one another during operations.

We can still access the coordinates defined in the parent groups from any of the child groups as if they were actually present on the child groups:

.. ipython:: python

    dt.daily.coords
    dt["daily/lat"]

(TODO: the repr of ``dt.coords`` should display which coordinates are inherited)

As we can still access them, we say that the ``lat`` and ``lon`` coordinates in the child groups have been "inherited" from their common parent group.

If we print just one of the child nodes, it will still display inherited coordinates, but explicitly mark them as such:

.. ipython:: python

    print(dt["/daily"])

This helps to differentiate which variables are defined on the datatree node that you are currently looking at, and which were defined somewhere above it.

We can also still perform all the same operations on the whole tree:

.. ipython:: python
    :okexcept:

    dt.sel(lat=75, lon=300)

    dt.std(dim="time")

(TODO: The second one fails due to https://github.com/pydata/xarray/issues/8949)

.. _overriding-inherited-coordinates:

Overriding Inherited Coordinates
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You can override inherited coordinates with newly-defined ones, as long as those newly-defined coordinates also align with the parent nodes.
TomNicholas marked this conversation as resolved.

EXAMPLE OF THIS? WOULD IT MAKE MORE SENSE TO USE DIFFERENT DATA TO DEMONSTRATE THIS?

EXAMPLE OF INHERITING FROM A GRANDPARENT?

EXPLAIN DEDUPLICATION?