DAS-2155: Updates datastructures with extensive (too?) example
Updates roadmap
Squashes a bunch of it's -> its typos.
flamingbear committed Jul 26, 2024
1 parent b6e6163 commit 9e4b737
Showing 5 changed files with 58 additions and 40 deletions.
3 changes: 3 additions & 0 deletions doc/getting-started-guide/quick-overview.rst
@@ -229,6 +229,9 @@ You can directly read and write xarray objects to disk using :py:meth:`~xarray.D
It is common for datasets to be distributed across multiple files (commonly one file per timestep). Xarray supports this use-case by providing the :py:meth:`~xarray.open_mfdataset` and the :py:meth:`~xarray.save_mfdataset` methods. For more, see :ref:`io`.


.. _quick-overview-datatrees:

DataTrees
---------

17 changes: 13 additions & 4 deletions doc/roadmap.rst
@@ -201,6 +201,13 @@ extensions.
Tree-like data structure
++++++++++++++++++++++++

.. note::

After some time, the community DataTree project has now been updated and
merged into xarray, exposing :py:class:`xarray.DataTree`. It has only just been
released and is still somewhat experimental, but please try it out and let us
know what you think. Take a look at our :ref:`quick-overview-datatrees` quickstart.

Xarray’s highest-level object was previously an ``xarray.Dataset``, whose data
model echoes that of a single netCDF group. However real-world datasets are
often better represented by a collection of related Datasets. Particularly common
@@ -219,10 +226,12 @@ A new tree-like data structure, ``xarray.DataTree``, which is essentially a
structured hierarchical collection of Datasets, represents these cases and
instead maps to multiple netCDF groups (see :issue:`4118`).

Currently there are several libraries which have wrapped xarray in order to build
domain-specific data structures (e.g. `xarray-multiscale <https://github.com/JaneliaSciComp/xarray-multiscale>`__.),
but the general ``xarray.DataTree`` object obviates the need for these and
consolidates effort in a single domain-agnostic tool, much as xarray has already achieved.
Currently there are several libraries which have wrapped xarray in order to
build domain-specific data structures (e.g. `xarray-multiscale
<https://github.com/JaneliaSciComp/xarray-multiscale>`__), but the general
``xarray.DataTree`` object obviates the need for these and consolidates effort
in a single domain-agnostic tool, much as xarray has already achieved.


Labeled array without coordinates
+++++++++++++++++++++++++++++++++
74 changes: 40 additions & 34 deletions doc/user-guide/data-structures.rst
@@ -97,7 +97,7 @@ Coordinates can be specified in the following ways:
arguments for :py:class:`~xarray.Variable`
* A pandas object or scalar value, which is converted into a ``DataArray``
* A 1D array or list, which is interpreted as values for a one dimensional
coordinate variable along the same dimension as it's name
coordinate variable along the same dimension as its name

- A dictionary of ``{coord_name: coord}`` where values are of the same form
as the list. Supplying coordinates as a dictionary allows other coordinates
@@ -260,8 +260,6 @@ In this example, it would be natural to call ``temperature`` and
variables" because they label the points along the dimensions. (see [1]_ for
more background on this example).

.. _dataarray constructor:

Creating a Dataset
~~~~~~~~~~~~~~~~~~

@@ -276,7 +274,7 @@ variables (``data_vars``), coordinates (``coords``) and attributes (``attrs``).
arguments for :py:class:`~xarray.Variable`
* A pandas object, which is converted into a ``DataArray``
* A 1D array or list, which is interpreted as values for a one dimensional
coordinate variable along the same dimension as it's name
coordinate variable along the same dimension as its name

- ``coords`` should be a dictionary of the same form as ``data_vars``.

@@ -614,17 +612,15 @@ instance, if we try to create a `cycle`, where the root node is also a child of
dt.parent = node3
Alternatively you can also create a ``DataTree`` object from
Alternatively you can also create a ``DataTree`` object from:

- A dictionary mapping directory-like paths to either ``DataTree`` nodes or
data, using :py:meth:`xarray.DataTree.from_dict()`,
- A well formed netCDF or Zarr file on disk with
:py:func:`open_datatree()`. See :ref:`reading and writing files <io>`.
- A dictionary mapping directory-like paths to either ``DataTree`` nodes or data, using :py:meth:`xarray.DataTree.from_dict()`,
- A well-formed netCDF or Zarr file on disk with :py:func:`open_datatree()`. See :ref:`reading and writing files <io>`.

For data files with groups that do not align see
:py:func:`xarray.open_groups()` or use
:py:func:`xarray.open_dataset(group='target_group')`. For more information
about coordinate alignment see :ref:`datatree-inheritance`
:py:func:`xarray.open_groups()` or target each group individually using
:py:func:`xarray.open_dataset(group='groupname') <xarray.open_dataset>`. For
more information about coordinate alignment see :ref:`datatree-inheritance`.
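
As a rough illustration of how directory-like paths map onto nested nodes, the following plain-Python sketch nests a flat ``{path: data}`` mapping into a tree. It is a toy model, not xarray's implementation: ``build_tree`` and the ``"data"``/``"children"`` node layout are hypothetical names chosen for this example.

```python
from pathlib import PurePosixPath


def build_tree(path_to_data):
    """Nest a flat {path: data} mapping into {"data": ..., "children": {...}} nodes."""
    root = {"data": None, "children": {}}
    for path, data in path_to_data.items():
        node = root
        # PurePosixPath("/a/b").parts == ("/", "a", "b"); skip the root marker.
        for name in PurePosixPath(path).parts:
            if name == "/":
                continue
            # Create intermediate nodes on demand, like directories in a path.
            node = node["children"].setdefault(
                name, {"data": None, "children": {}}
            )
        node["data"] = data
    return root


tree = build_tree(
    {
        "/": "root data",
        "/weather": "weather data",
        "/weather/temperature": "temperature data",
    }
)
```

Each path segment names one child, so ``"/weather/temperature"`` ends up as a child of ``"weather"``, which is in turn a child of the root.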



@@ -642,7 +638,7 @@ but with values given by either ``xarray.DataArray`` objects or other
Iterating over keys will iterate over both the names of variables and child nodes.

We can also access all the data in a single node through a dataset-like view
We can also access all the data in a single node, and its inherited coordinates, through a dataset-like view

.. ipython:: python
@@ -658,6 +654,15 @@ as a new (and mutable) ``xarray.Dataset`` object via
dt["a"].to_dataset()
The same call can return only the node's local variables, without any
inherited ones, by setting the ``inherited`` keyword to ``False``. In this example
there are no inherited coordinates, so the result is the same as the previous call.

.. ipython:: python
dt["a"].to_dataset(inherited=False)
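
To show what the difference between an inherited and a local-only view means when ancestors do define coordinates, here is a minimal plain-Python sketch. It is a toy model under stated assumptions, not xarray's code: the ``Node`` class and its ``coord_view`` method are hypothetical, and only coordinate names and values are tracked.

```python
class Node:
    """Toy tree node holding only coordinate values (hypothetical, not xarray's class)."""

    def __init__(self, coords=None, parent=None):
        self.coords = dict(coords or {})
        self.parent = parent

    def coord_view(self, inherited=True):
        """Return this node's coordinates, optionally merged with its ancestors'."""
        if not inherited or self.parent is None:
            return dict(self.coords)
        # Collect everything defined above this node, then add the local entries.
        merged = self.parent.coord_view(inherited=True)
        merged.update(self.coords)
        return merged


root = Node(coords={"time": ["2022-01", "2023-01"]})
weather = Node(coords={"station": list("abcdef")}, parent=root)
temperature = Node(parent=weather)

print(sorted(temperature.coord_view()))                 # ['station', 'time']
print(sorted(temperature.coord_view(inherited=False)))  # []
```

In this toy model the ``temperature`` node defines no coordinates of its own, so the local-only view is empty while the inherited view contains both ``time`` and ``station``.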
Like with ``Dataset``, you can access the data and coordinate variables of a
node separately via the ``data_vars`` and ``coords`` attributes:

@@ -702,24 +707,25 @@ DataTree Inheritance
~~~~~~~~~~~~~~~~~~~~

DataTree implements a simple inheritance mechanism. Coordinates and their
associated indices are propagated from each node downward starting from the
root node. Coordinate inheritance was inspired by the NetCDF-CF inherited
dimensions, but DataTree's inheritance is slightly stricter and easier to
reason about.
associated indices are propagated downward from the root node to all
descendent nodes. Coordinate inheritance was inspired by the NetCDF-CF
inherited dimensions, but DataTree's inheritance is slightly stricter yet
easier to reason about.

The constraint that this puts on a DataTree is that dimensions and indices that
are inherited must be aligned with any child's existing dimension or index.
This allows child nodes to use dimensions defined in ancestor nodes, without
duplicating that information, but on the flip side if a dimension dimname is
defined in on a node and that same dimname dimension in one of it's ancestors,
they must align (have the same index and size).
are inherited must be aligned with any direct descendent node's existing
dimension or index. This allows descendants to use dimensions defined in
ancestor nodes, without duplicating that information. But as a consequence, if
a dimension ``dimension-name`` is defined on a node and that same
``dimension-name`` exists in one of its ancestors, they must align (have the
same index and size).
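
The alignment constraint can be sketched in plain Python as a check that any coordinate a node shares with an ancestor has identical values. This is an illustrative toy, not xarray's alignment logic: ``check_alignment`` is a hypothetical helper, and coordinates are modeled as plain lists.

```python
def check_alignment(ancestor_coords, child_coords):
    """Raise ValueError if a coordinate shared with an ancestor has different values."""
    for name, values in child_coords.items():
        if name in ancestor_coords and list(ancestor_coords[name]) != list(values):
            raise ValueError(f"coordinate {name!r} is not aligned with an ancestor")


root_coords = {"time": ["2022-01", "2023-01"]}

# A dimension that no ancestor defines is always acceptable.
check_alignment(root_coords, {"station": list("abcdef")})

# Redefining an inherited dimension with the same index is acceptable too.
check_alignment(root_coords, {"time": ["2022-01", "2023-01"]})

# A different size (or different labels) for an inherited dimension is rejected.
try:
    check_alignment(root_coords, {"time": ["2022-01"]})
    aligned = True
except ValueError:
    aligned = False
```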

Some examples:

.. ipython:: python
# Set up coordinates
times = xr.DataArray(data=["2022-01", "2023-01"], dims="time")
time = xr.DataArray(data=["2022-01", "2023-01"], dims="time")
stations = xr.DataArray(data=list("abcdef"), dims="station")
lon = [-100, -80, -60]
lat = [10, 20, 30]
@@ -728,29 +734,29 @@ Some examples:
wind_speed = xr.DataArray(np.ones((2, 6)) * 2, dims=("time", "station"))
pressure = xr.DataArray(np.ones((2, 6)) * 3, dims=("time", "station"))
air_temperature = xr.DataArray(np.ones((2, 6)) * 4, dims=("time", "station"))
dewpoint_temp = xr.DataArray(np.ones((2, 6)) * 5, dims=("time", "station"))
dewpoint = xr.DataArray(np.ones((2, 6)) * 5, dims=("time", "station"))
infrared = xr.DataArray(np.ones((2, 3, 3)) * 6, dims=("time", "lon", "lat"))
true_color = xr.DataArray(np.ones((2, 3, 3)) * 7, dims=("time", "lon", "lat"))
dt2 = xr.DataTree.from_dict(
{
"/": xr.Dataset(
coords={"time": times},
coords={"time": time},
),
"/weather_data": xr.Dataset(
"/weather": xr.Dataset(
coords={"station": stations},
data_vars={
"wind_speed": wind_speed,
"pressure": pressure,
},
),
"/weather_data/temperature": xr.Dataset(
"/weather/temperature": xr.Dataset(
data_vars={
"air_temperature": air_temperature,
"dewpoint_temp": dewpoint_temp,
"dewpoint": dewpoint,
},
),
"/satellite_image": xr.Dataset(
"/satellite": xr.Dataset(
coords={"lat": lat, "lon": lon},
data_vars={
"infrared": infrared,
@@ -765,21 +771,21 @@ Here there are four different coordinate variables, which apply to variables in

``time`` is a shared coordinate used by both ``weather`` and ``satellite`` variables
``station`` is used only for ``weather`` variables
``lat`` and ``lon`` are only use for ``satellite images``
``lat`` and ``lon`` are only used for ``satellite`` images

Coordinate variables are inherited by descendent nodes, which means that
variables at different levels of a hierarchical DataTree are always
aligned. Placing the ``time`` variable at the root node automatically indicates
that it applies to all descendent nodes. Similarly, ``station`` is in the base
``weather_data`` node, because it applies to all weather variables, both directly
in ``weather_data`` and in the ``temperature`` sub-tree.
``weather`` node, because it applies to all weather variables, both directly
in ``weather`` and in the ``temperature`` sub-tree.

Accessing any of the lower level trees as an ``xarray.Dataset`` would
automatically include coordinates from higher levels (e.g., time):
automatically include coordinates from higher levels (e.g., ``time`` and ``station``):

.. ipython:: python
dt2["/weather_data/temperature"].ds
dt2["/weather/temperature"].ds
.. _coordinates:
2 changes: 1 addition & 1 deletion doc/user-guide/terminology.rst
@@ -263,7 +263,7 @@ complete examples, please consult the relevant documentation.*
:term:`variables<Variable>`, :term:`dimensions<Dimension>`, :term:`coordinates<Coordinate>`,
and attributes.

The nodes in a tree are linked to one another, and each node is it's own instance of
The nodes in a tree are linked to one another, and each node is its own instance of
``DataTree`` object. Each node can have zero or more *children* (stored in a dictionary-like
manner under their corresponding *names*), and those child nodes can themselves have
children. If a node is a child of another node that other node is said to be its *parent*.
2 changes: 1 addition & 1 deletion xarray/backends/api.py
@@ -1669,7 +1669,7 @@ def to_zarr(
_validate_dataset_names(dataset)

if zarr_version is None:
# default to 2 if store doesn't specify it's version (e.g. a path)
# default to 2 if store doesn't specify its version (e.g. a path)
zarr_version = int(getattr(store, "_store_version", 2))

if consolidated is None and zarr_version > 2: