
sparse data #133

Open
mikapfl opened this issue Nov 2, 2022 · 11 comments
Labels: priority: high (High priority issue: prioritize and solve as soon as possible)

Comments

@mikapfl (Member) commented Nov 2, 2022

We need more efficient handling of sparse data. Having to think about sparseness for every data set individually is worse than just having algorithms which handle sparse data.

In xarray (https://docs.xarray.dev/en/stable/internals/duck-arrays-integration.html?highlight=sparse#integrating-with-duck-arrays and https://github.com/pydata/xarray/blob/main/xarray/tests/test_sparse.py) this functionality is still experimental, but it could be pretty much drop-in. Maybe we should try making a primap array sparse and see if our tests still pass; that would be great!
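
For reference, a minimal sketch of what such a test could look like: wrapping a sparse.COO duck array in an xarray DataArray and checking that basic operations keep the data sparse. This is not primap2-specific, and the dimension names are made up.

import numpy as np
import sparse
import xarray as xr

# almost everything is the fill value, so the COO representation stores a single entry
dense = np.zeros((1000, 1000))
dense[0, 0] = 1.0

da = xr.DataArray(sparse.COO.from_numpy(dense), dims=("area", "time"))

print(type(da.data))        # sparse.COO, not numpy.ndarray
print(da.data.nbytes)       # memory for the stored entries only
print(type((da * 2).data))  # arithmetic should keep the duck array sparse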

@JGuetschow added the priority: high label on Mar 27, 2023
@JGuetschow (Contributor) commented:

I set high priority for the test described above. If it turns out to be more work, the priority is medium, as I can currently avoid memory issues by reducing dimensionality early.

@mikapfl (Member, Author) commented Mar 29, 2023

Because sparse data was not originally a design goal for primap2, I expect it to be quite some work. If the xarray drop-in I highlighted above works, we still take on the maintenance burden of using experimental functionality in production. If the drop-in doesn't work, we probably have to fix things in xarray itself.

I think it would be reasonable to test the experimental functionality a bit, get a feel for how experimental it truly is, and then decide what we will do.

@mikapfl (Member, Author) commented May 16, 2023

Another strategy for solving this problem is a “divide and conquer” approach. We do this already when processing e.g. by country so that dimensions don't blow up. There is now an interesting project within the xarray ecosystem to provide tools to do exactly that: https://github.com/xarray-contrib/datatree . One of their examples is CMIP6, where each model has slightly different variables and different resolution etc. Using datatree, the data is grouped into one datatree consisting of multiple xarray datasets, where each dataset uses its own dimensions, but operations on the whole datatree are still possible.

Worth checking out whether this is a viable solution for us.
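
A minimal sketch of the grouping idea, assuming the DataTree API from the project linked above (the dataset names and contents are made up):

import numpy as np
import xarray as xr
from datatree import DataTree  # assumed import path for xarray-contrib/datatree

# two datasets with different dimensions; no blown-up common hypercube
ds_a = xr.Dataset(
    {"emissions": ("time", np.arange(3.0))},
    coords={"time": [2000, 2001, 2002]},
)
ds_b = xr.Dataset(
    {"emissions": (("time", "category"), np.ones((2, 4)))},
    coords={"time": [2000, 2005], "category": ["1", "1.A", "2", "3"]},
)

# each node keeps its own dimensions
dt = DataTree.from_dict({"source_a": ds_a, "source_b": ds_b})

# operations on the whole tree are mapped over the individual datasets
print(dt.mean(dim="time"))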

@JGuetschow (Contributor) commented:

Definitely worth looking at. It would introduce a further distinction between types of metadata (nodes, e.g. source, scenario, and model; variables, i.e. entity; dimensions, e.g. category, time, ...), making some functionality more complicated to implement, but keeping large datasets as one object without having to split them manually might make up for that.

@mikapfl (Member, Author) commented May 16, 2023

At least for the "primap database" with all bells and whistles, with data from CRF, EDGAR, etc. all in one database, we need something higher-level than an xarray dataset anyway, I think. But that is not a current problem (while the sparseness problem is).

@znicholls (Collaborator) commented Dec 4, 2023

Another option to consider: copy/vendor something out of scmdata.

These docs in scmdata show how we deal with the sparsity issue in a slightly opaque way (search for 'sparse' to find the most relevant section): https://scmdata.readthedocs.io/en/latest/notebooks/netcdf.html#splitting-your-data

A better example is probably the one below. The key output is that the array is about 10 times smaller and has about 11 times fewer NaNs when the sparsity is handled (the non_sparse_xr version). I think we should consider how we can do the same trick with primap data to avoid the memory issues we've discussed. The scmdata implementation code is here (lots of steps, but not terribly written): https://github.com/openscm/scmdata/blob/7813392bf974ce9454975423518c9249200cd1f9/src/scmdata/_xarray.py#L11C45-L11C51

We'd have to test how it actually works, but it might 'just work' as xarray does some smart things with collapsing dimensions and multi-indexing in my experience.

Example code with scmdata (something like this shows the sparsity handling with scmdata):

import numpy as np
from scmdata import ScmRun, run_append


generator = np.random.default_rng(0)


def new_timeseries(  # noqa: PLR0913
    n=100,
    count=1,
    model="example",
    scenario="ssp119",
    variable="Surface Temperature",
    unit="K",
    region="World",
    cls=ScmRun,
    **kwargs,
):
    """
    Create an example timeseries
    """
    data = generator.random((n, count)) * np.arange(n)[:, np.newaxis]
    index = 2000 + np.arange(n)
    return cls(
        data,
        columns={
            "model": model,
            "scenario": scenario,
            "variable": variable,
            "region": region,
            "unit": unit,
            **kwargs,
        },
        index=index,
    )


large_run = []
# 10 parameter sets for each scenario
for sce in ["ssp119", "ssp370", "ssp585"]:
    large_run.extend(
        [
            new_timeseries(
                count=3,
                scenario=sce,
                variable=[
                    "Surface Temperature",
                    "Atmospheric Concentrations|CO2",
                    "Radiative Forcing",
                ],
                unit=["K", "ppm", "W/m^2"],
                paraset_id=paraset_id,
            )
            for paraset_id in range(10)
        ]
    )

large_run = run_append(large_run)

# also set a run_id (often we'd have paraset_id and run_id,
# one which keeps track of the parameter set we've run and
# the other which keeps track of the run in a large ensemble)
large_run["run_id"] = large_run.meta.index.values

# This is super sparse because run_id and paraset_id have lots of non-overlapping
# combinations
sparse_xr = large_run.to_xarray(dimensions=["scenario", "run_id", "paraset_id"])
print(f"{sparse_xr=}")
print(f"{sparse_xr.sizes=}")
print(f"{sparse_xr.isnull().sum()=}")

# The extras argument allows you to get around this by effectively saying which
# dimensions can be inferred from others. With this information, we can avoid
# the sparsity issue
non_sparse_xr = large_run.to_xarray(dimensions=["scenario", "run_id"], extras=["paraset_id"])
print(f"{non_sparse_xr=}")
print(f"{non_sparse_xr.sizes=}")
print(f"{non_sparse_xr.isnull().sum()=}")

print(
    "Ratio of nulls in sparse vs. non-sparse: "
    f"{sparse_xr.isnull().sum() / non_sparse_xr.isnull().sum()}"
)

@mikapfl (Member, Author) commented Dec 4, 2023

@znicholls Nice, thanks

Just to understand the strategy, I'll try to summarize it and you can tell me if I misunderstood:

  1. We identify the sparse dimensions. Generally, a sparse dimension is one where, if you project this dimension out, the relative sparsity of the array decreases significantly.
  2. Then, we identify all combinations of the sparse dimensions which are actually filled with data.
  3. We assign a primary key to these data points; the only important thing is that the values are unique, otherwise we don't care about the contents of the primary key. Small integers are probably a good choice for readability, but UUIDs would also be fine technically.
  4. We now use this primary key as the dimension in the xarray.DataArray, and demote the other dimensions it replaces to mere coordinates which use the new primary key as their dimension (sketched below).

Right?
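
For concreteness, a minimal sketch of steps 2-4 in plain xarray (the dimension name "point" and the example coordinate values are made up):

import numpy as np
import pandas as pd
import xarray as xr

# 2. only the (run_id, paraset_id) combinations which actually contain data
filled = pd.DataFrame({"run_id": [24, 25, 26], "paraset_id": [8, 8, 8]})

# 3. a primary key for each filled combination; small integers for readability
primary_key = np.arange(len(filled))

# 4. the primary key becomes the dimension, and the sparse dimensions are
#    demoted to non-dimension coordinates defined on it
da = xr.DataArray(
    np.zeros((len(filled), 3)),
    dims=("point", "time"),
    coords={
        "point": primary_key,
        "run_id": ("point", filled["run_id"].to_numpy()),
        "paraset_id": ("point", filled["paraset_id"].to_numpy()),
        "time": [2000, 2001, 2002],
    },
)
print(da)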

That's quite nifty. If you use set_xindex() to add an index on the primary key, you can even select efficiently and easily on the sparse coordinates:

In [16]: non_sparse_xr = non_sparse_xr.set_xindex("paraset_id")

In [17]: non_sparse_xr.loc[{"paraset_id": 8}]
Out[17]: 
<xarray.Dataset>
Dimensions:                         (time: 100, run_id: 9, scenario: 3)
Coordinates:
  * time                            (time) object 2000-01-01 00:00:00 ... 209...
  * run_id                          (run_id) int64 24 25 26 54 55 56 84 85 86
  * scenario                        (scenario) object 'ssp119' 'ssp370' 'ssp585'
  * paraset_id                      (run_id) int64 8 8 8 8 8 8 8 8 8
Data variables:
    Atmospheric Concentrations|CO2  (scenario, run_id, time) float64 nan ... nan
    Radiative Forcing               (scenario, run_id, time) float64 nan ... ...
    Surface Temperature             (scenario, run_id, time) float64 0.0 ... nan
Attributes:
    scmdata_metadata_region:  World
    scmdata_metadata_model:   example

This could honestly be a low-effort route that we can drop in with only 1 or 2 days of work and see how it works out in daily usage.

@mikapfl (Member, Author) commented Dec 4, 2023

@JGuetschow do you have a nice sparse dataset I can play with? I can of course easily build an artificial sparse dataset, but if you happen to just have something around which is real data and sparse, I could use that and be closer to real usage than with something artificially generated.

@JGuetschow (Contributor) commented:

Any of the output/BUR_NIR_combined_all_XXXX datasets in the UNFCCC_data repo (it's private, but you should have access).

@znicholls (Collaborator) commented:

> Right?

You got it (whether that primary key is another dimension or something we add automatically doesn't really matter; from memory, though, it was a bit of mucking around to automate).

> If you use set_xindex() to add an index on the primary key, you can even select efficiently and easily on the sparse coordinates

That's cool. Maybe xarray has thought about this problem too and has an even better solution; I don't think that API existed when we first wrote our implementation, and it seems quite specific to this kind of handling...

> This could honestly be a low-effort route that we can drop in with only 1 or 2 days of work and see how it works out in daily usage

Nice

@mikapfl (Member, Author) commented Jan 12, 2024

So I played around a little with this while waiting for Excel things in ndcs land.

It works in principle, but there are definitely some things to work out still:

  • The auxiliary dimension has to remain visible as a coordinate, otherwise alignment will not work at all. Also it has to be a pandas MultiIndex if I understand things correctly.
  • This means we have to standardize the naming of the auxiliary dimension and allow it in the data format. Not difficult, but work to do.
  • This also needs special-casing in conversions to interchange format - hardly surprising, but still work to do.
  • Problematic is that for alignment during arithmetic and similar operations, the auxiliary dimension is used, not the coordinates defined on it. In practice that means combining multiple datasets which both have auxiliary dimensions has to be done with care, to make sure the auxiliary dimensions do not overlap. Maybe we can define a standard which works surprisingly often (something like encoding the values of the real coordinates into the auxiliary dimension; see the sketch after this list), but in general I fear there might be surprises. Sure, non-dimension coordinates are a standard thing in xarray, but I'm not sure if they are supported as well as proper dimension coordinates.
  • We also need to think a bit about the point at which we define auxiliary coordinates - do it too early, and you have to re-define them when combining multiple datasets (see prior point). Do it too late, and you are already wasting lots of memory before conversion.
  • A central API we would need to check whether it can be made to work is da.pr.set (https://primap2.readthedocs.io/en/stable/usage.html#Setting-data).
  • Stuff like da.pr.loc does not work yet, because it works on dimensions, not coordinates. Probably rather easy to fix.
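
For illustration, a hedged sketch of such an encoding standard; the auxiliary dimension name "sparse_key" and the helper function are made up, and this only shows the idea of building non-overlapping keys from the real coordinate values:

import numpy as np
import pandas as pd
import xarray as xr


def with_auxiliary_dim(records: pd.DataFrame, values: np.ndarray) -> xr.DataArray:
    """Hang the sparse coordinates off one auxiliary dimension."""
    # deterministic key built from the real coordinate values
    key = [f"{a}|{c}" for a, c in zip(records["area"], records["category"])]
    return xr.DataArray(
        values,
        dims=("sparse_key", "time"),
        coords={
            "sparse_key": key,
            "area": ("sparse_key", records["area"].to_numpy()),
            "category": ("sparse_key", records["category"].to_numpy()),
            "time": [2000, 2001],
        },
    )


records = pd.DataFrame({"area": ["DEU", "FRA"], "category": ["1.A", "2"]})
da = with_auxiliary_dim(records, np.arange(4.0).reshape(2, 2))

# arithmetic aligns on "sparse_key": identical encodings line up, and keys built
# from different real coordinate values can never collide
print(da + da)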
