Allow in-memory arrays with open_mfdataset #5704

Illviljan · 2021-08-13T09:50:26Z

The docstring seems to imply that it's possible to get in-memory arrays:

xarray/xarray/backends/api.py

Line 732 in 4bb9d9c

each dimension by ``chunks``. By default, chunks will be chosen to load entire

But it doesn't seem possible because of:

xarray/xarray/backends/api.py

Line 899 in 4bb9d9c

open_kwargs = dict(engine=engine, chunks=chunks or {}, **kwargs)

This PR removes that or check (chunks or {}), changes the default to chunks={}, and fixes the failing tests.
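To make the effect of the or check concrete, here is a minimal plain-Python sketch. The helper names `normalize_with_or` and `normalize_passthrough` are hypothetical and not part of xarray; they only mimic how the chunks argument could be forwarded, under the assumption that everything else stays the same.

```python
def normalize_with_or(chunks):
    """Mimic `chunks or {}`: every falsy value, including None, becomes {}."""
    return chunks or {}


def normalize_passthrough(chunks):
    """Mimic forwarding the argument unchanged, so None survives."""
    return chunks


# Compare both variants for the values a caller is likely to pass.
for value in (None, {}, {"time": 10}):
    print(
        repr(value),
        "->",
        repr(normalize_with_or(value)),
        "vs",
        repr(normalize_passthrough(value)),
    )
```

With the or check, a caller who passes chunks=None can never be told apart from one who passes chunks={}, which is why in-memory arrays are unreachable today.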

Noticed in #5689 (get open_mfdataset to return numpy arrays)
Closes #7792 (If "chunks=None" is set in open_mfdataset, it is changed to "chunks={}" before being passed to "_dataset_from_backend_dataset")
Closes #9038 (Can't call open_mfdataset without creating chunked dask arrays)
Tests added
Passes pre-commit run --all-files
User visible changes (including notable bug fixes) are documented in whats-new.rst

github-actions · 2021-08-13T10:18:22Z

Unit Test Results

        6 files         6 suites 55m 1s ⏱️
16 325 tests 14 581 ✔️ 1 744 💤 0 ❌
91 146 runs 82 854 ✔️ 8 292 💤 0 ❌

Results for commit 3444281.

♻️ This comment has been updated with latest results.

Illviljan · 2021-08-13T10:23:23Z

Many tests fail, but they seem to simply assume that open_mfdataset always returns dask arrays by default. Fixing them is probably as simple as adding chunks={} to each of these tests, but this is quite a breaking change.

Do you know the reason why chunks=chunks or {} is used in open_mfdataset, @aurghs?

raybellwaves · 2021-08-19T03:37:09Z

See #5689 for reference to this PR

Illviljan · 2021-08-21T21:06:38Z

One way of making this less controversial is to also change the default value of chunks from None to {} here

xarray/xarray/backends/api.py

Line 696 in 48a9dbe

chunks=None,

Then the default settings will behave the same as before, although it's still not consistent with the default parameters of xr.open_dataset, which open_mfdataset is just a thin wrapper around.

It is indeed bad practice to use dicts as default values, but it is not completely uncommon; see for example:

xarray/xarray/core/dataset.py

Line 2111 in 48a9dbe

] = {},  # {} even though it's technically unsafe, is being used intentionally here (#4667)
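As a reminder of why a dict default is considered unsafe, and when it is harmless, here is a small plain-Python sketch. Both functions are hypothetical illustrations and are not taken from xarray.

```python
def append_shared(item, bucket=[]):
    """Mutates its default argument, so state leaks between calls."""
    bucket.append(item)
    return bucket


def read_only_default(chunks={}):
    """Only reads the default, so sharing one dict object is harmless."""
    return dict(chunks)  # hand back a copy, never the shared default


first = append_shared(1)
second = append_shared(2)
print(first is second, second)  # the same list object is reused

result = read_only_default()
result["time"] = 10  # mutating the copy does not touch the default
print(read_only_default())
```

The hazard only appears when the function mutates the default or returns it to a caller who might; a default that is only read behaves like an immutable sentinel.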

shoyer · 2021-08-22T23:27:45Z

The reason why open_mfdataset always uses dask is because otherwise it would not be lazy: the netCDF files would be immediately read into memory as NumPy arrays. open_dataset uses Xarray's own internal lazy indexing machinery, but that machinery doesn't (yet) support lazy concatenation or broadcasting, so it doesn't suffice for open_mfdataset.

We certainly could make a similar change to this, but I would not do so by default. Or I would add support for lazy concatenation into xarray's lazy indexing, and then we could slowly roll out a breaking change (with appropriate FutureWarning, etc).
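To illustrate what "lazy concatenation" means here, the following is a toy, one-dimensional sketch in plain Python. The class `LazyConcat` is hypothetical and is not xarray's lazy indexing machinery; it only shows the idea that a concatenated view can defer reading each piece until a slice actually touches it.

```python
class LazyConcat:
    """Toy 1-D lazy concatenation: a piece is loaded only when indexed."""

    def __init__(self, loaders, lengths):
        self.loaders = loaders  # callables that return a list of values
        self.lengths = lengths  # length of each piece, known up front
        self.loaded = set()     # indices of the pieces that were read

    def __len__(self):
        return sum(self.lengths)

    def __getitem__(self, key):
        if not isinstance(key, slice):
            raise TypeError("this sketch only supports slices")
        start, stop, step = key.indices(len(self))
        if step != 1:
            raise ValueError("this sketch only supports step 1")
        out, offset = [], 0
        for i, (load, n) in enumerate(zip(self.loaders, self.lengths)):
            lo, hi = max(start, offset), min(stop, offset + n)
            if lo < hi:  # the requested range overlaps this piece
                self.loaded.add(i)
                out.extend(load()[lo - offset:hi - offset])
            offset += n
        return out


# Three in-memory lists stand in for three files on disk.
files = [list(range(0, 3)), list(range(3, 6)), list(range(6, 9))]
lazy = LazyConcat([lambda f=f: f for f in files], [len(f) for f in files])

subset = lazy[4:7]
print(subset, sorted(lazy.loaded))
```

Only the pieces that overlap the requested slice are read; the first "file" is never touched. Without something along these lines, concatenating the files up front forces every one of them into memory.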

Illviljan · 2021-08-23T04:55:06Z

That the arrays would be loaded into memory is what you would expect if a user insists on using chunks=None, right?

I just changed the default value to {}. So now it behaves as it did previously, but with the possibility of loading into memory, for whatever reason you might have, e.g. with small files.

TomNicholas · 2023-04-28T14:57:54Z

For the benefit of anyone else reading this having come from #7792 or similar questions - see #4628 and #5081 to see what needs to be done. Also see discussion in #6807 for non-dask lazy backends.

Illviljan · 2023-04-29T06:56:37Z

Those issues indeed have to be fixed if opening files lazily is the only option for xarray.

But xarray could also accept that chunks=None will (for now) load all the files into memory. If that's OK, I believe we can merge this now.
I suspect there are a few in-memory users out there that could make use of this.

TomNicholas · 2024-05-22T14:04:00Z

I also ran into a case where I wanted to be able to opt-in to using open_mfdataset without ever creating chunked arrays (and was happy to accept eager loading). (#9038)

It seems we have multiple issues and PRs asking for the same thing here, a way to avoid breaking changes (i.e. changing the default to {}), and a longer-term ideal plan (implementing lazy concatenation and changing the default to None with a deprecation cycle). I suggest we just move forward with merging this.

TomNicholas · 2024-06-05T17:01:51Z

So we (@shoyer and @dcherian) discussed this in the dev meeting call just now, and I think the conclusion was that:

Fixing this bug to make chunks=None not use dask, and therefore eagerly concatenate arrays, would be a breaking change for anyone who is currently passing chunks=None and getting lazy behaviour. Whilst the current behaviour is very misleading, changing it should still go through a deprecation cycle.
The alternative suggestion to make chunks={} the default would create an inconsistency between the defaults of open_dataset and open_mfdataset.
The only reason any of this is a problem is because we don't yet have xarray-native lazy concatenation (see Lazy concatenation of arrays #4628).
Therefore the eventual solution should involve implementing lazy concatenation, and having all the defaults and meaning of args be the same between open_dataset and open_mfdataset.
In fact this problem of "which library is handling lazy array operations" is complicated enough that it probably deserves its own argument - it also crosses over with the ChunkManager feature. The chunks kwarg is also kind of overloaded - it should just mean "what shape chunks do I want?".
But we do want an option to not use dask in open_mfdataset. So @shoyer's suggestion was to add another new argument to open_mfdataset that controls either whether or not to expect lazy behaviour or which array type is being used to represent the arrays to be concatenated. That way we can enable users to opt-in to eager loading, but keep the same set of defaults that we want in the long term, and have a deprecation cycle for changing the meaning of (/perhaps even just removing?) chunks=None.

I'm not quite clear on what the explicit suggestion for this new kwarg would be though... Or whether it can instead be a special ChunkManager? (e.g. chunked_array_type='numpy')

dcherian · 2024-06-05T17:06:57Z

Excellent summary, @TomNicholas.

I think we also agreed on changing the default in the signature of open_mfdataset to chunks={} but continue to treat chunks=None as synonymous with chunks={} for now.
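One way to picture this agreement is the sketch below. The helper `resolve_chunks` is hypothetical and not part of xarray, and emitting a FutureWarning for chunks=None is an assumption about how the deprecation cycle mentioned above might look, not something decided in this thread.

```python
import warnings


def resolve_chunks(chunks={}):
    """Hypothetical helper: treat chunks=None like chunks={} for now."""
    if chunks is None:
        # Assumed deprecation step: warn that the meaning may change later.
        warnings.warn(
            "chunks=None currently behaves like chunks={}; "
            "this may change in the future.",
            FutureWarning,
            stacklevel=2,
        )
        return {}
    return chunks


# Capture warnings so the behaviour can be inspected.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    resolved = resolve_chunks(None)

print(resolved, [w.category.__name__ for w in caught])
```

Keeping {} as the signature default preserves today's behaviour for callers who pass nothing, while the explicit None path remains a single place to change once a decision is made.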

Allow in memory arrays

69eb978

Illviljan added 6 commits August 13, 2021 21:49

fix tests

d2a474e

Update test_backends.py

9e3acc3

make sure ndarrays are returned when None

31a6ac0

update chunks with all conditions

f3234c4

Check the previous chunk default value as well

695477d

Update whats-new.rst

13f07c9

Illviljan marked this pull request as ready for review August 15, 2021 09:24

Merge branch 'main' into mfdataset_allow_numpy

5c66169

max-sixty added the needs discussion label Aug 22, 2021

Use {} as default value

873911a

Illviljan added 3 commits August 23, 2021 06:59

Update whats-new.rst

d9e1661

Update whats-new.rst

a6af772

Merge branch 'main' into mfdataset_allow_numpy

3444281

keewis mentioned this pull request Nov 25, 2021

Allow chunks=None for dask-less lazy loading #6028

Closed

5 tasks

Illviljan mentioned this pull request Nov 28, 2021

Threadlocking in DataArray calculations for zarr data depending on where it's loaded from (S3 vs local) #6033

Closed

Illviljan mentioned this pull request Apr 27, 2023

If "chunks=None" is set in open_mfdataset, it is changed to "chunks={}" before being passed to "_dataset_from_backend_dataset" #7792

Open

4 tasks

juseg added a commit to juseg/hyoga that referenced this pull request Jun 9, 2023

Add missing dependency to dask.

9bacda4

I just found out that `open_mfdataset` always requires dask even if `chunks=None`. This may change in the future (see pydata/xarray#5704).

Illviljan mentioned this pull request May 22, 2024

Can't call open_mfdataset without creating chunked dask arrays #9038

Open

5 tasks

Merge branch 'main' into mfdataset_allow_numpy

c227117

Update test_backends.py

04cf17f

Illviljan added 2 commits May 23, 2024 23:48

Merge branch 'main' into mfdataset_allow_numpy

9558e1c

Update whats-new.rst

12412db

dcherian reviewed May 24, 2024

dcherian requested a review from shoyer May 24, 2024 02:26

Illviljan commented May 24, 2024

Illviljan added 3 commits May 24, 2024 07:19

Update xarray/tests/test_backends.py

81dbef9

Update xarray/tests/test_backends.py

493680f

Apply suggestions from code review

fb4138d
