Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow in-memory arrays with open_mfdataset #5704

Open
wants to merge 19 commits into
base: main
Choose a base branch
from

Conversation

Illviljan
Copy link
Contributor

@Illviljan Illviljan commented Aug 13, 2021

The docstring seems to imply that it's possible to get in-memory arrays:

each dimension by ``chunks``. By default, chunks will be chosen to load entire

But it doesn't seem possible because of:

open_kwargs = dict(engine=engine, chunks=chunks or {}, **kwargs)

This PR removes that or check, changes the default to chunk={}, and fixes the failing tests.

@github-actions
Copy link
Contributor

github-actions bot commented Aug 13, 2021

Unit Test Results

         6 files           6 suites   55m 1s ⏱️
16 325 tests 14 581 ✔️ 1 744 💤 0
91 146 runs  82 854 ✔️ 8 292 💤 0

Results for commit 3444281.

♻️ This comment has been updated with latest results.

@Illviljan
Copy link
Contributor Author

Illviljan commented Aug 13, 2021

A lot of failing tests but they seem to just assume that open_mfdataset always returns dask arrays by default. Probably as simple as adding chunks={} in all these tests, but this is quite a breaking change.

Do you know the reason why chunks=chunks or {} is used in open_mfdataset, @aurghs?

@Illviljan Illviljan marked this pull request as ready for review August 15, 2021 09:24
@raybellwaves
Copy link
Contributor

raybellwaves commented Aug 19, 2021

See #5689 for reference to this PR

@Illviljan
Copy link
Contributor Author

One way of making this less controversial is to also change the default value of chunks from None to {} here

chunks=None,

Then the default settings will behave the same as before. Although it's still not consistent with xr.open_datasets default parameters which mfdataset is just a thin wrapper around.

It is indeed bad practice to use dicts as default value but not completely uncommon, see for example:

] = {}, # {} even though it's technically unsafe, is being used intentionally here (#4667)

@shoyer
Copy link
Member

shoyer commented Aug 22, 2021

The reason why open_mfdataset always uses dask is because otherwise it would not be lazy: the netCDF files would be immediately read into memory as NumPy arrays. open_dataset uses Xarray's own internal lazy indexing machinery, but that machinery doesn't (yet) support lazy concatenation or broadcasting, so it doesn't suffice for open_mfdataset.

We certainly could make a similar change to this, but I would not do so by default. Or I would add support for lazy concatenation into xarray's lazy indexing, and then we could slowly roll out a breaking change (with appropriate FutureWarning, etc).

@Illviljan
Copy link
Contributor Author

That the arrays would be loaded into memory is what you would expect if a user insists on using chunks=None right?

I just changed the default value to {}. So now it will behave as it did previously but with the possibility to load into memory for whatever reason you might have with small files.

@TomNicholas
Copy link
Member

For the benefit of anyone else reading this having come from #7792 or similar questions - see #4628 and #5081 to see what needs to be done. Also see discussion in #6807 for non-dask lazy backends.

@Illviljan
Copy link
Contributor Author

Illviljan commented Apr 29, 2023

Those issues indeed has to be fixed if opening files lazily is the only option for xarray.

But xarray could also accept that chunks=None will (for now) load all the files to memory. If that's ok we can merge this now I believe.
I suspect there are a few in-memory users out there that could make use of this.

juseg added a commit to juseg/hyoga that referenced this pull request Jun 9, 2023
I just found out that `open_mfdataset` always requires dask even if
`chunks=None`. This may change in the future (see pydata/xarray#5704).
@TomNicholas
Copy link
Member

TomNicholas commented May 22, 2024

I also ran into a case where I wanted to be able to opt-in to using open_mfdataset without ever creating chunked arrays (and was happy to accept eager loading). (#9038)

It seems we have multiple different issues and PRs asking for the same thing here, a way to prevent breaking changes (i.e. changing the default to {}), and a longer-term ideal plan (implementing lazy concatenation and changing the default to None with a deprecation cycle). I suggest we just move forward with merging this.

@dcherian dcherian requested a review from shoyer May 24, 2024 02:26
xarray/tests/test_backends.py Outdated Show resolved Hide resolved
xarray/tests/test_backends.py Outdated Show resolved Hide resolved
xarray/tests/test_backends.py Outdated Show resolved Hide resolved
xarray/tests/test_backends.py Outdated Show resolved Hide resolved
@TomNicholas
Copy link
Member

So we (@shoyer and @dcherian) discussed this in the dev meeting call just now, and I think the conclusion was that:

  • Fixing this bug to make chunks=None not use dask, and therefore eagerly concatenate arrays, would be a breaking change for anyone who is currently passing chunks=None and getting lazy behaviour. Whilst the current behaviour is very misleading, to be changed this should still have a deprecation cycle.
  • The alternative suggestion to make chunks={} the default would create an inconsistency between the defaults of open_dataset and open_mfdataset.
  • The only reason any of this is a problem is because we don't yet have xarray-native lazy concatenation (see Lazy concatenation of arrays #4628).
  • Therefore the eventual solution should involve implementing lazy concatenation, and having all the defaults and meaning of args be the same between open_dataset and open_mfdataset.
  • In fact this problem of "which library is handling lazy array operations" is complicated enough that it probably deserves its own argument - it also crosses over with the ChunkManager feature. The chunks kwargs is also kind of overloaded - it should just mean "what shape chunks do I want?".
  • But we do want an option to not use dask in open_mfdataset. So @shoyer's suggestion was to add another new argument to open_mfdataset that controls either whether or not to expect lazy behaviour or which array type is being used to represent the arrays to be concatenated. That way we can enable users to opt-in to eager loading, but keep the same set of defaults that we want in the long term, and have a deprecation cycle for changing the meaning of (/perhaps even just removing?) chunks=None.

I'm not quite clear on what the explicit suggestion for this new kwarg would be though... Or whether it can instead be a special ChunkManager? (e.g. chunked_array_type='numpy')

@dcherian
Copy link
Contributor

dcherian commented Jun 5, 2024

Excellent summary @TomNicholas .

I think we also agreed on changing the default in the signature of open_mfdataset to chunks={} but continue to treat chunks=None as synonymous with chunks={} for now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
6 participants