
Disable automatic cache with dask #1024

Merged
merged 16 commits into pydata:master from crusaderky:no_dask_resolve on Nov 14, 2016
Conversation

@crusaderky (Contributor) commented on Oct 1, 2016

Disabled auto-caching on dask; new .compute() method

Fixes #902
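For reference, the explicit pattern this enables (a minimal sketch; the file and dimension names are invented for illustration):

import xarray as xr

ds = xr.open_dataset('data.nc').chunk({'time': 100})  # stays dask-backed, no implicit caching
result = (ds * 2).compute()  # explicitly evaluate into memory, returning a new object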

@crusaderky (Author) commented on Oct 1, 2016

Well, crud. This introduces a regression where DataArray.chunk() converts the data and the coords to dask. This becomes enormously problematic later on as pretty much nothing expects a dask-based coord.

[edit] fixed below

@shoyer (Member) left a comment


Well, crud. This introduces a regression where DataArray.chunk() converts the data and the coords to dask. This becomes enormously problematic later on as pretty much nothing expects a dask-based coord.

I agree that this could make sense, but I'd like to understand in a little more detail. Isn't this how things work already, even before this PR?

I agree that turning an existing pandas.Index (dimension labels) into a chunked dask array is probably not desirable, but it is less obvious to me that .chunk() should not apply to other coordinates.

I am also a little concerned about how this would affect existing coordinate variables that are already dask arrays. If a coordinate is already chunked, then calling .chunk() with a new chunksize should probably change it.

Either way, any changes here should be documented under "Breaking changes" in what's new.
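To make the cases concrete, a hypothetical example (all names invented for illustration):

import numpy as np
import xarray as xr

ds = xr.Dataset(
    {'temp': (('x',), np.arange(4.0))},
    coords={'x': np.arange(4),                          # dimension label (pandas.Index)
            'lat': (('x',), np.linspace(0.0, 1.0, 4))})  # non-index coordinate

chunked = ds.chunk({'x': 2})
# The open question: should chunked['lat'] become (or stay) a dask array, and be
# re-chunked if it already was one, while chunked['x'] stays in memory?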

         if v.chunks is not None:
             new_chunks = list(zip(v.dims, v.chunks))
             if any(chunk != chunks[d] for d, chunk in new_chunks
                    if d in chunks):
                 raise ValueError('inconsistent chunks')
             chunks.update(new_chunks)
+        if chunks:
@shoyer (Member) commented on Oct 3, 2016


Why should this need chunks to not be empty already? That seems strange (maybe backwards) to me.

I might simply make this:

for dim, size in self.dims.items():
    if dim not in chunks:
        chunks[dim] = (size,)

@crusaderky (Author)


If none of the data_vars use the dask backend, then you want chunks to return None.

@shoyer (Member)


I guess this method is inconsistent with Variable.chunks, but it currently always returns a dict.

I would either skip this change or use something like my version.

@crusaderky (Author) commented on Oct 3, 2016

What happened before this PR was that all coords were blindly converted to dask on chunk(). Then, the first time anything invoked the values property (something as simple as DataArray.__str__), they were silently converted back to numpy. It wasn't easy to accidentally get them in dask format; in fact, no unit test noticed before my last commit.

If you deliberately use a dask array as a coord, it won't be converted to numpy. However, I can't think of any reason why anybody would want to do that in practice.
I'll add it to the breaking changes: if somebody did do the above, the performance of their program will degrade with this release, as their coord risks being evaluated multiple times.

@crusaderky (Author)

I added the disclaimer in the release notes.
Is there any other outstanding issue?
Thanks

@crusaderky (Author)

I can't reproduce the above test failure, test_conventions.TestEncodeCFVariable.test_missing_fillvalue.
I suspect it might be a random failure, because:

  • it used to succeed until my latest commit, which exclusively changed the readme
  • it succeeds on Python 3

@shoyer (Member) left a comment


Apologies for the delay here -- my comments were stuck as a "pending" GitHub review.

I am still wondering what the right behavior is for variables used as indexes. (These can be dask arrays, too.)

I think there is a good case for skipping these variables in .chunk(), but we probably do want to make indexes still cache as pandas.Index objects, because otherwise repeated evaluation of dask arrays to build the index for alignment or indexing gets expensive.

@@ -792,13 +806,19 @@ def chunks(self):
         array.
         """
         chunks = {}
-        for v in self.variables.values():
+        for v in self.data_vars.values():
@shoyer (Member) commented on Oct 12, 2016


I am concerned about skipping non-data_vars here. Coordinates could still be chunked, e.g., if they were loaded from a file, or created directly from dask arrays.


@@ -851,6 +871,9 @@ def selkeys(dict_, keys):
         return dict((d, dict_[d]) for d in keys if d in dict_)

     def maybe_chunk(name, var, chunks):
+        if name not in self.data_vars:
@shoyer (Member)


I see your point about performance, but I think that mostly holds true for indexes. So I would be inclined to adjust this to only skip variables in self.dims (aka indexes used for alignment).

I am still concerned about skipping coords if they are already dask arrays. If they are already dask arrays, then .chunk() should probably adjust their chunks anyways.

@crusaderky (Author)

I've been thinking about this... Maybe the simple, clean solution is to simply invoke compute() on all coords as soon as they are assigned to the DataArray / Dataset?


@crusaderky (Author)

ping - how do you prefer me to proceed?

@shoyer (Member) commented on Oct 21, 2016

I've been thinking about this... Maybe the simple, clean solution is to simply invoke compute() on all coords as soon as they are assigned to the DataArray / Dataset?

I'm nervous about eager loading, especially for non-index coordinates. They can have more than one dimension, and thus can contain a lot of data. So potentially eagerly loading non-index coordinates could break existing use cases.

On the other hand, non-index coordinates are indeed checked for equality in most xarray operations (e.g., for the coordinate merge in align). So it is indeed useful not to have to recompute them all the time.

Even eagerly loading indexes is potentially problematic, if loading the index values is expensive.

So I'm conflicted:

  • I like the current caching behavior for coords and indexes
  • But I also want to avoid implicit conversions from dask to numpy, which is problematic for all the reasons you pointed out earlier

I'm going to start throwing out ideas for how to deal with this:

Option A

Add two new (public?) methods, something like .load_coords() and .load_indexes(). We would then insert calls to these methods at the start of each function that uses coordinates:

  • .load_indexes(): reindex, reindex_like, align and sel
  • .load_coords(): merge and anything that calls the functions in core/merge.py (this indirectly includes Dataset.__init__ and Dataset.__setitem__)

Hypothetically, we could even have options for turning this caching systematically on/off (e.g., with xarray.set_options(cache_coords=False, cache_indexes=True): ...).

Your proposal is basically an extreme version of this, where we call .load_coords() immediately after constructing every new object.

Advantages:

  • It's fairly predictable when caching happens (especially if we opt for calling .load_coords() immediately, as you propose).
  • Computing variables is all done at once, which is much more performant than what we currently do, e.g., loading variables as needed for .equals() checks in merge_variables one at a time.

Downsides:

  • Caching is more aggressive than necessary -- we cache indexes even if that coord isn't actually indexed.

Option B

Like Option A, but somehow infer the full set of variables that need to be cached (e.g., in a .merge() operation) before the operation is actually done. This seems hard, but maybe is possible using a variation of merge_variables.

This solves the downside of A, but diminishes the predictability. We're basically back to how things work now.

Option C

Cache dask.array in IndexVariable but not Variable. This preserves performance for repeated indexing, because the hash table behind the pandas.Index doesn't get thrown away.

Advantages:

  • Much simpler and easier to implement than the alternatives.
  • Implicit conversions are greatly diminished.

Downsides:

  • Non-index coordinates get thrown away after being evaluated once. If you're doing lots of operations of the form [ds + other for ds in datasets] where ds and other have conflicting coordinates, this would probably make you unhappy.

Option D

Load the contents of an IndexVariable immediately and eagerly. They no longer cache data or use lazy loading.

This has the most predictable performance, but might cause trouble for some edge use cases?


I need to think about this a little more, but right now I am leaning towards Option C or D.

@crusaderky (Author)

Hi Stephan,
Thank you for your thinking. IMHO option D is the cleanest and safest. Could you come up with any example where it may be problematic?


@shoyer (Member) commented on Oct 25, 2016

I'm going to ping the mailing list for input, but I think it would be pretty safe.


@benbovy (Member) commented on Nov 4, 2016

Option D seems indeed the cleanest and safest option, but

Even eagerly loading indexes is potentially problematic, if loading the index values is expensive.

I can see use cases where this might happen. For example, it is common for 1-d, 2-d, or higher-dimension unstructured meshes that the coordinates x, y, z are arranged as 1-d arrays whose length equals the number of nodes (which can be very large!). See for example the ugrid conventions.

I admit that currently xarray is perhaps not very suited for handling unstructured meshes, but IMO it has great potential (especially considering multi-index support) and I'd love to use it here.

@shoyer (Member) commented on Nov 4, 2016

I admit that currently xarray is perhaps not very suited for handling unstructured meshes, but IMO it has great potential (especially considering multi-index support) and I'd love to use it here.

Right now, xarray is not going to be a great fit for such cases, because we already cache an index in memory for any labeled indexing operations. So at best, you could do something like ds.isel(mesh_edge=slice(int(1e6))). Maybe people already do this?

I doubt very many people are relying on this, though indeed, this would include some users of an array database we wrote at my former employer, which did not support different chunking schemes for different variables (which could make coordinate lookup very slow). CC @ToddSmall in case he has opinions here.

For out-of-core operations with labels on big unstructured meshes, you really need a generalization of the pandas.Index that doesn't need to live in memory (or maybe that lives in memory on some remote server).

@benbovy (Member) commented on Nov 5, 2016

we already cache an index in memory for any labeled indexing operations

Oh yes, true!

So at best, you could do something like ds.isel(mesh_edge=slice(int(1e6)))

Indeed, that doesn't look very nice.

For out-of-core operations with labels on big unstructured meshes, you really need a generalization of the pandas.Index that doesn't need to live in memory

From what I intend to do next with xarray, I'd say that extending its support for out-of-core operations to big indexes would be a great feature! I haven't yet looked at how dask.DataFrame works internally (including dask.DataFrame.index and dask.DataFrame.loc), but I guess maybe this could be transposed in some way to the indexing logic in xarray? Though I'm certainly missing a lot of potential issues here... Anyway, I can open a new issue to discuss this further if you think it's worth it.

@shoyer (Member) commented on Nov 5, 2016

Anyway, I can open a new issue to discuss more about this if you think it's worth it.

Yes, please do!

@crusaderky I think we are OK going ahead here with Option D. If we do eventually extend xarray with out of core indexes, that will be done with a separate layer (not in IndexVariable).

@crusaderky (Author)

roger that, getting to work :)

@shoyer (Member) commented on Nov 6, 2016

Awesome, thanks for your help!


Conflicts:
	xarray/test/test_dask.py
Eagerly cache only IndexVariables (i.e. coords that are in dims). Coords that are not in dims are chunked and not cached.
@crusaderky (Author)

Finished and waiting for code review

@@ -1076,10 +1101,16 @@ def __init__(self, dims, data, attrs=None, encoding=None, fastpath=False):
                              type(self).__name__)

     def _data_cached(self):
         if not isinstance(self._data, PandasIndexAdapter):
             self._data = PandasIndexAdapter(self._data)
+        # Unlike in Variable._data_cached, always eagerly resolve dask arrays
@shoyer (Member) commented on Nov 13, 2016


I thought we wanted to always eagerly load IndexVariable objects into memory without caching at all?

That would suggest we should put something like self._data = PandasIndexAdapter(self._data) in the constructor, and make _data_cached and _data_cast on the subclass dummy methods.
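Something along these lines (a rough sketch, simplified from the actual class and assuming the existing PandasIndexAdapter wrapper):

class IndexVariable(Variable):
    def __init__(self, dims, data, attrs=None, encoding=None, fastpath=False):
        super(IndexVariable, self).__init__(dims, data, attrs, encoding,
                                            fastpath=fastpath)
        # eagerly resolve dask/lazy data into an in-memory pandas.Index
        if not isinstance(self._data, PandasIndexAdapter):
            self._data = PandasIndexAdapter(self._data)

    def _data_cached(self):
        # dummy: data is already in memory at construction time
        return self._data

    def _data_cast(self):
        # dummy: data is already a PandasIndexAdapter
        return self._data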

@shoyer changed the title from "No dask resolve" to "Disable automatic cache with dask" on Nov 13, 2016
@@ -874,6 +887,9 @@ def selkeys(dict_, keys):
         return dict((d, dict_[d]) for d in keys if d in dict_)

     def maybe_chunk(name, var, chunks):
+        if name in self.dims:
@shoyer (Member)


Actually, maybe put this logic in IndexVariable instead? We could define a chunk method that looks like:

def chunk(self, ...):
    return self.copy(deep=False)

Changed IndexVariables to eagerly load their data into memory (from disk or dask) as soon as they're created
@crusaderky (Author)

Changed to cache IndexVariable._data on init. Please review...

@shoyer (Member) left a comment


I have one minor suggestion for a test, but I'll fix that in a follow-on PR. This looks good to me, thanks!

        for k, v in actual.variables.items():
            # IndexVariables are eagerly cached
            if k in actual.dims:
                self.assertTrue(v._in_memory)
@shoyer (Member)


This would be slightly simpler just as self.assertEqual(v._in_memory, k in actual.dims)
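That is, the loop above would collapse to (sketch):

        for k, v in actual.variables.items():
            # in memory if and only if it is an IndexVariable (i.e., a dimension)
            self.assertEqual(v._in_memory, k in actual.dims)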

@shoyer shoyer merged commit d66f673 into pydata:master Nov 14, 2016
@shoyer (Member) commented on Nov 14, 2016

Thanks for your patience! This is a nice improvement.

I have an idea for a variation that might make for a cleaner (less dask-specific) way to handle the remaining caching logic -- I'll add you as a reviewer on that PR.

@crusaderky (Author)

Happy to contribute!


@crusaderky crusaderky deleted the no_dask_resolve branch November 15, 2016 21:25
@kynan commented on Nov 16, 2016

@crusaderky @shoyer There are still cases where dask arrays are converted to ndarrays where I think they shouldn't be: if you create a Variable with a dask array (e.g. in a custom data store), it gets wrapped into a LazilyIndexedArray at the end of decode_cf_variable (https://github.com/pydata/xarray/blob/master/xarray/conventions.py#L833). This will subsequently be cast to an ndarray in _data_cast.

@@ -277,10 +277,21 @@ def data(self, data):
                 "replacement data must match the Variable's shape")
         self._data = data

     def _data_cast(self):
         if isinstance(self._data, (np.ndarray, PandasIndexAdapter)):

Should this branch not also apply to dask_array_type?


In fact, if you manually create a Variable with a dask array you'll get a LazilyIndexedArray at this point. Should this not also be kept unchanged?
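For concreteness, the proposed branch might look like this (a sketch of the reviewer's suggestion, not merged code; dask_array_type is assumed to be xarray's internal tuple of dask array classes used for isinstance checks):

    def _data_cast(self):
        # keep ndarray-like, index, and dask-backed data unchanged instead of
        # forcing a cast to np.ndarray
        if isinstance(self._data,
                      (np.ndarray, PandasIndexAdapter) + dask_array_type):
            return self._data
        return np.asarray(self._data)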

shoyer added a commit to shoyer/xarray that referenced this pull request Nov 16, 2016
This is a follow-up to generalize the changes from pydata#1024:

- Caching and copy-on-write behavior has been moved to separate array classes
  that are explicitly used in `open_dataset` to wrap arrays loaded from disk (if
  `cache=True`).
- Dask specific logic has been removed from the caching/loading logic on
  `xarray.Variable`.
- Pickle no longer caches automatically under any circumstances.

Still needs tests for the `cache` argument to `open_dataset`, but everything
else seems to be working.
@shoyer (Member) commented on Nov 16, 2016

@kynan I think this is fixed in #1128, which has a slightly more robust solution.

@kynan commented on Nov 16, 2016

@shoyer Great, thanks, I'll give that a try.

shoyer added a commit that referenced this pull request Nov 30, 2016
* Disable all caching on xarray.Variable

This is a follow-up to generalize the changes from #1024:

- Caching and copy-on-write behavior has been moved to separate array classes
  that are explicitly used in `open_dataset` to wrap arrays loaded from disk (if
  `cache=True`).
- Dask specific logic has been removed from the caching/loading logic on
  `xarray.Variable`.
- Pickle no longer caches automatically under any circumstances.

Still needs tests for the `cache` argument to `open_dataset`, but everything
else seems to be working.

* Fixes for test failures

* Fix IndexVariable.load

* Made DataStores pickle-able

* Add dask.distributed test

* Fix failing Python 2 tests

* Fix failing test on Windows

* Alternative fix for windows issue

* yet another attempt to fix windows tests

* a different windows fix

* yet another attempt to fix test on windows

* another attempt at fixing windows

* Skip remaining failing test on windows only

* Allow file cleanup failures on windows