
fix distributed writes #1793

Merged: jhamman merged 40 commits into pydata:master from feature/distributed_writes on Mar 10, 2018

Conversation

@jhamman (Member) commented Dec 19, 2017

Right now, I've just modified the dask distributed integration tests so we can all see the failing tests.

I'm happy to push this further, but I thought I'd first see if either @shoyer or @mrocklin has an idea of where to start.

@shoyer (Member) commented Dec 19, 2017

yes, see #1464 (comment)

@mrocklin (Contributor) commented:

The zarr test seems a bit different. I think your issue here is that you are trying to use synchronous API with the async test harness. I've changed your test and pushed to your branch (hope you don't mind). Relevant docs are here: http://distributed.readthedocs.io/en/latest/develop.html#writing-tests

Async testing is nicer in many ways, but does require you to be a bit familiar with the async/tornado API. I also suspect that operations like to_zarr really aren't yet async friendly.
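For reference, a minimal sketch of the async test style those docs describe, assuming a current version of distributed; the body here is a generic placeholder rather than the actual zarr test:

from distributed.utils_test import gen_cluster

@gen_cluster(client=True)
async def test_submit_roundtrip(c, s, a, b):
    # c is an asynchronous Client, s the scheduler, a and b two workers.
    # Inside this harness, client operations return awaitables instead of
    # blocking, so results are retrieved with `await` rather than `.result()`.
    future = c.submit(lambda x: x + 1, 41)
    result = await future
    assert result == 42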

@jhamman added this to the 0.10.1 milestone on Jan 2, 2018
@jhamman mentioned this pull request on Jan 11, 2018
with self.datastore.ensure_open(autoclose=True):
    data = self.get_array()
    data[key] = value

Member Author:

@shoyer, is this what you were describing in #1464 (comment)?

Member:

Yes, this looks right to me.

@jhamman (Member Author) commented Jan 11, 2018

@mrocklin -

I have a test failing here with a familiar message.

E       TypeError: 'Future' object is not iterable

We saw this last week when debugging some pangeo things. Can you remind me what our solution was?

@mrocklin (Contributor) commented:

I don't know; I would want to look at the failing case locally. I can try to do this in the near term, no promises though :/

@jhamman (Member Author) commented Jan 25, 2018

I've just taken another swing at this and come up empty. I'm open to ideas in the following areas:

  1. scipy backend is failing to roundtrip a length 1 datetime array: https://travis-ci.org/pydata/xarray/jobs/333068098#L4504
  2. scipy, netcdf4, and h5netcdf backends are all failing inside dask-distributed: https://travis-ci.org/pydata/xarray/jobs/333068098#L4919

The good news here is that only 8 tests are failing after applying the array wrapper so I suspect we're quite close. I'm hoping @shoyer may have some ideas on (1) since I think he had implemented some scipy workarounds in the past. @mrocklin, I'm hoping you can point me in the right direction.

All of these tests are reproducible locally.

(BTW, I have a use case that is going to need this functionality so I'm personally motivated to see it across the finish line)

@@ -55,6 +55,18 @@ def __getitem__(self, key):
            copy = self.datastore.ds.use_mmap
            return np.array(data, dtype=self.dtype, copy=copy)

    def __setitem__(self, key, value):
        with self.datastore.ensure_open(autoclose=True):
            data = self.get_array()
Member:

This needs to be self.datastore.ds.variables[self.variable_name] (a netcdf_variable object), not self.get_array(), which returns the .data (a numpy array). You can't enlarge a numpy array, but you can enlarge a scipy netcdf variable.

(This manifests itself when writing a netcdf file with a time dimension of length 0. Xarray then crashes when attempting to decode a length-0 time variable, which is an unrelated bug.)
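A minimal sketch of the change being suggested, based on the snippet above; attribute names such as variable_name are taken from the surrounding code and may not match the final PR exactly:

def __setitem__(self, key, value):
    with self.datastore.ensure_open(autoclose=True):
        # Index the scipy netcdf_variable itself rather than its .data:
        # the variable can grow along an unlimited dimension, while the
        # underlying numpy array cannot be enlarged in place.
        data = self.datastore.ds.variables[self.variable_name]
        data[key] = value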

Member Author:

Thanks @shoyer - this appears to fix the scipy issue.

@rabernat (Contributor) commented:

Kudos for pushing this forward. I don't have much help to offer, but I wanted to recognize your effort... this is hard stuff!

@shoyer (Member) commented Jan 25, 2018

Has anyone successfully used dask.array.store() with the distributed scheduler?

@mrocklin (Contributor) commented:

I can take a look at the future not iterable issue sometime tomorrow.

Has anyone successfully used dask.array.store() with the distributed scheduler?

My guess is that this would be easy with a friendly storage target. I'm not sure though. cc @jakirkham who has been active on this topic recently.

@jakirkham commented:

Yep, I'm using dask.array.store regularly with the distributed scheduler, both on our cluster and in a local Docker image for testing. I'm using Zarr Arrays as the targets for store to write to. Basically, rechunk the data to match the chunking selected for the Zarr Array and then write out in parallel, lock-free.

Our cluster uses NFS for things like home directories, so these are accessible across nodes. There are also other types of storage available that are a bit faster and still remain accessible across nodes, so those work pretty well.
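A hedged sketch of the pattern @jakirkham describes; the store path, shape, and chunk sizes are made up for illustration:

import dask.array as da
import zarr

x = da.random.random((10000, 10000), chunks=(1000, 1000))

# Create a Zarr array whose chunking matches the dask chunks.
z = zarr.open('example.zarr', mode='w', shape=x.shape,
              chunks=(1000, 1000), dtype=x.dtype)

# With matching chunks each task writes its own Zarr chunk, so the write
# can run lock-free on the distributed scheduler.
da.store(x, z, lock=False)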

@jhamman (Member Author) commented Jan 26, 2018

Yes, the zarr backend here in xarray is also using dask.array.store and seems to work with distributed just fine.
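For context, a rough sketch of that code path from the user side, assuming distributed is installed; the dataset contents and store name are invented:

import numpy as np
import xarray as xr
from dask.distributed import Client

client = Client()  # the distributed scheduler becomes the default

ds = xr.Dataset({'t': (('x',), np.arange(10))}).chunk({'x': 5})
ds.to_zarr('example_store.zarr')  # writes chunks via dask.array.store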

@rabernat (Contributor) commented Jan 26, 2018

I have definitely used the distributed scheduler with dask.array.store both via Zarr and via a custom store class I wrote: https://gist.github.com/rabernat/e54755e7de4eb5a93cc4e7f9f903e3cc

But I cannot recall if I ever got it to work with netCDF.

@jhamman (Member Author) commented Jan 28, 2018

xref: #798 and dask/dask#2488, which both seem to be relevant to this discussion.

I also remember that @pwolfram was quite involved with the original distributed integration, so I'm pinging him to see if he is interested in this.

@shoyer (Member) commented Feb 2, 2018

Looking into this a little bit, this looks like a dask-distributed bug to me. Somehow Client.get() is returning a tornado.concurrent.Future object, even though sync=True.
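For reference, the expected behavior being described, as a self-contained snippet; the task graph here is a toy example:

from dask.distributed import Client

client = Client(processes=False)
dsk = {'x': 1, 'y': (lambda v: v + 1, 'x')}

# With sync=True (the default for the dask get interface), Client.get
# should return concrete values, not tornado Future objects.
result = client.get(dsk, 'y')
assert result == 2
client.close()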

if self._autoclose and not self._isopen:

if autoclose is None:
    autoclose = self._autoclose
Member Author:

This could probably use some additional thinking.

@jhamman (Member Author) commented Feb 28, 2018

I've added some additional tests and cleaned up the implementation a bit. I'd like to get reviews from a few folks and hopefully get this merged later this week.


# Question: Should we be dropping one of these two locks when they are
# basically the same? For instance, when using netcdf4 and dask is not
# installed, locks will be [threading.Lock(), threading.Lock()]
Member:

I think this is harmless, as long as these are different lock instances.

On the other hand, something like CombinedLock([lock, lock]) would never be satisfied, because the same non-reentrant lock can't be acquired twice.
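A tiny illustration of why, assuming CombinedLock simply acquires each of its member locks in turn:

import threading

lock = threading.Lock()
lock.acquire()

# A second acquire of the same non-reentrant lock can never succeed while
# the first is held, so CombinedLock([lock, lock]) could never be entered.
assert lock.acquire(timeout=1) is False

lock.release()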

# per file lock
# Dask locks take a name argument (e.g. filename)
locks.append(SchedulerLock(path_or_file))
except TypeError:
Member:

It would be less error prone to pass the name to get_scheduler_lock and have it return a lock instance.
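Something along these lines, as a hypothetical sketch; the helper name get_scheduler_lock comes from the suggestion above, and the actual signature in the PR may differ:

import threading

def get_scheduler_lock(scheduler, path_or_file=None):
    # Return an instantiated lock appropriate for the active scheduler,
    # so callers never need to know whether the lock type accepts a name.
    if scheduler == 'distributed':
        from dask.distributed import Lock as DistributedLock
        return DistributedLock(path_or_file)
    else:
        return threading.Lock()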

    return any(lock.locked for lock in self.locks)

def __repr__(self):
    return "CombinedLock(%s)" % [repr(lock) for lock in self.locks]
Member:

Nit: I think you could equivalently substitute "CombinedLock(%r)" % list(self.locks) here.



# Does this belong elsewhere?
os.environ["HDF5_USE_FILE_LOCKING"] = "FALSE"
Member:

It would be better to use a context manager or decorator on the test, something along the lines of https://stackoverflow.com/questions/2059482/python-temporarily-modify-the-current-processs-environment

Member Author:

Good idea. I think we can actually do this with pytest's monkeypatch.
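A minimal sketch of the pytest/monkeypatch approach; the test name and assertion are placeholders:

import os

def test_hdf5_file_locking_disabled(monkeypatch):
    # monkeypatch restores the environment when the test finishes, so the
    # setting does not leak into other tests or the parent process.
    monkeypatch.setenv('HDF5_USE_FILE_LOCKING', 'FALSE')
    assert os.environ['HDF5_USE_FILE_LOCKING'] == 'FALSE'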

@rabernat (Contributor) left a comment:

This was pretty heavy duty! Nice work Joe!

@@ -38,6 +38,13 @@ Documentation
Enhancements
~~~~~~~~~~~~

- Support for writing netCDF files from xarray datastores (scipy and netcdf4 only)
Contributor:

Should this be "to xarray datastores"?

Contributor:

Nevermind, I think it makes sense as is.

Maybe "Support for writing xarray datasets to netCDF files..."

Member:

It's nice to see you were able to get this to work with SciPy!

@@ -356,6 +356,8 @@ def prepare_variable(self, name, variable, check_encoding=False,

        fill_value = _ensure_valid_fill_value(attrs.pop('_FillValue', None),
                                              dtype)
        if variable.encoding == {'_FillValue': None} and fill_value is None:
            variable.encoding = {}
Contributor:

Does this fix a specific issue?

Member Author:

IIRC, this crept in from #1869.

Contributor:

Could this fix #1955?


import pytest

dask = pytest.importorskip('dask') # isort:skip
distributed = pytest.importorskip('distributed') # isort:skip

from dask import array
from dask.distributed import Client, Lock
Member Author:

@mrocklin - would you mind looking at the test implementation I have here and letting us know if you see anything that would be causing the default (global) dask scheduler to be permanently overridden? In #1971, I pointed to some test failures that appear to be coming from an unexpected scheduler.

Contributor:

Everything seems fine to me. What is dask._config.globals['get']? Is it the get method of a client? To debug, you might consider giving each of your clients a name

with Client(s['address'], name='test-foo') as client:

and then seeing which one isn't getting cleaned up. You can also try distributed.client.default_client(). We clean things up in the __exit__() call though, so as long as you're using context managers or @gen_cluster everything should work fine.
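Expanding that advice into a runnable snippet, with an in-process client used purely for illustration:

import distributed
from dask.distributed import Client

with Client(processes=False, name='test-foo') as client:
    pass  # run the test body here

# After __exit__, no client should remain registered as the default.
try:
    distributed.client.default_client()
except ValueError:
    print('no default client registered, as expected')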

"""

def __init__(self, locks):
self.locks = tuple(set(locks)) # remove duplicates
Member Author:

Previous test failures were having trouble in __enter__ when iterating over a set of locks. Casting to a list/tuple seems to have resolved that.

Member:

Huh. I wonder if non-deterministic ordering of set iteration (e.g., after serialization/deserialization) contributed to that.

@jhamman (Member Author) commented Mar 8, 2018

All the tests are passing here. I would appreciate another round of reviews.

@shoyer - all of your previous comments have been addressed.

@shoyer (Member) left a comment:

This all looks good to me now.

Nice work tracking down everything that could go wrong here!

"""

def __init__(self, locks):
self.locks = tuple(set(locks)) # remove duplicates
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Huh. I would if non-deterministic ordering of set iteration (e.g., after serialization/unserialization) contributed to that.

@jhamman (Member Author) commented Mar 9, 2018

Any final comments on this? If not, I'll probably merge this in the next day or two.

@shoyer mentioned this pull request on Mar 9, 2018
@jhamman mentioned this pull request on Mar 10, 2018
@jhamman merged commit 2f590f7 into pydata:master on Mar 10, 2018
@jhamman deleted the feature/distributed_writes branch on March 10, 2018
return nc4_var, variable.data
target = NetCDF4ArrayWrapper(name, self)

return target, variable.data
Member:

@jhamman
This is too late, but I think nc4_var is never used. Is that correct?

Member Author:

@fujiisoup - it is used in line 405 (nc4_var.setncattr(k, v)).
