Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add UCX Comm #2591

Merged
merged 85 commits into from
May 31, 2019
Merged

Add UCX Comm #2591

merged 85 commits into from
May 31, 2019

Conversation

quasiben
Copy link
Member

@quasiben quasiben commented Apr 2, 2019

This PR brings in much of the work done by @TomAugspurger with ucx/ucx-py with @mrocklin's help I added a small change to cudf protocol handling and general cleanup.

distributed/comm/ucx.py Outdated Show resolved Hide resolved
@mrocklin
Copy link
Member

I think #2565 (comment) has the motivation, but I don’t recall details. Not sure about tests.

OK, for now I've turned off compression on all cuda data. That should stop us from splitting up large frames. It looks like currently there is a limit in ucx-py that keeps us under 2**31 bytes. At first this seems to be limited by the int type of the Message._length attribute. Changing that to long causes a segfault, so I'm probably missing something upstream in UCX.

I've also added logic to not call ensure_bytes and b''.join if there is only one element in that list.

distributed/core.py Outdated Show resolved Hide resolved
@mrocklin
Copy link
Member

This now works-ish (at least when combined with some of the work in rapidsai/dask-cuda#46). I would like to get this to a point where we could merge it somewhat quickly.

I don't mind things being a little rough if they are well isolated into files that aren't in the mainline code path (files like ucx.py, cuda.py, cudf.py and so on).

There are a few TODO's left in the actual logic that I suspect are left by @TomAugspurger . Is this something that you can look into this week Tom to see if they are still necessary?

Copy link
Member

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just gave a quick look through. I'm not sure if I'll have time this week to actually verify the TODOs yet; need to check on where we are for the next pandas release first.

General question: do we want users to provide the prefix ucx:// or ucp://?

distributed/comm/tests/test_ucx.py Outdated Show resolved Hide resolved
# Workaround for hanging test in
# pytest distributed/comm/tests/test_ucx.py::test_comm_objs -vs --count=2
# on the second time through.
ucp._libs.ucp_py.reader_added = 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Akshay-Venkatesh do you recall if this was resolved? Is it the same as https://github.com/Akshay-Venkatesh/ucx-py/issues/69, or different?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@TomAugspurger I tested this yesterday and the issue hasn't been resolved. This still has to be fixed.

distributed/comm/tests/test_ucx.py Outdated Show resolved Hide resolved
distributed/comm/tests/test_ucx.py Outdated Show resolved Hide resolved
distributed/comm/ucx.py Outdated Show resolved Hide resolved
distributed/comm/ucx.py Outdated Show resolved Hide resolved
distributed/comm/ucx.py Outdated Show resolved Hide resolved
distributed/comm/ucx.py Outdated Show resolved Hide resolved
distributed/comm/ucx.py Outdated Show resolved Hide resolved
distributed/core.py Outdated Show resolved Hide resolved
@mrocklin
Copy link
Member

Thanks for the feedback @TomAugspurger . I think I can handle everything, I mostly wanted to get your thoughts on some of the comments. Thanks!

@mrocklin mrocklin changed the title [WIP] UCX work UCX work May 30, 2019
@mrocklin mrocklin changed the title UCX work Add UCX Comm May 30, 2019
@mrocklin
Copy link
Member

I've removed the WIP label. Review appreciated. I think that this is safe to go in.

@mrocklin
Copy link
Member

OK, I'm merging this tomorrow if there are no further comments.

@TomAugspurger
Copy link
Member

TomAugspurger commented May 31, 2019 via email

@mrocklin mrocklin merged commit a8504d6 into dask:master May 31, 2019
@TomAugspurger
Copy link
Member

🎉

@mrocklin
Copy link
Member

Thank you @TomAugspurger, @quasiben, and @Akshay-Venkatesh for working on this. I'm sure that there is still plenty more to do here, but it will be nice to have this in master.

muammar added a commit to muammar/distributed that referenced this pull request Jun 12, 2019
* upstream/master: (58 commits)
  Add unknown pytest markers (dask#2764)
  Delay lookup of allowed failures. (dask#2761)
  Change address -> worker in ColumnDataSource for nbytes plot (dask#2755)
  Remove module state in Prometheus Handlers (dask#2760)
  Add stress test for UCX (dask#2759)
  Add nanny logs (dask#2744)
  Move some of the adaptive logic into the scheduler (dask#2735)
  Add SpecCluster.new_worker_spec method (dask#2751)
  Worker dashboard fixes (dask#2747)
  Add async context managers to scheduler/worker classes (dask#2745)
  Fix the resource key representation before sending graphs (dask#2716) (dask#2733)
  Allow user to configure whether workers are daemon. (dask#2739)
  Pin pytest >=4 with pip in appveyor and python 3.5 (dask#2737)
  Add Experimental UCX Comm (dask#2591)
  Close nannies gracefully (dask#2731)
  add kwargs to progressbars (dask#2638)
  Add back LocalCluster.__repr__. (dask#2732)
  Move bokeh module to dashboard (dask#2724)
  Close clusters at exit (dask#2730)
  Add SchedulerPlugin TaskState example (dask#2622)
  ...
muammar added a commit to muammar/distributed that referenced this pull request Jul 18, 2019
* upstream/master: (43 commits)
  Add unknown pytest markers (dask#2764)
  Delay lookup of allowed failures. (dask#2761)
  Change address -> worker in ColumnDataSource for nbytes plot (dask#2755)
  Remove module state in Prometheus Handlers (dask#2760)
  Add stress test for UCX (dask#2759)
  Add nanny logs (dask#2744)
  Move some of the adaptive logic into the scheduler (dask#2735)
  Add SpecCluster.new_worker_spec method (dask#2751)
  Worker dashboard fixes (dask#2747)
  Add async context managers to scheduler/worker classes (dask#2745)
  Fix the resource key representation before sending graphs (dask#2716) (dask#2733)
  Allow user to configure whether workers are daemon. (dask#2739)
  Pin pytest >=4 with pip in appveyor and python 3.5 (dask#2737)
  Add Experimental UCX Comm (dask#2591)
  Close nannies gracefully (dask#2731)
  add kwargs to progressbars (dask#2638)
  Add back LocalCluster.__repr__. (dask#2732)
  Move bokeh module to dashboard (dask#2724)
  Close clusters at exit (dask#2730)
  Add SchedulerPlugin TaskState example (dask#2622)
  ...
Comment on lines +179 to 181
header = bytes(header)
if header:
header = msgpack.loads(header, use_list=False, **msgpack_opts)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you recall why this was needed? Was this due to the if header line? Did msgpack.dumps need this? Or was it due to something else like potentially unusual types being passed in for header (like maybe a NumPy array)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I guess this is explained here ( 44c1d5c ).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants