WIP: Aggregate received data into a large allocation #3453

Closed
wants to merge 6 commits into from

Conversation

jakirkham
Member

Note: This is still being worked on, so there may be bugs, test failures, bad behavior, etc. At this stage, all of these should be expected. You have been warned. 😉


As we have seen some issues due to churn and fragmentation from small allocations, this tries to allocate one large buffer up front before receiving. This buffer is then split up into small views onto that larger allocation. Once this setup is done we receive into all of these views. The views are then passed to deserialization much as the smaller buffers were before.

We take views onto the data so send and deserialization methods can work on data they are familiar with. A follow-up to this work might be to figure out how deserialization and receive can work on larger buffers to start with, to avoid this split of the buffer into views. Though the intent is not to require this at this stage and only to follow up on it as needed later.
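To make the shape of this concrete, here is a rough sketch of the allocate-then-view idea for the host side (`host_frames` and `frame_nbytes` are illustrative names, not the actual comm code; the device path would do the analogous thing with one large device allocation):

```python
def host_frames(frame_nbytes):
    """One up-front host allocation, carved into per-frame views.

    Sketch only: ``frame_nbytes`` (the list of frame sizes) would
    normally come from the message header.
    """
    big = memoryview(bytearray(sum(frame_nbytes)))  # single allocation
    views = []
    offset = 0
    for nbytes in frame_nbytes:
        views.append(big[offset:offset + nbytes])  # zero-copy view
        offset += nbytes
    return views
```

Each view is then filled by a receive and handed to deserialization unchanged, so downstream code still sees ordinary buffer-like objects.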

To help illustrate the motivation of this change, assume we are receiving a DataFrame. Previously we would create multiple allocations for each column along with their relevant internal data (masks, indices, etc.). With this change, we should make one host (and one device) allocation for the whole DataFrame and then fill in its contents.

As a bonus of performing this allocation up front and parceling it into the relevant views, we are able to get rid of the old looping behavior around each receive and instead hand all of these to asyncio. This could potentially (though this has yet to be verified) get more out of the available bandwidth and benefit from things like non-blocking sends.
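Roughly, instead of awaiting each receive in a loop, the receives over the views can be gathered (a sketch under the assumption that the endpoint's `recv` accepts a pre-allocated buffer, in the style of UCX-Py; the TCP path would need its own read-into primitive):

```python
import asyncio

async def recv_frames(ep, views):
    """Kick off all frame receives at once and let asyncio drive them.

    Sketch only: ``ep.recv(buffer)`` is assumed to fill the given
    pre-allocated buffer and complete asynchronously.
    """
    await asyncio.gather(*(ep.recv(view) for view in views))
    return views
```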

Have added this change to both TCP and UCX: the former to aid in testing and discussion, the latter because it is of interest for improving performance in that use case. Though hopefully both will get a boost from this work.

As we need to index into the object and `DeviceBuffer` currently lacks a
way to index it, go ahead and coerce `DeviceBuffer`s to Numba
`DeviceNDArray`s. This way we can be sure we will be able to index it
when needed.
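Roughly what this coercion amounts to (a sketch assuming the device buffer exposes `__cuda_array_interface__`, so Numba can wrap it without copying; `as_indexable` is an illustrative name):

```python
import numba.cuda

def as_indexable(device_buf):
    """Wrap a device buffer so it can be sliced into views.

    Sketch only: relies on ``device_buf`` exposing
    ``__cuda_array_interface__`` so Numba can view it as a
    ``DeviceNDArray`` without a copy.
    """
    return numba.cuda.as_cuda_array(device_buf)
```
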
@mrocklin
Member

mrocklin commented Feb 7, 2020

Interesting idea. I would be curious to see what the performance differences are like. My hope would be that this is not necessary with TCP, and wouldn't have much of an effect with UCX if we're using RMM, but I wouldn't be surprised if my intuition was wrong here.

@jakirkham
Member Author

jakirkham commented Feb 7, 2020

The issue is that RMM uses CNMEM under the hood, which is known to have issues freeing lots of memory (like many small chunks). Keith mentioned this and added some numbers to back this up. So this effectively cuts down on the number of times free needs to be called per send, which should improve overall memory allocation performance. Of course some of this occurs on the critical path, but it also happens in other places too 😉

As to TCP, it really depends on how the Python memory manager works (given we are using bytearray). That said, we might want to change that to optionally use NumPy to benefit from the HUGEPAGES work ( numpy/numpy#14216 ), at which point we would want to know how NumPy handles memory management. My guess is it would be biased towards handling large allocations efficiently, but I could always be wrong about this 🙂 In any event this is largely orthogonal and should be saved for a separate discussion thread (feel free to open one if it is of interest). For now TCP is just handy for fixing bugs in my approach 😄
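For concreteness, the optional NumPy-backed allocation mentioned above might look something like this sketch (`host_buffer` and its flag are hypothetical, not anything in distributed today):

```python
import numpy as np

def host_buffer(nbytes, use_numpy=False):
    """Return a writable buffer of ``nbytes`` bytes for receiving into.

    Sketch only: a NumPy-backed allocation may benefit from the hugepage
    support added in numpy/numpy#14216, while ``bytearray`` goes through
    the regular Python allocator.
    """
    if use_numpy:
        return memoryview(np.empty(nbytes, dtype="u1"))
    return memoryview(bytearray(nbytes))
```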

I can't say at this point. Still sorting out bugs.

It's worth noting that this may affect things outside of communication itself, since it changes how allocations are done, which will affect other allocations that may not be in the communication path (like spilling or creating intermediates, for example). So a simple benchmark of UCX with/without this change is likely to miss many things or possibly focus on the wrong things. This is all conjecture, but I do want to make sure we are aware of the forest and not just one tree in it. 🙂

To summarize, the idea of the work here is to cut down the number of times alloc/free are called by some factor and then see how overall workflows are impacted.

Anyways, I probably wouldn't follow this too closely for now as it is still WIP. Trying not to distract people while things are still in an early stage 😉

@mrocklin
Member

mrocklin commented Feb 7, 2020 via email

@mrocklin
Member

Checking in here. Is this still an active effort? No worries either way, I'm just going through old PRs and seeing what we can close out.

@jakirkham
Member Author

I have had my hands full, but yes this remains important.

@jakirkham
Member Author

Closing out in favor of PR ( #3732 ).

@jakirkham jakirkham closed this Apr 21, 2020
@jakirkham jakirkham deleted the agg_recv branch April 21, 2020 02:04
@jakirkham
Member Author

Somewhat related to the approach that was tried in PR ( #5487 ), except that one looked at write in addition to read.
