WIP: Aggregate received data into a large allocation #3453

Closed
wants to merge 6 commits into from

Conversation

jakirkham
Member

Note: This is still being worked on, so there may be bugs, test failures, bad behavior, etc. At this stage, all of these should be expected. You have been warned. 😉


As we have seen some issues due to churn and fragmentation from small allocations, this tries to allocate one large buffer up front before receiving. This buffer is then split up into small views onto that larger allocation. Once this setup is done we receive into all of these views. The views are then passed to deserialization much as the smaller buffers were before.

We take views onto the data so send and deserialization methods can work on data they are familiar with. A follow-up to this work might be to figure out how deserialization and receive can work on larger buffers to start with, to avoid this split of the buffer into views. Though the intent is not to require this at this stage and only to follow up on it as needed later.
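To make the shape of this concrete, here is a rough sketch of the allocate-then-view idea for the host side (`host_frames` and `frame_nbytes` are illustrative names, not the actual comm code; the device path would do the analogous thing with one large device allocation):

```python
def host_frames(frame_nbytes):
    """One up-front host allocation, carved into per-frame views.

    Sketch only: ``frame_nbytes`` (the list of frame sizes) would
    normally come from the message header.
    """
    big = memoryview(bytearray(sum(frame_nbytes)))  # single allocation
    views = []
    offset = 0
    for nbytes in frame_nbytes:
        views.append(big[offset:offset + nbytes])  # zero-copy view
        offset += nbytes
    return views
```

Each view is then filled by a receive and handed to deserialization unchanged, so downstream code still sees ordinary buffer-like objects.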

To help illustrate the motivation of this change, assume we are receiving a DataFrame. Previously we would create multiple allocations for each column along with their relevant internal data (masks, indices, etc.). With this change, we should make one host (and one device) allocation for the whole DataFrame and then fill in its contents.

As a bonus of performing this allocation up front and parceling it into the relevant views, we are able to get rid of the old looping behavior around each receive and instead hand all of these to asyncio. This could potentially (though this has yet to be verified) get more out of the available bandwidth and benefit from things like non-blocking sends.
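Roughly, instead of awaiting each receive in a loop, the receives over the views can be gathered (a sketch under the assumption that the endpoint's `recv` accepts a pre-allocated buffer, in the style of UCX-Py; the TCP path would need its own read-into primitive):

```python
import asyncio

async def recv_frames(ep, views):
    """Kick off all frame receives at once and let asyncio drive them.

    Sketch only: ``ep.recv(buffer)`` is assumed to fill the given
    pre-allocated buffer and complete asynchronously.
    """
    await asyncio.gather(*(ep.recv(view) for view in views))
    return views
```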

Have added this change to both TCP and UCX: the former to aid in testing and discussion, the latter because it is of interest for improving performance in that use case. Though hopefully both will get a boost from this work.

As we need to index into the object and `DeviceBuffer` currently lacks a
way to index it, go ahead and coerce `DeviceBuffer`s to Numba
`DeviceNDArray`s. This way we can be sure we will be able to index it
when needed.
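Roughly what this coercion amounts to (a sketch assuming the device buffer exposes `__cuda_array_interface__`, so Numba can wrap it without copying; `as_indexable` is an illustrative name):

```python
import numba.cuda

def as_indexable(device_buf):
    """Wrap a device buffer so it can be sliced into views.

    Sketch only: relies on ``device_buf`` exposing
    ``__cuda_array_interface__`` so Numba can view it as a
    ``DeviceNDArray`` without a copy.
    """
    return numba.cuda.as_cuda_array(device_buf)
```
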
@mrocklin
Member

mrocklin commented Feb 7, 2020

Interesting idea. I would be curious to see what the performance differences are like. My hope would be that this is not necessary with TCP, and wouldn't have much of an effect with UCX if we're using RMM, but I wouldn't be surprised if my intuition was wrong here.

@jakirkham
Member Author

jakirkham commented Feb 7, 2020

The issue is that RMM uses CNMEM under the hood, which is known to have issues freeing lots of memory (like many small chunks). Keith mentioned this and added some numbers to back this up. So this effectively cuts down on the number of times free needs to be called per send, which should improve overall memory allocation performance. Of course some of this occurs on the critical path, but it also happens in other places too 😉

As to TCP, it really depends on how the Python memory manager works (given we are using bytearray). That said, we might want to change that to optionally use NumPy to benefit from the HUGEPAGES work ( numpy/numpy#14216 ), at which point we would want to know how NumPy handles memory management. My guess is it would be biased towards handling large allocations efficiently, but I could always be wrong about this 🙂 In any event this is largely orthogonal and should be saved for a separate discussion thread (feel free to open one if it is of interest). For now TCP is just handy for fixing bugs in my approach 😄
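For concreteness, the optional NumPy-backed allocation mentioned above might look something like this sketch (`host_buffer` and its flag are hypothetical, not anything in distributed today):

```python
import numpy as np

def host_buffer(nbytes, use_numpy=False):
    """Return a writable buffer of ``nbytes`` bytes for receiving into.

    Sketch only: a NumPy-backed allocation may benefit from the hugepage
    support added in numpy/numpy#14216, while ``bytearray`` goes through
    the regular Python allocator.
    """
    if use_numpy:
        return memoryview(np.empty(nbytes, dtype="u1"))
    return memoryview(bytearray(nbytes))
```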

I can't say at this point. Still sorting out bugs.

It's worth noting that this may affect things outside of communication itself, since it changes how allocations are done, which will affect other allocations that may not be in the communication path (like spilling or creating intermediates, for example). So a simple benchmark of UCX with/without this change is likely to miss many things or possibly focus on the wrong things. This is all conjecture, but I do want to make sure we are aware of the forest and not just one tree in it. 🙂

To summarize, the idea of the work here is to cut down the number of times alloc/free are called by some factor and then see how overall workflows are impacted.

Anyways, I probably wouldn't follow this too closely for now as it is still WIP. Trying not to distract people while things are still in an early stage 😉

@mrocklin
Member

mrocklin commented Feb 7, 2020 via email

@mrocklin
Member

Checking in here. Is this still an active effort? No worries either way, I'm just going through old PRs and seeing what we can close out.

@jakirkham
Member Author

I have had my hands full, but yes this remains important.

@jakirkham
Member Author

Closing out in favor of PR ( #3732 ).

@jakirkham jakirkham closed this Apr 21, 2020
@jakirkham jakirkham deleted the agg_recv branch April 21, 2020 02:04
@jakirkham
Member Author

Somewhat related to the approach that was tried in PR ( #5487 ), except that one looked at write in addition to read.
