[FEA] Function to "pack" a table into a single buffer #3793

Closed
jrhemstad opened this issue Jan 15, 2020 · 11 comments
Assignees
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code.

Comments

@jrhemstad
Contributor

jrhemstad commented Jan 15, 2020

Is your feature request related to a problem? Please describe.

I would like to be able to create a copy of a table_view where the storage for the constituent column is "packed" into a single allocation.

Describe the solution you'd like

An API something like this:

std::pair<rmm::device_buffer, table_view> pack(table_view const& t);

Where the returned device_buffer contains a copy of the contents of t in a single allocation and the returned table_view points into that buffer.

Additional context

This functionality is effectively already developed for contiguous_split in the alloc_and_copy function. The ask is just to wrap and expose this as a public API.
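The packing being requested can be sketched on the host with standard containers. This is a hypothetical illustration only, using `std::vector<std::byte>` in place of `rmm::device_buffer` and offset/size pairs in place of the column pointers a real `table_view` would carry; the names `span_view` and `pack` here are invented for the sketch, not libcudf API:

```cpp
#include <cstddef>
#include <cstring>
#include <utility>
#include <vector>

// A lightweight "view" into the packed buffer: an offset and size
// rather than a raw pointer, so it stays valid if the buffer moves.
struct span_view {
  std::size_t offset;
  std::size_t size;
};

// Copy several separately-allocated byte buffers into one contiguous
// allocation, returning the buffer plus views that index into it.
std::pair<std::vector<std::byte>, std::vector<span_view>>
pack(std::vector<std::vector<std::byte>> const& columns) {
  std::size_t total = 0;
  for (auto const& c : columns) total += c.size();

  std::vector<std::byte> buffer(total);
  std::vector<span_view> views;
  std::size_t offset = 0;
  for (auto const& c : columns) {
    if (!c.empty()) std::memcpy(buffer.data() + offset, c.data(), c.size());
    views.push_back({offset, c.size()});
    offset += c.size();
  }
  return {std::move(buffer), std::move(views)};
}
```

On the device the copies would be `cudaMemcpyAsync` calls into a single `rmm::device_buffer`, but the bookkeeping (running offset, one view per column) is the same.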

@jrhemstad jrhemstad added the feature request New feature or request label Jan 15, 2020
@felipeblazing

We would love to have this. Shouldn't we also have something like

std::unique_ptr<table> unpack(const rmm::device_buffer & buffer, unsigned long long size); 

@jrhemstad
Contributor Author

jrhemstad commented Jan 15, 2020

We would love to have this. Shouldn't we also have something like

std::unique_ptr<table> unpack(const rmm::device_buffer & buffer, unsigned long long size); 

I think it'd be better to return a table_view that points into the returned device_buffer. I updated the issue.

Then, if you want to create a non-packed table from a packed one, you could do this:

```
table t;

auto [buffer, view] = pack(t);

table t2(view); // constructing a `table` from the `view` copies the packed data into a new `table`
```

@harrism harrism added the libcudf Affects libcudf (C++/CUDA) code. label Feb 10, 2020
@jakirkham
Copy link
Member

This would also be useful from the UCX side as we could serialize a larger contiguous piece of memory and then unpack it later so that individual columns could be returned to RMM when no longer needed.

cc @quasiben

@jrhemstad
Contributor Author

@jakirkham there was some offline conversation with @jlowe @kkraus14 @nvdbaranec about how we'd serialize a "packed" table for communication.

The gist of it is, when you "pack" a table, you get a device_buffer and a table_view that points into that device_buffer. If you were to simply communicate the device_buffer and table_view as-is, the pointers in the table_view on the receiving side wouldn't be correct because they'd refer to memory locations on the sending side.

It's pretty easy to update the pointers on the receiving side if you know what the base pointers were on the sending side. It's just a bunch of pointer offset arithmetic.

We're trying to figure out a common set of serialize/deserialize primitives for libcudf that both Dask and Spark can use. Could you weigh in and help define what the serialized layout should look like?

@quasiben
Member

quasiben commented Mar 4, 2020

@madsbk do you have opinions here as well on things ucx-py could be/should be doing ?

@madsbk
Member

madsbk commented Mar 4, 2020

@madsbk do you have opinions here as well on things ucx-py could be/should be doing ?

I don't think that ucx-py should support any serialization directly.
But it would be great if cuDF could serialize/deserialize multiple columns to/from a single device_buffer.

@kkraus14
Collaborator

kkraus14 commented Mar 4, 2020

@jrhemstad I don't think we have any special requirements for the serialized layout other than using typical standard containers like rmm::device_buffer and std::vector, so that we can easily send them over the wire without having to play games.

From our perspective we don't have a lot of opinions as our path would look something like:
Python Table --> cudf::table_view --> cudf::pack --> (rmm::device_buffer, std::vector) --> UCX Send(s) --> UCX Receive(s) --> (rmm::device_buffer, std::vector) --> cudf::unpack --> cudf::table_view --> Python Table
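The `(rmm::device_buffer, std::vector)` pair in the pipeline above implies a flat host-side metadata encoding. A minimal sketch of what such an encoding could look like, assuming a hypothetical `col_desc` record of per-column offset and size (the real libcudf metadata would also carry types and null masks):

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical per-column record: where the column's bytes live
// inside the packed device buffer.
struct col_desc {
  std::uint64_t offset;
  std::uint64_t size;
};

// Encode descriptors into a flat host byte vector:
// [count][offset0][size0][offset1][size1]...
std::vector<std::uint8_t> serialize(std::vector<col_desc> const& cols) {
  std::vector<std::uint8_t> out(sizeof(std::uint64_t) * (1 + 2 * cols.size()));
  std::uint64_t count = cols.size();
  std::memcpy(out.data(), &count, sizeof(count));
  std::memcpy(out.data() + sizeof(count), cols.data(),
              sizeof(col_desc) * cols.size());
  return out;
}

// Decode the flat byte vector back into descriptors.
std::vector<col_desc> deserialize(std::vector<std::uint8_t> const& bytes) {
  std::uint64_t count = 0;
  std::memcpy(&count, bytes.data(), sizeof(count));
  std::vector<col_desc> cols(count);
  std::memcpy(cols.data(), bytes.data() + sizeof(count),
              sizeof(col_desc) * count);
  return cols;
}
```

Because the metadata is plain bytes in a `std::vector`, it can ride over the wire next to the device buffer with no extra framing.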

@jlowe
Member

jlowe commented Mar 4, 2020

I don't think we have any special requirements for the serialized layout other than using the typical standard containers of things

+1 from the Spark plugin perspective. We should be able to work with just the device buffer representing the contiguous GPU data and an opaque blob of host bytes representing the serialized table_view.

@jakirkham
Member

@madsbk do you have opinions here as well on things ucx-py could be/should be doing ?

I don't think that ucx-py should support any serialization directly.
But it would be great if cuDF could serialize/deserialize multiple columns to/from a single device_buffer.

Yeah, I was thinking we might do this in Distributed before it got to UCX-Py, more or less the same as PR dask/distributed#3453.

@jakirkham
Member

FWIW I think from the Dask side w.r.t. cuDF we would be well served for most cases by having something we could plug into cuDF "cuda" serialization. We could then reuse this for transmission with TCP or UCX and spilling since they all go through that code path now.

There are some other objects for which we would like similar functionality (like CuPy sparse arrays), though I think other Numba and CuPy objects already work with contiguous memory chunks to the extent possible.

@jrhemstad jrhemstad self-assigned this Apr 16, 2020
rapids-bot bot pushed a commit that referenced this issue Feb 4, 2021
…format. (#7096)

Addresses #3793

Depends on #6864 (this affects contiguous_split.cu; for the purposes of this PR, the only relevant changes are those involving the generation of metadata).

- `pack()` performs a `contiguous_split()` on the incoming table to arrange the memory into a unified device buffer, and generates a host-side metadata buffer. These are returned in the `packed_columns` struct.

- `unpack()` takes the data stored in the `packed_columns` struct and returns a deserialized `table_view` that points into it.

The intent of this functionality is as follows (pseudocode):

```
// serialize-side
table_view t;
packed_columns p = pack(t);
send_over_network(p.gpu_data);
send_over_network(p.metadata);

// deserialize-side
packed_columns p = receive_from_network();
table_view t = unpack(p);
```

This PR also renames `contiguous_split_result` to `packed_table` (which is just a bundled `table_view` and `packed_column`)

Authors:
  - @nvdbaranec

Approvers:
  - Jake Hemstad (@jrhemstad)
  - Paul Taylor (@trxcllnt)
  - Mike Wilson (@hyperbolic2346)

URL: #7096
@jrhemstad
Contributor Author

This was closed by #7096
