-
Notifications
You must be signed in to change notification settings - Fork 891
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Move chars column to parent data buffer in strings column #14202
Move chars column to parent data buffer in strings column #14202
Conversation
…_limit_experiment
…_limit_experiment
@galipremsagar failing pytests Update: these issues are fixed. |
use Optional move string special case for data to string.py use data for memory usage calculation remove chars_data declaration
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approving python changes with (non-blocking) suggestion to introduce a single type definition for the char type.
build_column( | ||
data=as_buffer( | ||
rmm.DeviceBuffer( | ||
size=row_count * cudf.dtype("int8").itemsize |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general this condition can also be true for (at least) List and Struct columns. But, those are handled by specific cases above.
From a quick test, I think we don't need an empty chars buffer, so can you try using rmm.DeviceBuffer(0)
?
@@ -5938,15 +5930,15 @@ def view(self, dtype) -> "cudf.core.column.ColumnBase": | |||
str_end_byte_offset = self.base_children[0].element_indexing( | |||
self.offset + self.size | |||
) | |||
char_dtype_size = self.base_children[1].dtype.itemsize | |||
char_dtype_size = cudf.api.types.dtype("int8").itemsize |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, ok. Can we introduce (like cudf._lib.types.size_type_dtype
) a single source of truth for the type of the string char buffer, perhaps cudf._lib.types.char_type_dtype
?
We'll probably want to update the developer's guide once this is merged as well |
Co-authored-by: David Wendt <45795991+davidwendt@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Optional comment otherwise LGTM
@@ -5938,15 +5930,15 @@ def view(self, dtype) -> "cudf.core.column.ColumnBase": | |||
str_end_byte_offset = self.base_children[0].element_indexing( | |||
self.offset + self.size | |||
) | |||
char_dtype_size = self.base_children[1].dtype.itemsize | |||
char_dtype_size = cudf.api.types.dtype("int8").itemsize |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be worth adding a # TODO
comment noting that int8 is a workaround
/merge |
Fixes deprecation warnings introduced when #14202 merged. Most of these are for calls to `cudf::make_strings_column` which deprecated the chars-column function overload. Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Bradley Dice (https://github.com/bdice) - Vyas Ramasubramani (https://github.com/vyasr) URL: #14771
Removes the functions deprecated in 24.02 in #14202. Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Bradley Dice (https://github.com/bdice) - Yunsong Wang (https://github.com/PointKernel) - Vyas Ramasubramani (https://github.com/vyasr) URL: #14848
This PR contains a number of different fixes currently required to get cugraph tests passing: - There are two main changes for pandas 2 compatibility: - [pandas renamed `DataFrame.applymap` to `DataFrame.map`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.map.html) so creating the renumbering map with a column `map` caused problems for attribute-based column access `renumber_map.map`. Those columns are now renamed to `renumber_map`. - Empty columns now default to str rather than float, so tests that assumed we could access the values as cupy arrays failed because cudf's string columns cannot be converted to cupy arrays. These columns are now always cast to float in the tests before the cupy conversion. - cugraph-dgl and cugraph-pyg's wheel builds were not downloading the latest cugraph/pylibcugraph wheels to run tests. As a result, the above pandas 2 fixes didn't take when running the dgl and pyg tests. I updated the wheel building scripts to account for this discrepancy. - rapidsai/cudf#14202 made a breaking change to how characters are encoded in strings columns in cudf, which broke cugraph_etl. This PR fixes the code that depended on the old APIs. This code also includes a small patch to the cugraph_etl CMake so that it exports the correct package name (previously it was using cugraph). Authors: - Vyas Ramasubramani (https://github.com/vyasr) Approvers: - GALI PREM SAGAR (https://github.com/galipremsagar) - Bradley Dice (https://github.com/bdice) - Chuck Hastings (https://github.com/ChuckHastings) - Rick Ratzel (https://github.com/rlratzel) - Jake Awe (https://github.com/AyodeAwe) URL: #4144
Description
Eliminates chars column and moves chars data to parent string column's _data buffer.
Summary of changes
chars_size()
,chars_end()
instrings_column_view
and their invocationschars_column_index
, and deprecatechars()
fromstrings_column_view
chars_col.begin<char>()
withstatic_cast<char*>(parent.head())
rmm::device_buffer
instead of chars columnstream
parameter to chars_size)Preparing for #13733