Refactor strings column factories #7397

Merged
Changes from 25 commits
32 commits
7cfdbd5
Initial attempt at using span/iterators in make_strings_columns (WIP)
harrism Feb 16, 2021
a3a1cfa
Fix zip iterator
harrism Feb 17, 2021
99eea9e
Convert chars, offsets, nulls make_strings_column to iterators
harrism Feb 17, 2021
189fdf5
use device_span version of make_strings_column in scan.cu
harrism Feb 17, 2021
01a02e2
Add another device_span version of make_strings_column
harrism Feb 17, 2021
1b8b200
Use device_span version of make_strings_column in CSV reader_impl.cu
harrism Feb 17, 2021
b33ca20
make_strings_column from std::vector uses spans internally
harrism Feb 23, 2021
f43aeb1
Clean up binops/scan spans.
harrism Feb 23, 2021
f5ecc6e
Merge branch 'branch-0.19' into fea-refactor-strings-factories
harrism Feb 23, 2021
61604d1
Remove errant std::cout
harrism Feb 23, 2021
43a3192
Make span classes part of public interface and refactor strings colum…
harrism Feb 23, 2021
bc55428
Merge branch 'branch-0.19' into fea-refactor-strings-factories
harrism Feb 23, 2021
9962b9b
Add make_device_uvector_* utilities
harrism Feb 24, 2021
75937af
Eliminate std::vector version of make_strings_column factory
harrism Feb 24, 2021
76e67a6
Add vector_factories.hpp to meta.yaml
harrism Feb 24, 2021
164a6b2
Don't put return type in doxygen @return
harrism Feb 24, 2021
1299e01
Remove unnecessary includes and add CUDF_FUNC_RANGE
harrism Feb 24, 2021
f940ee5
Convert strings extract vector to uvector
harrism Feb 24, 2021
7ec1d44
strings findall device_uvector to device_vector
harrism Feb 24, 2021
b9d6972
strings partition device_vector -> uvector
harrism Feb 25, 2021
55d2e4e
device_vector ->uvector in strings split / split_record
harrism Feb 25, 2021
ed0fd9d
device_vector -> uvector in tokenize
harrism Feb 25, 2021
5dc8e4a
Merge branch 'branch-0.19' into fea-refactor-strings-factories
harrism Mar 2, 2021
3fad943
Clean up vector_factories.hpp
harrism Mar 2, 2021
aeff8d2
Add missing const
harrism Mar 2, 2021
0cd009e
Add sync
harrism Mar 2, 2021
71db9af
Docs cleanup
harrism Mar 2, 2021
6596426
Remove unnecessary syncs in test
harrism Mar 2, 2021
4a4c8fc
Copyrights
harrism Mar 3, 2021
8ec27bf
Remove blank line
harrism Mar 3, 2021
8b132c4
Merge branch 'branch-0.19' into fea-refactor-strings-factories
harrism Mar 3, 2021
fb3c023
Fix detail::span
harrism Mar 4, 2021
1 change: 1 addition & 0 deletions conda/recipes/libcudf/meta.yaml
@@ -96,6 +96,7 @@ test:
- test -f $PREFIX/include/cudf/detail/utilities/alignment.hpp
- test -f $PREFIX/include/cudf/detail/utilities/integer_utils.hpp
- test -f $PREFIX/include/cudf/detail/utilities/int_fastdiv.h
- test -f $PREFIX/include/cudf/detail/utilities/vector_factories.hpp
- test -f $PREFIX/include/cudf/dictionary/detail/concatenate.hpp
- test -f $PREFIX/include/cudf/dictionary/detail/encode.hpp
- test -f $PREFIX/include/cudf/dictionary/detail/merge.hpp
7 changes: 6 additions & 1 deletion cpp/benchmarks/common/generate_benchmark_input.cpp
@@ -26,6 +26,7 @@

#include <rmm/cuda_stream_view.hpp>
#include <rmm/device_buffer.hpp>
#include <rmm/device_vector.hpp>

#include <future>
#include <memory>
@@ -411,7 +412,11 @@ std::unique_ptr<cudf::column> create_random_column<cudf::string_view>(data_profi
row += std::max(run_len - 1, 0);
}
}
return cudf::make_strings_column(out_col.chars, out_col.offsets, out_col.null_mask);

rmm::device_vector<char> d_chars(out_col.chars);
rmm::device_vector<cudf::size_type> d_offsets(out_col.offsets);
rmm::device_vector<cudf::bitmask_type> d_null_mask(out_col.null_mask);
return cudf::make_strings_column(d_chars, d_offsets, d_null_mask);
}

template <>
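The call site above appears to rely on the `rmm::device_vector` arguments converting implicitly to the new `device_span` parameters. As a rough, host-only illustration of that pattern (not cudf's actual `device_span`), the sketch below uses a hypothetical `simple_span` class and a hypothetical `sum` function to show how one view-taking signature can accept several container types, plus an empty `{}` default like the one used for `null_mask`.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <utility>
#include <vector>

// Hypothetical host-only stand-in for a span: a pointer plus a length, no ownership.
// This is NOT cudf::device_span; it only illustrates the calling convention.
template <typename T>
class simple_span {
 public:
  simple_span() = default;
  simple_span(T* data, std::size_t size) : data_{data}, size_{size} {}

  // Implicit conversion from any lvalue container exposing data() and size().
  // Temporaries are rejected because they cannot bind to `Container&`.
  template <typename Container,
            typename = decltype(std::declval<Container&>().data())>
  simple_span(Container& c) : data_{c.data()}, size_{c.size()}
  {
  }

  T* begin() const { return data_; }
  T* end() const { return data_ + size_; }
  std::size_t size() const { return size_; }

 private:
  T* data_{nullptr};
  std::size_t size_{0};
};

// Takes a view rather than `std::vector<int> const&`, so callers are not tied
// to one container type. The view is cheap to pass by value.
long sum(simple_span<int const> values = {})
{
  return std::accumulate(values.begin(), values.end(), 0L);
}
```

With this shape, `sum(v)` works for a `std::vector<int>` and a `std::array<int, N>` alike, and `sum({})` stands in for "no data", which is the same convenience the span-based factory signatures give callers.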
1 change: 0 additions & 1 deletion cpp/benchmarks/copying/shift_benchmark.cu
@@ -23,7 +23,6 @@

#include <benchmark/benchmark.h>

#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/functional.h>
#include <thrust/sequence.h>
128 changes: 39 additions & 89 deletions cpp/include/cudf/column/column_factories.hpp
@@ -17,6 +17,7 @@

#include <cudf/column/column.hpp>
#include <cudf/types.hpp>
#include <cudf/utilities/span.hpp>
#include <cudf/utilities/traits.hpp>

#include <rmm/cuda_stream_view.hpp>
@@ -330,7 +331,7 @@ std::unique_ptr<column> make_fixed_width_column(
}

/**
* @brief Construct STRING type column given a device vector of pointer/size pairs.
* @brief Construct STRING type column given a device span of pointer/size pairs.
* The total number of char bytes must not exceed the maximum size of size_type.
* The string characters are expected to be UTF-8 encoded sequence of char
* bytes. Use the strings_column_view class to perform strings operations on
@@ -344,20 +345,19 @@ std::unique_ptr<column> make_fixed_width_column(
*
* @throws std::bad_alloc if device memory allocation fails
*
* @param[in] strings The vector of pointer/size pairs.
* Each pointer must be a device memory address or `nullptr`
* (indicating a null string). The size must be the number of bytes.
* @param[in] strings The device span of pointer/size pairs. Each pointer must be a device memory
address or `nullptr` (indicating a null string). The size must be the number of bytes.
* @param[in] stream CUDA stream used for device memory operations and kernel launches.
* @param[in] mr Device memory resource used for allocation of the column's `null_mask` and children
* columns' device memory.
*/
std::unique_ptr<column> make_strings_column(
const rmm::device_vector<thrust::pair<const char*, size_type>>& strings,
cudf::device_span<thrust::pair<const char*, size_type> const> strings,
rmm::cuda_stream_view stream = rmm::cuda_stream_default,
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

/**
* @brief Construct STRING type column given a device vector of string_view.
* @brief Construct STRING type column given a device span of string_view.
* The total number of char bytes must not exceed the maximum size of size_type.
* The string characters are expected to be UTF-8 encoded sequence of char
* bytes. Use the strings_column_view class to perform strings operations on
@@ -372,118 +372,68 @@ std::unique_ptr<column> make_strings_column(
*
* @throws std::bad_alloc if device memory allocation fails
*
* @param[in] string_views The vector of string_view.
* Each string_view must point to a device memory address or
* `null_placeholder` (indicating a null string). The size must be the number of
* bytes.
* @param[in] string_views The span of string_view. Each string_view must point to a device memory
address or `null_placeholder` (indicating a null string). The size must be the number of bytes.
* @param[in] null_placeholder string_view indicating null string in given list of
* string_views.
* @param[in] stream CUDA stream used for device memory operations and kernel launches.
* @param[in] mr Device memory resource used for allocation of the column's `null_mask` and children
* columns' device memory.
*/
std::unique_ptr<column> make_strings_column(
const rmm::device_vector<string_view>& string_views,
cudf::device_span<string_view const> string_views,
const string_view null_placeholder,
rmm::cuda_stream_view stream = rmm::cuda_stream_default,
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

/**
* @brief Construct STRING type column given a device vector of chars
* encoded as UTF-8, a device vector of byte offsets identifying individual
* strings within the char vector, and an optional null bitmask.
* @brief Construct STRING type column given a device span of chars encoded as UTF-8, a device
* span of byte offsets identifying individual strings within the char vector, and an optional
* null bitmask.
*
* `offsets.front()` must always be zero.
*
* The total number of char bytes must not exceed the maximum size of size_type.
* Use the strings_column_view class to perform strings operations on this type
* of column.
* This function makes a deep copy of the strings, offsets, null_mask to create
* a new column.
* The total number of char bytes must not exceed the maximum size of size_type. Use the
* strings_column_view class to perform strings operations on this type of column.
*
* @throws std::bad_alloc if device memory allocation fails
*
* @param[in] strings The vector of chars in device memory.
* This char vector is expected to be UTF-8 encoded characters.
* @param[in] offsets The vector of byte offsets in device memory.
* The number of elements is one more than the total number
* of strings so the `offsets.back()` is the total
* number of bytes in the strings array.
* `offsets.front()` must always be 0 to point to the beginning
* of `strings`.
* @param[in] null_mask Device vector containing the null element indicator bitmask.
* Arrow format for nulls is used for interpreting this bitmask.
* @param[in] null_count The number of null string entries. If equal to
* `UNKNOWN_NULL_COUNT`, the null count will be computed dynamically on the
* first invocation of `column::null_count()`
* @param[in] stream CUDA stream used for device memory operations and kernel launches.
* @param[in] mr Device memory resource used for allocation of the column's `null_mask` and children
* columns' device memory.
*/
std::unique_ptr<column> make_strings_column(
const rmm::device_vector<char>& strings,
const rmm::device_vector<size_type>& offsets,
const rmm::device_vector<bitmask_type>& null_mask = {},
size_type null_count = cudf::UNKNOWN_NULL_COUNT,
rmm::cuda_stream_view stream = rmm::cuda_stream_default,
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

/**
* @brief Construct STRING type column given a host vector of chars
* encoded as UTF-8, a host vector of byte offsets identifying individual
* strings within the char vector, and an optional null bitmask.
*
* `offsets.front()` must always be zero.
*
* The total number of char bytes must not exceed the maximum size of size_type.
* Use the strings_column_view class to perform strings operations on this type
* of column.
* This function makes a deep copy of the strings, offsets, null_mask to create
* a new column.
* This function makes a deep copy of the strings, offsets, null_mask to create a new column.
*
* @throws std::bad_alloc if device memory allocation fails
*
* @param[in] strings The contiguous array of chars in host memory.
* This char array is expected to be UTF-8 encoded characters.
* @param[in] offsets The array of byte offsets in host memory.
* The number of elements is one more than the total number
* of strings so the `offsets.back()` is the total
* number of bytes in the strings array.
* `offsets.front()` must always be 0 to point to the beginning
* of `strings`.
* @param[in] null_mask Host vector containing the null element indicator bitmask.
* Arrow format for nulls is used for interpreting this bitmask.
* @param[in] null_count The number of null string entries. If equal to
* `UNKNOWN_NULL_COUNT`, the null count will be computed dynamically on the
* first invocation of `column::null_count()`
* @param[in] strings The device span of chars in device memory. This char vector is expected to be
* UTF-8 encoded characters.
* @param[in] offsets The device span of byte offsets in device memory. The number of elements is
* one more than the total number of strings so the `offsets.back()` is the total number of bytes
* in the strings array. `offsets.front()` must always be 0 to point to the beginning of `strings`.
* @param[in] null_mask Device span containing the null element indicator bitmask. Arrow format for
* nulls is used for interpreting this bitmask.
* @param[in] null_count The number of null string entries. If equal to `UNKNOWN_NULL_COUNT`, the
* null count will be computed dynamically on the first invocation of `column::null_count()`
* @param[in] stream CUDA stream used for device memory operations and kernel launches.
* @param[in] mr Device memory resource used for allocation of the column's `null_mask` and children
* columns' device memory.
*/
std::unique_ptr<column> make_strings_column(
const std::vector<char>& strings,
const std::vector<size_type>& offsets,
const std::vector<bitmask_type>& null_mask = {},
size_type null_count = cudf::UNKNOWN_NULL_COUNT,
rmm::cuda_stream_view stream = rmm::cuda_stream_default,
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());
cudf::device_span<char const> strings,
cudf::device_span<size_type const> offsets,
cudf::device_span<bitmask_type const> null_mask = {},
size_type null_count = cudf::UNKNOWN_NULL_COUNT,
rmm::cuda_stream_view stream = rmm::cuda_stream_default,
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

/**
* @brief Constructs a STRING type column given offsets column, chars columns,
* and null mask and null count. The columns and mask are moved into the
* resulting strings column.
* @brief Constructs a STRING type column given offsets column, chars columns, and null mask and
* null count. The columns and mask are moved into the resulting strings column.
*
* @param[in] num_strings The number of strings the column represents.
* @param[in] offsets_column The column of offset values for this column.
* The number of elements is one more than the total number
* of strings so the offset[last] - offset[0] is the total
* number of bytes in the strings vector.
* @param[in] chars_column The column of char bytes for all the strings for this column.
* Individual strings are identified by the offsets and the
* nullmask.
* @param[in] offsets_column The column of offset values for this column. The number of elements is
* one more than the total number of strings so the `offset[last] - offset[0]` is the total number
* of bytes in the strings vector.
* @param[in] chars_column The column of char bytes for all the strings for this column. Individual
* strings are identified by the offsets and the nullmask.
* @param[in] null_count The number of null string entries.
* @param[in] null_mask The bits specifying the null strings in device memory.
* Arrow format for nulls is used for interpreting this bitmask.
* @param[in] null_mask The bits specifying the null strings in device memory. Arrow format for
* nulls is used for interpreting this bitmask.
* @param[in] stream CUDA stream used for device memory operations and kernel launches.
* @param[in] mr Device memory resource used for allocation of the column's `null_mask` and children
* columns' device memory.
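The doc comments in this file describe a chars buffer, an offsets buffer with one more element than there are strings, and an Arrow-format null bitmask. The host-only sketch below shows one way those three buffers relate. The `strings_buffers` struct and `build_buffers` function are hypothetical names for illustration, and the 32-bit mask word and `int32_t` offset types are assumptions chosen to mirror `bitmask_type` and `size_type`. A `nullptr` row is treated as a null string, as in the pointer/size overload.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical host-side stand-ins for the three buffers described above.
struct strings_buffers {
  std::vector<char> chars;               // all string bytes, concatenated
  std::vector<std::int32_t> offsets;     // n + 1 byte offsets into `chars`
  std::vector<std::uint32_t> null_mask;  // validity bits, 1 = valid, 0 = null
};

// Builds the buffers from C strings; a nullptr entry marks a null row.
strings_buffers build_buffers(std::vector<char const*> const& rows)
{
  strings_buffers out;
  out.offsets.reserve(rows.size() + 1);
  out.offsets.push_back(0);  // offsets.front() must always be zero
  out.null_mask.assign((rows.size() + 31) / 32, 0u);

  for (std::size_t i = 0; i < rows.size(); ++i) {
    if (rows[i] != nullptr) {
      out.chars.insert(out.chars.end(), rows[i], rows[i] + std::strlen(rows[i]));
      // Arrow validity bitmaps number bits from the least significant bit.
      out.null_mask[i / 32] |= (std::uint32_t{1} << (i % 32));
    }
    // A null or empty row adds no bytes, so its end offset repeats the previous one.
    out.offsets.push_back(static_cast<std::int32_t>(out.chars.size()));
  }
  return out;
}
```

For the rows `"ab"`, null, `""`, `"cde"`, this produces offsets `0, 2, 2, 2, 5` and a single mask word with bits 0, 2 and 3 set. Note that an empty string is still valid; only the bitmask distinguishes it from a null.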
2 changes: 1 addition & 1 deletion cpp/include/cudf/detail/utilities/trie.cuh
@@ -30,7 +30,7 @@
#include <cuda_runtime.h>
#include <thrust/host_vector.h>

using cudf::detail::device_span;
using cudf::device_span;

static constexpr char trie_terminating_character = '\n';
