
Adding string row size iterator for row to column and column to row conversion #10157

Merged

Conversation

hyperbolic2346 (Contributor)

This is the first step toward supporting variable-width strings in the row-to-column and column-to-row conversion code. It adds an iterator that reads the offsets child of each string column to compute the per-row sizes of this variable-width data.

Note that this doesn't add support for strings yet, but is the first step in that direction.

closes #10111
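
For context on what the size iterator computes: a cudf strings column stores its characters in one contiguous buffer plus an offsets child column of length num_rows + 1, so the byte size of row i is offsets[i + 1] - offsets[i]. A minimal standalone sketch of that calculation in plain Thrust (hypothetical names, not the PR's actual code):

```cpp
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>

#include <vector>

// Size of string row `row`, read from the offsets child column: a strings
// column stores one contiguous character buffer plus num_rows + 1 offsets.
struct string_size_fn {
  int const* offsets;  // length num_rows + 1
  __host__ __device__ int operator()(int row) const
  {
    return offsets[row + 1] - offsets[row];
  }
};

int main()
{
  // Offsets for the strings "abc", "", "defgh".
  std::vector<int> h_offsets{0, 3, 3, 8};
  thrust::device_vector<int> offsets(h_offsets);

  auto size_it = thrust::make_transform_iterator(
      thrust::make_counting_iterator(0),
      string_size_fn{thrust::raw_pointer_cast(offsets.data())});

  thrust::device_vector<int> sizes(3);
  thrust::copy(size_it, size_it + 3, sizes.begin());  // sizes = {3, 0, 5}
  return 0;
}
```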

@hyperbolic2346 hyperbolic2346 requested a review from a team as a code owner January 28, 2022 07:14
@github-actions bot added the Java (Affects Java cuDF API) label Jan 28, 2022
@hyperbolic2346 added the 3 - Ready for Review (Ready for review by team), non-breaking (Non-breaking change), and feature request (New feature or request) labels Jan 28, 2022

@jjacobelli (Contributor)

rerun tests

Comment on lines 218 to 225:

```cpp
auto data_iter = cudf::detail::make_counting_transform_iterator(
    0, [d_string_columns_offsets = d_string_columns_offsets.data(), num_columns,
        num_rows] __device__(auto element_idx) {
      auto const row = element_idx / num_columns;
      auto const col = element_idx % num_columns;

      return d_string_columns_offsets[col][row + 1] - d_string_columns_offsets[col][row];
    });
```
@ttnghia (Contributor) — Feb 7, 2022

Hmm, from my perspective this computation is inefficient. You are looping col-by-col: for each row, you iteratively access all the columns before going to the next row, so each column is accessed separately num_rows times.

@ttnghia (Contributor) — Feb 7, 2022

How about this?

```cpp
auto const row = element_idx % num_rows;
auto const col = element_idx / num_rows;
...
```

This way you may not be able to use reduce_by_key. Instead, you need to initialize d_row_offsets to zero (thrust::uninitialized_fill) and then atomicAdd each output value.
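
A sketch of what this suggestion might look like (not the merged code; it assumes d_row_sizes is a pre-allocated rmm::device_uvector<int32_t> of length num_rows and d_string_columns_offsets holds one offsets pointer per string column, as in the snippet above):

```cpp
// Zero the per-row totals, then have each (row, col) element atomically add
// its string's size into its row's total. Adjacent threads walk adjacent rows
// of the same column, so the offset reads are coalesced.
thrust::uninitialized_fill(rmm::exec_policy(stream), d_row_sizes.begin(),
                           d_row_sizes.end(), 0);
thrust::for_each_n(
    rmm::exec_policy(stream), thrust::make_counting_iterator(0),
    num_rows * num_columns,
    [d_offsets = d_string_columns_offsets.data(), d_sizes = d_row_sizes.data(),
     num_rows] __device__(auto element_idx) {
      auto const row = element_idx % num_rows;
      auto const col = element_idx / num_rows;
      atomicAdd(&d_sizes[row], d_offsets[col][row + 1] - d_offsets[col][row]);
    });
```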

@ttnghia (Contributor) — Feb 7, 2022

I'm not sure this solution is more efficient. It should be if we have a large number of columns; otherwise I don't know.

Contributor

If you can run a benchmark to compare the solutions, that would be great 😄

@hyperbolic2346 (Contributor, Author)

Spark's maximum-columns setting defaults to 100, and it seems far more likely that we will see a very large number of rows than a very large number of columns. With reduce_by_key's requirement that keys be consecutive, we can't simply flip the math. I will do some performance testing and report back.
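
For context, the consecutive-keys requirement comes from thrust::reduce_by_key, which only combines runs of equal adjacent keys. A sketch of the row-major pattern under discussion (d_row_sizes is a hypothetical output buffer, data_iter is the iterator from the snippet above; this is not the exact merged code):

```cpp
// Row-major keys are 0,0,...,0, 1,1,...,1, ... (num_columns copies of each
// row index in a row), so reduce_by_key produces exactly one sum per row.
auto row_keys = cudf::detail::make_counting_transform_iterator(
    0, [num_columns] __device__(auto idx) { return idx / num_columns; });
thrust::reduce_by_key(rmm::exec_policy(stream), row_keys,
                      row_keys + num_rows * num_columns, data_iter,
                      thrust::make_discard_iterator(), d_row_sizes.begin());
// Flipping to row = idx % num_rows makes the key sequence 0,1,...,num_rows-1
// repeated once per column; equal keys are no longer adjacent, so
// reduce_by_key would emit num_columns partial sums per row instead of one.
```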

@hyperbolic2346 (Contributor, Author)

I performance-tested this code: the function runs in about 1.2 ms on my PC for 50 columns and 1,000,000 rows of intermixed int and string columns. With the change to drop reduce_by_key and march the data in the more natural order, that time drops to 0.75 ms. That seems worth it, even though it removes the chance to use the cool transform output iterator suggested in review. Thanks for pushing for this; I probably dismissed the idea because I was excited to use reduce_by_key.

@hyperbolic2346 (Contributor, Author)

@revans2 is that default limit 100, or have I been led astray by my reading?


Contributor

> ...march the data in a more natural way this time drops to 0.75 ms.

Kudos, @ttnghia and @hyperbolic2346! I'm having a hard time grokking why this iteration order is faster. All the string columns still have to be accessed num_rows times eventually, so this should come down to... proximity? All threads in a warp acting on nearby locations in memory?

Contributor

In the old way, we access row 0 of columns 0, 1, 2, etc., then row 1 of columns 0, 1, 2, etc., and so on. Each access pulls data from a different column, at a different location in memory.
In the new way, we access rows 0, 1, 2, etc. of column 0, then rows 0, 1, 2, etc. of column 1, and so on, so the data is pulled from contiguous memory locations.
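
Concretely, for num_rows = 4 and num_columns = 2 the two mappings visit (row, column) pairs in this order (a small illustration, not code from the PR):

```cpp
// old (row-major):    element_idx 0..7 -> (0,0) (0,1) (1,0) (1,1) (2,0) (2,1) (3,0) (3,1)
// new (column-major): element_idx 0..7 -> (0,0) (1,0) (2,0) (3,0) (0,1) (1,1) (2,1) (3,1)
int const num_rows = 4, num_columns = 2;
int const element_idx = 5;
int const old_row = element_idx / num_columns, old_col = element_idx % num_columns;  // (2, 1)
int const new_row = element_idx % num_rows,    new_col = element_idx / num_rows;     // (1, 1)
// In the new order, a warp's threads read consecutive entries of a single
// offsets array, which the hardware coalesces into a few wide transactions.
```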

hyperbolic2346 and others added 3 commits February 7, 2022 16:02
Co-authored-by: Nghia Truong <ttnghia@users.noreply.github.com>
Co-authored-by: Nghia Truong <ttnghia@users.noreply.github.com>
@revans2 (Contributor) left a comment

It looks good, but my C++ is not great, so I am not going to approve this.

@mythrocks (Contributor) left a comment

I'm 👍 on the changes, after the other reviewers' comments are addressed.
Thank you for your patience, @hyperbolic2346.

@hyperbolic2346 (Contributor, Author)

rerun tests

@hyperbolic2346 (Contributor, Author)

@gpucibot merge

@rapids-bot rapids-bot bot merged commit dcac052 into rapidsai:branch-22.04 Feb 11, 2022
@hyperbolic2346 hyperbolic2346 deleted the mwilson/string-iterator branch February 11, 2022 05:25
rapids-bot bot pushed a commit that referenced this pull request Mar 22, 2022
This is the code for the column-to-row portion of the string work. It converts a table that includes strings into the JCUDF row format. It depends on #10157 and as such is a draft PR until that is merged. I am putting this up now so people reviewing that PR can see where it is headed.

closes #10234

Authors:
  - Mike Wilson (https://github.com/hyperbolic2346)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - MithunR (https://github.com/mythrocks)
  - https://github.com/nvdbaranec

URL: #10235
Successfully merging this pull request may close these issues:

[FEA] Add iterator for variable-length string data row offsets