Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JNI: Add generateListOffsets API #10683

Merged
merged 12 commits into from
Apr 26, 2022

Conversation

sperlingxx
Copy link
Contributor

@sperlingxx sperlingxx commented Apr 19, 2022

Add generateListOffsets API, converting list lengths to list offsets, which is useful in the development of spark-rapids.

For example, the support of array_repeat and arrays_zip relies on this API.

Signed-off-by: sperlingxx <lovedreamf@gmail.com>
@sperlingxx sperlingxx added feature request New feature or request Java Affects Java cuDF API. Spark Functionality that helps Spark RAPIDS non-breaking Non-breaking change labels Apr 19, 2022
@sperlingxx sperlingxx requested a review from revans2 April 19, 2022 11:41
@sperlingxx sperlingxx requested a review from a team as a code owner April 19, 2022 11:41
@codecov
Copy link

codecov bot commented Apr 19, 2022

Codecov Report

Merging #10683 (2fd73f9) into branch-22.06 (17d49fa) will increase coverage by 0.01%.
The diff coverage is n/a.

@@               Coverage Diff                @@
##           branch-22.06   #10683      +/-   ##
================================================
+ Coverage         86.38%   86.39%   +0.01%     
================================================
  Files               142      142              
  Lines             22333    22304      -29     
================================================
- Hits              19292    19270      -22     
+ Misses             3041     3034       -7     
Impacted Files Coverage Δ
python/cudf/cudf/utils/gpu_utils.py 50.00% <0.00%> (-4.29%) ⬇️
python/cudf/cudf/core/frame.py 93.41% <0.00%> (-0.26%) ⬇️
python/cudf/cudf/core/indexed_frame.py 91.70% <0.00%> (-0.07%) ⬇️
python/cudf/cudf/core/dataframe.py 93.75% <0.00%> (-0.05%) ⬇️
python/cudf/cudf/core/column/column.py 89.43% <0.00%> (-0.02%) ⬇️
python/cudf/cudf/core/series.py 95.16% <0.00%> (+<0.01%) ⬆️
python/cudf/cudf/utils/utils.py 90.35% <0.00%> (+0.06%) ⬆️
python/cudf/cudf/core/column/string.py 89.21% <0.00%> (+0.10%) ⬆️
python/cudf/cudf/testing/_utils.py 93.98% <0.00%> (+0.13%) ⬆️
python/cudf/cudf/core/multiindex.py 92.28% <0.00%> (+0.13%) ⬆️
... and 7 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 17d49fa...2fd73f9. Read the comment docs.

@ttnghia ttnghia self-requested a review April 20, 2022 18:10
#include <cudf/detail/iterator.cuh>
#include <cudf/detail/valid_if.cuh>
#include <cudf/strings/detail/utilities.cuh>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you're having to include a header from a detail folder, something is wrong. These can break at any time, for any reason, and there will be no sympathy for any code that breaks.

Comment on lines 61 to 64
auto const begin_iter = list_length.template begin<cudf::size_type>();
auto const end_iter = list_length.template end<cudf::size_type>();

return cudf::strings::detail::make_offsets_child_column(begin_iter, end_iter, stream);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no reason to use a detail API here. This is just a scan and can use cudf::scan.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @jrhemstad, we need to allocate a output buffer whose size is N + 1. Then, project the result of inclusive scan to the output buffer (with the offset 1).
Is thrust::inclusive_scan more flexible to do this job than cudf::scan?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I am not wrong, we need to run an extra memcpy with the cudf::scan.

Copy link
Contributor

@ttnghia ttnghia Apr 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Humm, if we can't use the detail API then I hate to recommend this: just copy exactly the code from cudf::strings::detail::make_offsets_child_column. It will break such dependency, with the cost of having duplicate implementation.

Copy link
Contributor Author

@sperlingxx sperlingxx Apr 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Anyway, I rollbacked the change as it is unsafe to use a detail API outside libcudf.

Signed-off-by: sperlingxx <lovedreamf@gmail.com>
Signed-off-by: sperlingxx <lovedreamf@gmail.com>
sperlingxx and others added 2 commits April 22, 2022 09:47
Co-authored-by: Nghia Truong <ttnghia@users.noreply.github.com>
Signed-off-by: sperlingxx <lovedreamf@gmail.com>
@@ -3473,6 +3473,16 @@ public final ColumnVector listSortRows(boolean isDescending, boolean isNullSmall
return new ColumnVector(listSortRows(getNativeView(), isDescending, isNullSmallest));
}

/**
* Generate list offsets from sizes of each list.
* NOTICE: This column should type in INT32. And no null and negative value is allowed.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this requirement for the input column ? What if it is not matched ? Better to add comment for it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

@sperlingxx
Copy link
Contributor Author

rerun tests

1 similar comment
@sperlingxx
Copy link
Contributor Author

rerun tests

@sperlingxx
Copy link
Contributor Author

@gpucibot merge

@rapids-bot rapids-bot bot merged commit cc0bf12 into rapidsai:branch-22.06 Apr 26, 2022
@sperlingxx sperlingxx deleted the gen_list_offset branch April 26, 2022 09:36
@vyasr vyasr added 4 - Needs Review Waiting for reviewer to review or respond and removed 4 - Needs cuDF (Java) Reviewer labels Feb 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
4 - Needs Review Waiting for reviewer to review or respond feature request New feature or request Java Affects Java cuDF API. non-breaking Non-breaking change Spark Functionality that helps Spark RAPIDS
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants