Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

byte_range support for multibyte_split/read_text #10150

Merged
merged 45 commits into from
Mar 1, 2022

Conversation

cwharris
Copy link
Contributor

Adding byte_range support to multibyte_split/read_text.

Closes #9655

providing a byte range in terms of (offset, size) allows multibyte_split to read a whole file, but only return the offsets within those ranges as well as one additional offset (unless it's the end of the file). If thinking in terms of "records", where each delimiter dictates the end of a record, we effectively return all records which begin within the byte range provided, and ignore all other records, including any record which may end (but not begin) within the range, and including any record which may begin in the range but end outside of the range.

examples:

input: "abc..def..ghi..jkl.."
delimiter: ..
range offset: 0
range size: 2
output: ["abc.."]
range offset: 2
range size: 9
output: ["def..", "ghi.."]
range offset: 11
range size: 2
output: []
range offset: 13
range size: 7
output: ["jkl..", ""]

@cwharris cwharris requested review from a team as code owners January 27, 2022 19:25
@github-actions github-actions bot added Python Affects Python cuDF API. libcudf Affects libcudf (C++/CUDA) code. labels Jan 27, 2022
@cwharris cwharris added 3 - Ready for Review Ready for review by team 4 - Needs cuDF (Python) Reviewer 4 - Needs Review Waiting for reviewer to review or respond feature request New feature or request non-breaking Non-breaking change and removed libcudf Affects libcudf (C++/CUDA) code. Python Affects Python cuDF API. labels Jan 27, 2022
Copy link
Contributor

@robertmaynard robertmaynard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CMake changes LGTM

cpp/include/cudf/io/text/data_chunk_source.hpp Outdated Show resolved Hide resolved
cpp/include/cudf/io/text/byte_range_info.hpp Show resolved Hide resolved
cpp/include/cudf/io/text/byte_range_info.hpp Outdated Show resolved Hide resolved
cpp/include/cudf/io/text/byte_range_info.hpp Outdated Show resolved Hide resolved
cpp/include/cudf/io/text/data_chunk_source_factories.hpp Outdated Show resolved Hide resolved
cpp/tests/io/text/multibyte_split_test.cpp Outdated Show resolved Hide resolved
cpp/tests/io/text/multibyte_split_test.cpp Outdated Show resolved Hide resolved
python/cudf/cudf/tests/test_text.py Show resolved Hide resolved
python/cudf/cudf/tests/test_text.py Outdated Show resolved Hide resolved
python/cudf/cudf/tests/test_text.py Outdated Show resolved Hide resolved
@cwharris cwharris requested a review from vyasr February 28, 2022 21:01
Copy link
Contributor

@vyasr vyasr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few outstanding items, but nothing blocking.

cpp/src/io/text/byte_range_info.cpp Outdated Show resolved Hide resolved
return static_cast<int32_t>(offset - relevant_offset_first);
});

stream.synchronize();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused, did you mean to delete this?

python/cudf/cudf/tests/test_text.py Show resolved Hide resolved
@cwharris
Copy link
Contributor Author

cwharris commented Mar 1, 2022

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 78b316c into rapidsai:branch-22.04 Mar 1, 2022
rapids-bot bot pushed a commit that referenced this pull request Mar 2, 2022
#10150 broke compiler support for GCC 11 (built locally) because it was missing `#include <optional>` in a couple files. This fixes it. cc: @cwharris

Authors:
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - Robert Maynard (https://github.com/robertmaynard)
  - Karthikeyan (https://github.com/karthikeyann)

URL: #10385
rapids-bot bot pushed a commit that referenced this pull request Mar 22, 2022
This uses the `byte_range` argument added to read_text with #10150 to create a dask version of `read_text`

Authors:
  - https://github.com/ChrisJar

Approvers:
  - Benjamin Zaitlen (https://github.com/quasiben)

URL: #10407
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team CMake CMake build issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] Support reading "chunks" of large text files w/ cudf.read_text
9 participants