Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixes List offset bug in Nested JSON reader #12060

Merged

Conversation

karthikeyann
Copy link
Contributor

Description

Fixes List offset end last item write condition bug
If there is a list row followed by empty list in next row, the previous row's end is not written to offsets.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@karthikeyann karthikeyann added bug Something isn't working 3 - Ready for Review Ready for review by team libcudf Affects libcudf (C++/CUDA) code. cuIO cuIO issue 4 - Needs cuIO Reviewer non-breaking Non-breaking change labels Nov 3, 2022
@karthikeyann karthikeyann added this to the Nested JSON reader milestone Nov 3, 2022
@karthikeyann karthikeyann self-assigned this Nov 3, 2022
@karthikeyann karthikeyann requested review from a team as code owners November 3, 2022 17:28
@github-actions github-actions bot added the Python Affects Python cuDF API. label Nov 3, 2022
Copy link
Contributor

@vyasr vyasr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on my limited understanding of the JSON code and the contents of the test the fix looks correct. One small stylistic comment.

cpp/src/io/json/json_column.cu Show resolved Hide resolved
@@ -525,14 +525,15 @@ void make_device_json_column(device_span<SymbolT const> input,
auto parent_node_id = ordered_parent_node_ids[i];
if (parent_node_id != parent_node_sentinel and node_categories[parent_node_id] == NC_LIST) {
// unique item
if (i == 0 ||
if (i == 0 or
(col_ids[i - 1] != col_ids[i] or ordered_parent_node_ids[i - 1] != parent_node_id)) {
// scatter to list_offset
d_columns_data[original_col_ids[parent_node_id]]
.child_offsets[row_offsets[parent_node_id]] = ordered_row_offsets[i];
}
// TODO: verify if this code is right. check with more test cases.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this TODO still needed? Or does this PR cover enough test cases that we now trust its correctness?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will keep the TODO for now.
This PR has test cases to test the conditions. But it needs a few more test cases to be thorough.

Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fix looks fine, and thanks for the new test. I have one question about removing a TODO.

@karthikeyann
Copy link
Contributor Author

rerun tests

1 similar comment
@karthikeyann
Copy link
Contributor Author

rerun tests

@codecov
Copy link

codecov bot commented Nov 4, 2022

Codecov Report

Base: 87.47% // Head: 88.12% // Increases project coverage by +0.64% 🎉

Coverage data is based on head (8ab9608) compared to base (f817d96).
Patch has no changes to coverable lines.

❗ Current head 8ab9608 differs from pull request most recent head 9ad28d9. Consider uploading reports for the commit 9ad28d9 to get more accurate results

Additional details and impacted files
@@               Coverage Diff                @@
##           branch-22.12   #12060      +/-   ##
================================================
+ Coverage         87.47%   88.12%   +0.64%     
================================================
  Files               133      135       +2     
  Lines             21826    21997     +171     
================================================
+ Hits              19093    19384     +291     
+ Misses             2733     2613     -120     
Impacted Files Coverage Δ
python/cudf/cudf/io/text.py 91.66% <0.00%> (-8.34%) ⬇️
python/cudf/cudf/core/_base_index.py 81.28% <0.00%> (-4.27%) ⬇️
python/cudf/cudf/io/json.py 92.06% <0.00%> (-2.68%) ⬇️
python/cudf/cudf/utils/utils.py 89.91% <0.00%> (-0.69%) ⬇️
python/dask_cudf/dask_cudf/core.py 73.72% <0.00%> (-0.41%) ⬇️
python/cudf/cudf/io/parquet.py 90.45% <0.00%> (-0.39%) ⬇️
python/dask_cudf/dask_cudf/backends.py 84.90% <0.00%> (-0.37%) ⬇️
python/cudf/cudf/core/column/datetime.py 89.62% <0.00%> (-0.09%) ⬇️
python/cudf/cudf/io/orc.py 92.94% <0.00%> (-0.09%) ⬇️
python/cudf/cudf/core/dataframe.py 93.67% <0.00%> (-0.06%) ⬇️
... and 35 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@karthikeyann
Copy link
Contributor Author

@gpucibot merge

@rapids-bot rapids-bot bot merged commit e788f36 into rapidsai:branch-22.12 Nov 4, 2022
rapids-bot bot pushed a commit that referenced this pull request Nov 16, 2022
This PR adds support for byte_range to be used in nested JSON parser for JSON Lines format (newline delimited JSON http://ndjson.org/)
The record delimiter "New lines" are only expected at the end of each record. Newlines in middle of record or within quotes are not expected and will lead to unknown behaviour. The record delimiters are not context aware in this PR.

This PR provides libcudf APIs, Cython APIs and python tests to enable byte range support. This will allow dask to do distributed/segmented parsing of JSON.

No Dask changes

Addresses part of #11843
Depends on  #12060

Authors:
  - Karthikeyan (https://github.com/karthikeyann)

Approvers:
  - Elias Stehle (https://github.com/elstehle)
  - Lawrence Mitchell (https://github.com/wence-)
  - Robert Maynard (https://github.com/robertmaynard)

URL: #12017
@vyasr vyasr added the 4 - Needs Review Waiting for reviewer to review or respond label Feb 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team 4 - Needs Review Waiting for reviewer to review or respond bug Something isn't working cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants