-
Notifications
You must be signed in to change notification settings - Fork 891
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fixes List offset bug in Nested JSON reader #12060
Fixes List offset bug in Nested JSON reader #12060
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on my limited understanding of the JSON code and the contents of the test the fix looks correct. One small stylistic comment.
@@ -525,14 +525,15 @@ void make_device_json_column(device_span<SymbolT const> input, | |||
auto parent_node_id = ordered_parent_node_ids[i]; | |||
if (parent_node_id != parent_node_sentinel and node_categories[parent_node_id] == NC_LIST) { | |||
// unique item | |||
if (i == 0 || | |||
if (i == 0 or | |||
(col_ids[i - 1] != col_ids[i] or ordered_parent_node_ids[i - 1] != parent_node_id)) { | |||
// scatter to list_offset | |||
d_columns_data[original_col_ids[parent_node_id]] | |||
.child_offsets[row_offsets[parent_node_id]] = ordered_row_offsets[i]; | |||
} | |||
// TODO: verify if this code is right. check with more test cases. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this TODO still needed? Or does this PR cover enough test cases that we now trust its correctness?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will keep the TODO for now.
This PR has test cases to test the conditions. But it needs a few more test cases to be thorough.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix looks fine, and thanks for the new test. I have one question about removing a TODO
.
rerun tests |
1 similar comment
rerun tests |
Codecov ReportBase: 87.47% // Head: 88.12% // Increases project coverage by
Additional details and impacted files@@ Coverage Diff @@
## branch-22.12 #12060 +/- ##
================================================
+ Coverage 87.47% 88.12% +0.64%
================================================
Files 133 135 +2
Lines 21826 21997 +171
================================================
+ Hits 19093 19384 +291
+ Misses 2733 2613 -120
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. ☔ View full report at Codecov. |
@gpucibot merge |
This PR adds support for byte_range to be used in nested JSON parser for JSON Lines format (newline delimited JSON http://ndjson.org/) The record delimiter "New lines" are only expected at the end of each record. Newlines in middle of record or within quotes are not expected and will lead to unknown behaviour. The record delimiters are not context aware in this PR. This PR provides libcudf APIs, Cython APIs and python tests to enable byte range support. This will allow dask to do distributed/segmented parsing of JSON. No Dask changes Addresses part of #11843 Depends on #12060 Authors: - Karthikeyan (https://github.com/karthikeyann) Approvers: - Elias Stehle (https://github.com/elstehle) - Lawrence Mitchell (https://github.com/wence-) - Robert Maynard (https://github.com/robertmaynard) URL: #12017
Description
Fixes List offset end last item write condition bug
If there is a list row followed by empty list in next row, the previous row's end is not written to offsets.
Checklist