Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Adds option to recover from invalid JSON lines in JSON tokenizer #13344

Merged
merged 28 commits into from
Jul 14, 2023

Conversation

elstehle
Copy link
Contributor

@elstehle elstehle commented May 12, 2023

Description

This PR adds the option to recover from invalid JSON lines to the JSON tokenizer.

New option and behaviour:

  • We add the option enable_recover_from_error to json_reader_options. When this option is enabled for a JSON lines input, the reader will recover from a parsing error encountered on an invalid JSON line and continue parsing the next line.
  • When the new option is not enabled, we expect the behaviour of existing functionality to remain untouched.
  • When recovering from invalid JSON lines is enabled, all newline characters that are not enclosed in quotes (i.e., newline characters outside of strings and field names) are interpreted as delimiters of a JSON line. We will introduce a new option that reflects this behaviour for JSON lines inputs that should not recover from errors in a future PR. Hence, this PR introduces the JSON_LINES_STRICT enum but does not yet hook it up.

Implementation details:

  • When recovering from invalid JSON lines is enabled, get_token_stream() will delimit each JSON line with a LineEnd token to facilitate the identification of tokens that belong to an invalid JSON line.
  • We extend the logical stack and introduce a new operation, reset(). A reset() operation resets the logical stack to an empty stack. This is necessary to reset the stack of the pushdown automaton (PDA) after an invalid JSON line to make sure the stack in subsequent lines is not corrupted.
  • We modify the transition and translation table of the finite-state transducer (FST) that is used to generate the push-down automaton's (PDA) stack context operations to emit such a reset() operation, iff recovery is enabled.
  • We modify the transition and translation table of the finite-state transducer (FST) that is used to simulate the full PDA to (1) recover after an invalid JSON line and (2) emit the LineEnd token, iff recovery is enabled.
  • To clean up JSON lines that contain tokens belonging to an invalid line, a token post-processing stage is needed. The post-processing will replace sequences of LineEnd token* ErrorBegin with the sequence StructBegin StructEnd (i.e., effectively a null row) for record orient inputs.
  • This post-processing is implemented by running an FST on the reverse token stream, discarding all tokens between ErrorBegin and the next LineEnd, emitting StructBegin StructEnd pairs on the end of such an invalid line.

This is an initial PR to addresses #12532.

@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label May 12, 2023
@GregoryKimball
Copy link
Contributor

Hello @elstehle are you ready for testing from the Spark side, or would you like to keep this in draft for now?

@elstehle
Copy link
Contributor Author

Just to emphasise, this PR adds the option to recover to the tokenizer, reflected in the get_token_stream() interface. This can be tested on the Spark side.

To reflect the recovery option in the JSON parser, the post-processing of the tokens and the tree generation for the recovery option need to be adapted too, which depend on the exact behaviour we'd like to have. That will come in a follow-up PR.

@elstehle elstehle force-pushed the feature/json-lines-recovery branch from 49680b1 to c0d5100 Compare June 13, 2023 07:17
@github-actions github-actions bot added ci CMake CMake build issue Java Affects Java cuDF API. Python Affects Python cuDF API. labels Jun 13, 2023
@elstehle elstehle force-pushed the feature/json-lines-recovery branch from c0d5100 to 5be6c14 Compare June 13, 2023 07:18
@elstehle elstehle changed the base branch from branch-23.06 to branch-23.08 June 13, 2023 07:18
@github-actions github-actions bot removed Python Affects Python cuDF API. CMake CMake build issue Java Affects Java cuDF API. conda labels Jun 13, 2023
@elstehle elstehle added non-breaking Non-breaking change feature request New feature or request cuIO cuIO issue labels Jun 27, 2023
cpp/include/cudf/io/json.hpp Outdated Show resolved Hide resolved
cpp/tests/io/nested_json_test.cpp Outdated Show resolved Hide resolved
cpp/tests/io/nested_json_test.cpp Outdated Show resolved Hide resolved
cpp/tests/io/nested_json_test.cpp Outdated Show resolved Hide resolved
cpp/tests/io/nested_json_test.cpp Outdated Show resolved Hide resolved
cpp/src/io/json/nested_json.hpp Outdated Show resolved Hide resolved
Copy link
Contributor

@vuule vuule left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Did not evaluate the core algorithm, relying on @karthikeyann for that :D

cpp/include/cudf/io/json.hpp Outdated Show resolved Hide resolved
{
// Look up the symbol group for given symbol
return temp_storage
.sym_to_sgid[min(static_cast<SymbolGroupIdT>(symbol), num_valid_entries - 1U)];
}
};

template <typename symbol_t, std::size_t NUM_SYMBOL_GROUPS, typename pre_map_op_t>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wow, that's not brief!
Thank you for writing these detailed comments.

@karthikeyann karthikeyann self-requested a review July 11, 2023 18:30
Copy link
Contributor

@karthikeyann karthikeyann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!
Great work.

cpp/src/io/fst/lookup_tables.cuh Outdated Show resolved Hide resolved
Copy link
Contributor

@karthikeyann karthikeyann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍

@karthikeyann karthikeyann changed the title [FEA] Adds option to recover from invalid JSON lines in JSON tokenizer [FEA] Adds option to recover from invalid JSON lines in JSON tokenizer Jul 14, 2023
@elstehle
Copy link
Contributor Author

/merge

@rapids-bot rapids-bot bot merged commit 2436e0b into rapidsai:branch-23.08 Jul 14, 2023
53 checks passed
rapids-bot bot pushed a commit that referenced this pull request Aug 9, 2023
This PR simplifies and cleans up the JSON reader's pushdown automaton. 

The pushdown automaton takes as input two arrays:
1. The JSON's input characters
2. The stack context for each character (`{` - `JSON object`, `[` - `JSON array`, `_` - `Root of JSON`)

Previously, we were fusing the two arrays and materializing them straight to the symbol group id for each combination. A symbol group id serves as the column of the transition table. The symbol group ids array was then used as input to the finite state transducer (FST).

After the [recent refactor of the FST](#13344) lookup tables, the FST has become more flexible. It now supports arbitrary iterators and the symbol group id lookup table (that maps a symbol to a symbol group id) can now be implemented by a simple function object. 

This PR takes advantage of the FST's ability to take fancy iterators. We now zip the `json_input` and `stack_context` symbols and pass that `zip_iterator` to the FST.

Authors:
  - Elias Stehle (https://github.com/elstehle)
  - Vukasin Milovanovic (https://github.com/vuule)
  - Karthikeyan (https://github.com/karthikeyann)

Approvers:
  - Karthikeyan (https://github.com/karthikeyann)
  - Vukasin Milovanovic (https://github.com/vuule)

URL: #13716
rapids-bot bot pushed a commit that referenced this pull request Aug 11, 2023
#13344 introduced a performance regression to the FST benchmarks that showed as much as a 35% performance degradation. 

It seems that, after the refactor in the above PR, compiler optimization heuristics are deciding differently on loop unrolling in the part of the FST that's writing out transduced symbols. 

As a fix, we are enforcing to not unroll that loop.

Authors:
  - Elias Stehle (https://github.com/elstehle)

Approvers:
  - Karthikeyan (https://github.com/karthikeyann)
  - David Wendt (https://github.com/davidwendt)

URL: #13850
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cuIO cuIO issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

4 participants