[FEA] Adds option to recover from invalid JSON lines in JSON tokenizer #13344

elstehle · 2023-05-12T11:04:07Z

Description

This PR adds the option to recover from invalid JSON lines to the JSON tokenizer.

New option and behaviour:

We add the option enable_recover_from_error to json_reader_options. When this option is enabled for a JSON lines input, the reader will recover from a parsing error encountered on an invalid JSON line and continue parsing the next line.
When the new option is not enabled, we expect the behaviour of existing functionality to remain untouched.
When recovering from invalid JSON lines is enabled, all newline characters that are not enclosed in quotes (i.e., newline characters outside of strings and field names) are interpreted as delimiters of a JSON line. In a future PR, we will introduce a new option that applies the same delimiting behaviour to JSON lines inputs that should not recover from errors. Hence, this PR introduces the JSON_LINES_STRICT enum but does not yet hook it up.
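To make the delimiting rule concrete, here is a minimal host-side sketch of it. This is not the tokenizer's code: the tokenizer runs as an FST on the GPU, and `split_json_lines` is a hypothetical helper used only for illustration. It treats every newline outside of double quotes as a line delimiter and assumes backslash escapes inside strings.

```cpp
#include <string>
#include <string_view>
#include <vector>

// Hypothetical helper (not part of libcudf): splits `input` at every newline
// that is not enclosed in double quotes.
std::vector<std::string> split_json_lines(std::string_view input)
{
  std::vector<std::string> lines;
  std::string current;
  bool in_string = false;  // true while between an opening and a closing quote
  bool escaped   = false;  // true if the previous character was a backslash inside a string

  for (char c : input) {
    if (in_string) {
      // Inside a string or field name: newlines are kept as ordinary characters
      current.push_back(c);
      if (escaped) {
        escaped = false;
      } else if (c == '\\') {
        escaped = true;
      } else if (c == '"') {
        in_string = false;
      }
    } else if (c == '\n') {
      // Unquoted newline: end of the current JSON line
      lines.push_back(current);
      current.clear();
    } else {
      if (c == '"') { in_string = true; }
      current.push_back(c);
    }
  }
  // Emit a trailing line that is not terminated by a newline
  if (!current.empty()) { lines.push_back(current); }
  return lines;
}
```

Note that under this rule an unterminated quote on a broken line also swallows the following newline, since that newline then counts as enclosed in quotes.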

Implementation details:

When recovering from invalid JSON lines is enabled, get_token_stream() will delimit each JSON line with a LineEnd token to facilitate the identification of tokens that belong to an invalid JSON line.
We extend the logical stack and introduce a new operation, reset(). A reset() operation resets the logical stack to an empty stack. This is necessary to reset the stack of the pushdown automaton (PDA) after an invalid JSON line to make sure the stack in subsequent lines is not corrupted.
We modify the transition and translation table of the finite-state transducer (FST) that is used to generate the push-down automaton's (PDA) stack context operations to emit such a reset() operation, iff recovery is enabled.
We modify the transition and translation table of the finite-state transducer (FST) that is used to simulate the full PDA to (1) recover after an invalid JSON line and (2) emit the LineEnd token, iff recovery is enabled.
To clean up JSON lines that contain tokens belonging to an invalid line, a token post-processing stage is needed. For record-oriented inputs, the post-processing will replace sequences of LineEnd token* ErrorBegin with the sequence StructBegin StructEnd (i.e., effectively a null row).
This post-processing is implemented by running an FST on the reverse token stream, discarding all tokens between ErrorBegin and the next LineEnd, and emitting a StructBegin StructEnd pair at the end of such an invalid line.
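The reverse-stream post-processing can be sketched on the host as follows. This is a simplified illustration under assumptions, not the FST from this PR: the `token` enum is a reduced, hypothetical token set (`Other` stands in for every token not named in the description), `replace_invalid_lines` is a made-up name, and the sketch assumes every invalid line is preceded by a LineEnd token.

```cpp
#include <algorithm>
#include <vector>

// Reduced, hypothetical token set for illustration only
enum class token { StructBegin, StructEnd, Other, LineEnd, ErrorBegin };

// Replaces each `LineEnd token* ErrorBegin` sequence with `StructBegin StructEnd`
// by walking the token stream back to front.
std::vector<token> replace_invalid_lines(std::vector<token> const& tokens)
{
  std::vector<token> reversed_out;
  bool discarding = false;  // true while inside an invalid line (seen from its end)

  for (auto it = tokens.rbegin(); it != tokens.rend(); ++it) {
    if (discarding) {
      // Drop everything up to and including the LineEnd that starts the invalid line
      if (*it == token::LineEnd) { discarding = false; }
      continue;
    }
    if (*it == token::ErrorBegin) {
      // Emitted in reverse order, so the forward stream reads StructBegin StructEnd
      reversed_out.push_back(token::StructEnd);
      reversed_out.push_back(token::StructBegin);
      discarding = true;
    } else {
      reversed_out.push_back(*it);
    }
  }
  std::reverse(reversed_out.begin(), reversed_out.end());
  return reversed_out;
}
```

Walking the stream in reverse means the error marker is seen before the tokens it invalidates, so a single pass with one bit of state is enough to decide which tokens to drop.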

This is an initial PR to address #12532.

GregoryKimball · 2023-05-15T18:29:47Z

Hello @elstehle, are you ready for testing from the Spark side, or would you like to keep this in draft for now?

elstehle · 2023-05-16T18:47:03Z

Just to emphasise, this PR adds the option to recover to the tokenizer, reflected in the get_token_stream() interface. This can be tested on the Spark side.

To reflect the recovery option in the JSON parser, the post-processing of the tokens and the tree generation for the recovery option need to be adapted too, which depend on the exact behaviour we'd like to have. That will come in a follow-up PR.

cpp/include/cudf/io/json.hpp

cpp/tests/io/nested_json_test.cpp

cpp/src/io/json/nested_json.hpp

vuule

Looks good. Did not evaluate the core algorithm, relying on @karthikeyann for that :D

cpp/include/cudf/io/json.hpp

vuule · 2023-07-07T23:16:57Z

cpp/src/io/fst/lookup_tables.cuh

  {
    // Look up the symbol group for given symbol
    return temp_storage
      .sym_to_sgid[min(static_cast<SymbolGroupIdT>(symbol), num_valid_entries - 1U)];
  }
 };

+template <typename symbol_t, std::size_t NUM_SYMBOL_GROUPS, typename pre_map_op_t>


wow, that's not brief!
Thank you for writing these detailed comments.

cpp/src/io/json/nested_json_gpu.cu

karthikeyann

Looks good to me!
Great work.

cpp/src/io/fst/lookup_tables.cuh

karthikeyann

LGTM 👍

elstehle · 2023-07-14T16:30:46Z

/merge

This PR simplifies and cleans up the JSON reader's pushdown automaton. The pushdown automaton takes as input two arrays:

1. The JSON's input characters
2. The stack context for each character (`{` - `JSON object`, `[` - `JSON array`, `_` - `Root of JSON`)

Previously, we were fusing the two arrays and materializing them straight to the symbol group id for each combination. A symbol group id serves as the column of the transition table. The symbol group ids array was then used as input to the finite state transducer (FST).

After the [recent refactor of the FST](#13344) lookup tables, the FST has become more flexible. It now supports arbitrary iterators and the symbol group id lookup table (that maps a symbol to a symbol group id) can now be implemented by a simple function object.

This PR takes advantage of the FST's ability to take fancy iterators. We now zip the `json_input` and `stack_context` symbols and pass that `zip_iterator` to the FST.

Authors:
- Elias Stehle (https://github.com/elstehle)
- Vukasin Milovanovic (https://github.com/vuule)
- Karthikeyan (https://github.com/karthikeyann)

Approvers:
- Karthikeyan (https://github.com/karthikeyann)
- Vukasin Milovanovic (https://github.com/vuule)

URL: #13716
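As a rough host-side illustration of the idea in #13716 (not the actual device code), the sketch below pairs each input character with its stack context and maps the pair to a symbol group id through a plain function object. The names `to_symbol_group_id`, `char_class`, `stack_class` and the numbering of the groups are all assumptions made for this example; the real lookup works on a `zip_iterator` on the GPU and uses its own symbol groups.

```cpp
#include <algorithm>
#include <cstddef>
#include <string_view>
#include <utility>
#include <vector>

// Hypothetical, reduced character classes: { [ } ] " and "everything else"
constexpr int num_char_classes = 6;

int char_class(char c)
{
  switch (c) {
    case '{': return 0;
    case '[': return 1;
    case '}': return 2;
    case ']': return 3;
    case '"': return 4;
    default: return 5;
  }
}

// Stack context symbols as listed above: '_' root, '{' object, '[' array
int stack_class(char ctx)
{
  switch (ctx) {
    case '{': return 1;
    case '[': return 2;
    default: return 0;  // '_' (root of JSON)
  }
}

// Function object that maps a zipped (input character, stack context) symbol
// to a symbol group id, i.e. a column index of a transition table.
struct to_symbol_group_id {
  int operator()(std::pair<char, char> const& zipped) const
  {
    return stack_class(zipped.second) * num_char_classes + char_class(zipped.first);
  }
};

// Host stand-in for iterating a zip of `json_input` and `stack_context`
std::vector<int> symbol_group_ids(std::string_view json_input, std::string_view stack_context)
{
  std::vector<int> ids;
  auto const n = std::min(json_input.size(), stack_context.size());
  ids.reserve(n);
  to_symbol_group_id const lookup{};
  for (std::size_t i = 0; i < n; ++i) {
    ids.push_back(lookup({json_input[i], stack_context[i]}));
  }
  return ids;
}
```

In this sketch the ids are computed on the fly from the pair, so no fused intermediate array is needed as a separate input; the vector returned here exists only so the result can be inspected.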

#13344 introduced a performance regression in the FST benchmarks, which showed as much as a 35% performance degradation. It seems that, after the refactor in the above PR, the compiler's optimization heuristics decide differently on loop unrolling in the part of the FST that writes out the transduced symbols. As a fix, we now prevent that loop from being unrolled.

Authors:
- Elias Stehle (https://github.com/elstehle)

Approvers:
- Karthikeyan (https://github.com/karthikeyann)
- David Wendt (https://github.com/davidwendt)

URL: #13850
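The description of #13850 does not show the directive that was used, so the following is only a generic sketch of how unrolling of a write-out loop can be suppressed. `write_out` is a hypothetical function; the pragmas are the ones documented for nvcc and clang (`#pragma unroll 1`) and for GCC (`#pragma GCC unroll 1`).

```cpp
#include <cstddef>
#include <vector>

// Hypothetical write-out loop: copies transduced symbols to the output while
// asking the compiler not to unroll the loop.
void write_out(std::vector<char> const& transduced, std::vector<char>& out)
{
  std::size_t const n = transduced.size();
  out.resize(n);
#if defined(__CUDACC__) || defined(__clang__)
#pragma unroll 1
#elif defined(__GNUC__)
#pragma GCC unroll 1
#endif
  for (std::size_t i = 0; i < n; ++i) {
    out[i] = transduced[i];
  }
}
```

An unroll factor of 1 leaves the loop semantics unchanged; it only overrides the compiler's own heuristic for this one loop.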

github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label May 12, 2023

elstehle mentioned this pull request May 12, 2023

[FEA] JSON validator for json strings given in strings column #12532

Closed

GregoryKimball mentioned this pull request Jun 7, 2023

[FEA] JSON reader improvements for Spark-RAPIDS #13525

Open

elstehle force-pushed the feature/json-lines-recovery branch from 49680b1 to c0d5100 Compare June 13, 2023 07:17

github-actions bot added ci CMake CMake build issue Java Affects Java cuDF API. Python Affects Python cuDF API. labels Jun 13, 2023

elstehle added 2 commits June 13, 2023 00:17

adds option to recover from invalid JSON lines

357eca8

fixes format

5be6c14

elstehle force-pushed the feature/json-lines-recovery branch from c0d5100 to 5be6c14 Compare June 13, 2023 07:18

elstehle changed the base branch from branch-23.06 to branch-23.08 June 13, 2023 07:18

fixes namespace

7e48c35

github-actions bot removed Python Affects Python cuDF API. CMake CMake build issue Java Affects Java cuDF API. conda labels Jun 13, 2023

elstehle added 6 commits June 13, 2023 02:56

resolves merge conflicts

31c5cbe

adds fst for token post-processing

9b89685

Merge remote-tracking branch 'upstream/branch-23.08' into feature/jso…

16ca651

…n-lines-recovery

refactors fst and lookup tables, adds recovery mode

9cacebd

Merge remote-tracking branch 'upstream/branch-23.08' into feature/jso…

6cc468e

…n-lines-recovery

fixes format

9eed9c6

elstehle added non-breaking Non-breaking change feature request New feature or request cuIO cuIO issue labels Jun 27, 2023

elstehle added 3 commits July 7, 2023 01:26

adds documentation on lookup table factories

2391fa5

uses switch/case instead of if/else

5c9eccc

makes recovery_mode option an enum instead of bool

b3656bb

karthikeyann reviewed Jul 7, 2023

elstehle added 2 commits July 7, 2023 10:09

addresses review comments

bc78ff8

removes raw_ptr_cast

446ddbb

vuule approved these changes Jul 7, 2023

elstehle added 2 commits July 9, 2023 23:50

Merge remote-tracking branch 'upstream/branch-23.08' into feature/jso…

4b82e1f

…n-lines-recovery

renames enum option to recover from invalid lines

7e8d142

elstehle requested a review from karthikeyann July 10, 2023 14:21

GregoryKimball assigned elstehle Jul 10, 2023

ttnghia self-requested a review July 10, 2023 20:37

karthikeyann reviewed Jul 11, 2023

cpp/src/io/json/nested_json_gpu.cu

karthikeyann self-requested a review July 11, 2023 18:30

elstehle added 2 commits July 11, 2023 23:49

Merge remote-tracking branch 'upstream/branch-23.08' into feature/jso…

777a940

…n-lines-recovery

clarifies that post-process requirements

442cb11

karthikeyann reviewed Jul 14, 2023

cpp/src/io/fst/lookup_tables.cuh (outdated)

elstehle added 3 commits July 13, 2023 23:21

Merge remote-tracking branch 'upstream/branch-23.08' into feature/jso…

19f231b

…n-lines-recovery

removes premap_op from translation table

34da916

removes premap_op from transition table

154ee00

karthikeyann approved these changes Jul 14, 2023

karthikeyann changed the title ~~[FEA] Adds option to recover from invalid JSON lines in JSON tokenizer~~ [FEA] Adds option to recover from invalid JSON lines in JSON tokenizer Jul 14, 2023

rapids-bot bot merged commit 2436e0b into rapidsai:branch-23.08 Jul 14, 2023
53 checks passed

elstehle mentioned this pull request Jul 18, 2023

Refactors JSON reader's pushdown automaton #13716

Merged

3 tasks

elstehle mentioned this pull request Aug 11, 2023

Fixes a performance regression in FST #13850

Merged

3 tasks

elstehle mentioned this pull request Oct 18, 2023

[BUG] Table.readJson dropping valid JSON lines #14282

Closed

karthikeyann mentioned this pull request May 8, 2024

Enable get_token_stream to include LineEnd tokens with optional parameter. #15605

Closed

3 tasks
