Avoid conversion to String in JsonReader, JsonNodeReader. #15693

gianm · 2024-01-16T07:14:26Z

These readers were running UTF-8 decode on the provided entity to convert it to a String, then parsing the String as JSON. The patch changes them to parse the provided entity's input stream directly.

In order to preserve the nice error messages that include parse errors, the readers now need to open the entity again on the error path, to re-read the data. To make this possible, the InputEntity#open contract is tightened to require the ability to re-open entities, and existing InputEntity implementations are updated to allow re-opening.

This patch also renames JsonLineReaderBenchmark to JsonInputFormatBenchmark, updates it to benchmark all three JSON readers, and adds a case that reads fields out of the parsed row (not just creates it).

Benchmarks below. Findings:

The reader and node_reader (used if useJsonNodeReader is set) readers are ~22% faster on parseAndRead. reader is the default for stream ingest; node_reader is used for stream ingest if useJsonNodeReader is set.
The high variance in the patch branch with node_reader + discovery = true happened because my computer fell asleep while running that one. I didn't re-run it, because I felt the other results were compelling enough.
The line_reader wasn't changed in this patch and performance is similar (likely within margin of error). This one is default for batch ingest, and used for streaming if assumeNewlineDelimited is set.

So, the speedups are mainly for streaming ingest. But #15681 has a similar speedup for line_reader if that's the one you care about!

master

Benchmark                              (discovery)  (readerTypeString)  Mode  Cnt     Score    Error  Units
JsonInputFormatBenchmark.parseAndRead        false              reader  avgt   10  3382.119 ± 44.382  ns/op
JsonInputFormatBenchmark.parseAndRead        false         node_reader  avgt   10  3417.577 ± 31.790  ns/op
JsonInputFormatBenchmark.parseAndRead        false         line_reader  avgt   10  2914.081 ± 13.876  ns/op
JsonInputFormatBenchmark.parseAndRead         true              reader  avgt   10  3517.794 ±  5.322  ns/op
JsonInputFormatBenchmark.parseAndRead         true         node_reader  avgt   10  3527.872 ± 44.065  ns/op
JsonInputFormatBenchmark.parseAndRead         true         line_reader  avgt   10  2891.340 ± 19.949  ns/op

patch

Benchmark                              (discovery)  (readerTypeString)  Mode  Cnt     Score     Error  Units
JsonInputFormatBenchmark.parseAndRead        false              reader  avgt   10  2665.019 ±  16.666  ns/op
JsonInputFormatBenchmark.parseAndRead        false         node_reader  avgt   10  2732.038 ±   4.243  ns/op
JsonInputFormatBenchmark.parseAndRead        false         line_reader  avgt   10  2853.801 ±   4.083  ns/op
JsonInputFormatBenchmark.parseAndRead         true              reader  avgt   10  2734.092 ±   8.940  ns/op
JsonInputFormatBenchmark.parseAndRead         true         node_reader  avgt   10  2773.722 ± 200.142  ns/op
JsonInputFormatBenchmark.parseAndRead         true         line_reader  avgt   10  2845.418 ±   2.600  ns/op

These readers were running UTF-8 decode on the provided entity to convert it to a String, then parsing the String as JSON. The patch changes them to parse the provided entity's input stream directly. In order to preserve the nice error messages that include parse errors, the readers now need to open the entity again on the error path, to re-read the data. To make this possible, the InputEntity#open contract is tightened to require the ability to re-open entities, and existing InputEntity implementations are updated to allow re-opening. This patch also renames JsonLineReaderBenchmark to JsonInputFormatBenchmark, updates it to benchmark all three JSON readers, and adds a case that reads fields out of the parsed row (not just creates it).

gianm · 2024-01-16T09:09:01Z

This patch is running into a problem that is addressed in #15681 by the addition of IntermediateRowParsingReader#intermediateRowAsString. Without that, the switch to using ByteEntity as the intermediate form makes the error messages show "ByteEntity" rather than the actual intermediate data.

So, when that other one is merged, I'll merge master into this PR and it should be OK.

github-actions · 2024-03-19T00:16:30Z

This pull request has been marked as stale due to 60 days of inactivity.
It will be closed in 4 weeks if no further activity occurs. If you think
that's incorrect or this pull request should instead be reviewed, please simply
write any comment. Even if closed, you can still revive the PR at any time or
discuss it on the dev@druid.apache.org list.
Thank you for your contributions.

Renames JsonLineReaderBenchmark to JsonInputFormatBenchmark, and enhances it to test various readers (JsonReader, JsonLineReader, JsonNodeReader) as well as to test with/without field discovery.

gianm added the Performance label Jan 16, 2024

github-actions bot added the Area - Ingestion label Jan 16, 2024

Fixes for static analysis.

b5f530a

gianm added 2 commits January 18, 2024 03:45

Merge branch 'master' into json-avoid-string

9e7ae5b

Implement intermediateRowAsString in JsonReader.

a7552a9

github-actions bot added the stale label Mar 19, 2024

gianm removed the stale label Mar 20, 2024

Merge branch 'master' into json-avoid-string

0aeca82

clintropolis approved these changes Mar 22, 2024

View reviewed changes

Enhanced JsonInputFormatBenchmark.

624d3b9

Renames JsonLineReaderBenchmark to JsonInputFormatBenchmark, and enhances it to test various readers (JsonReader, JsonLineReader, JsonNodeReader) as well as to test with/without field discovery.

github-actions bot added the Area - Dependencies label Mar 22, 2024

gianm mentioned this pull request Mar 22, 2024

Enhanced JsonInputFormatBenchmark. #16188

Closed

gianm merged commit 58a8a23 into apache:master Mar 26, 2024
86 checks passed

gianm deleted the json-avoid-string branch March 26, 2024 15:16

adarshsanjeev added this to the 30.0.0 milestone May 6, 2024

adarshsanjeev mentioned this pull request May 28, 2024

[DRAFT] 30.0.0 release notes #16505

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Avoid conversion to String in JsonReader, JsonNodeReader. #15693

Avoid conversion to String in JsonReader, JsonNodeReader. #15693

gianm commented Jan 16, 2024 •

edited

Loading

gianm commented Jan 16, 2024

github-actions bot commented Mar 19, 2024

Avoid conversion to String in JsonReader, JsonNodeReader. #15693

Avoid conversion to String in JsonReader, JsonNodeReader. #15693

Conversation

gianm commented Jan 16, 2024 • edited Loading

gianm commented Jan 16, 2024

github-actions bot commented Mar 19, 2024

gianm commented Jan 16, 2024 •

edited

Loading