Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid conversion to String in JsonReader, JsonNodeReader. #15693

Merged
merged 6 commits into from
Mar 26, 2024

Conversation

gianm
Copy link
Contributor

@gianm gianm commented Jan 16, 2024

These readers were running UTF-8 decode on the provided entity to convert it to a String, then parsing the String as JSON. The patch changes them to parse the provided entity's input stream directly.

In order to preserve the nice error messages that include parse errors, the readers now need to open the entity again on the error path, to re-read the data. To make this possible, the InputEntity#open contract is tightened to require the ability to re-open entities, and existing InputEntity implementations are updated to allow re-opening.

This patch also renames JsonLineReaderBenchmark to JsonInputFormatBenchmark, updates it to benchmark all three JSON readers, and adds a case that reads fields out of the parsed row (not just creates it).

Benchmarks below. Findings:

  • The reader and node_reader (used if useJsonNodeReader is set) readers are ~22% faster on parseAndRead. reader is the default for stream ingest; node_reader is used for stream ingest if useJsonNodeReader is set.
  • The high variance in the patch branch with node_reader + discovery = true happened because my computer fell asleep while running that one. I didn't re-run it, because I felt the other results were compelling enough.
  • The line_reader wasn't changed in this patch and performance is similar (likely within margin of error). This one is default for batch ingest, and used for streaming if assumeNewlineDelimited is set.

So, the speedups are mainly for streaming ingest. But #15681 has a similar speedup for line_reader if that's the one you care about!

master

Benchmark                              (discovery)  (readerTypeString)  Mode  Cnt     Score    Error  Units
JsonInputFormatBenchmark.parseAndRead        false              reader  avgt   10  3382.119 ± 44.382  ns/op
JsonInputFormatBenchmark.parseAndRead        false         node_reader  avgt   10  3417.577 ± 31.790  ns/op
JsonInputFormatBenchmark.parseAndRead        false         line_reader  avgt   10  2914.081 ± 13.876  ns/op
JsonInputFormatBenchmark.parseAndRead         true              reader  avgt   10  3517.794 ±  5.322  ns/op
JsonInputFormatBenchmark.parseAndRead         true         node_reader  avgt   10  3527.872 ± 44.065  ns/op
JsonInputFormatBenchmark.parseAndRead         true         line_reader  avgt   10  2891.340 ± 19.949  ns/op

patch

Benchmark                              (discovery)  (readerTypeString)  Mode  Cnt     Score     Error  Units
JsonInputFormatBenchmark.parseAndRead        false              reader  avgt   10  2665.019 ±  16.666  ns/op
JsonInputFormatBenchmark.parseAndRead        false         node_reader  avgt   10  2732.038 ±   4.243  ns/op
JsonInputFormatBenchmark.parseAndRead        false         line_reader  avgt   10  2853.801 ±   4.083  ns/op
JsonInputFormatBenchmark.parseAndRead         true              reader  avgt   10  2734.092 ±   8.940  ns/op
JsonInputFormatBenchmark.parseAndRead         true         node_reader  avgt   10  2773.722 ± 200.142  ns/op
JsonInputFormatBenchmark.parseAndRead         true         line_reader  avgt   10  2845.418 ±   2.600  ns/op

These readers were running UTF-8 decode on the provided entity to
convert it to a String, then parsing the String as JSON. The patch
changes them to parse the provided entity's input stream directly.

In order to preserve the nice error messages that include parse errors,
the readers now need to open the entity again on the error path, to
re-read the data. To make this possible, the InputEntity#open contract
is tightened to require the ability to re-open entities, and existing
InputEntity implementations are updated to allow re-opening.

This patch also renames JsonLineReaderBenchmark to JsonInputFormatBenchmark,
updates it to benchmark all three JSON readers, and adds a case that reads
fields out of the parsed row (not just creates it).
@gianm
Copy link
Contributor Author

gianm commented Jan 16, 2024

This patch is running into a problem that is addressed in #15681 by the addition of IntermediateRowParsingReader#intermediateRowAsString. Without that, the switch to using ByteEntity as the intermediate form makes the error messages show "ByteEntity" rather than the actual intermediate data.

So, when that other one is merged, I'll merge master into this PR and it should be OK.

Copy link

This pull request has been marked as stale due to 60 days of inactivity.
It will be closed in 4 weeks if no further activity occurs. If you think
that's incorrect or this pull request should instead be reviewed, please simply
write any comment. Even if closed, you can still revive the PR at any time or
discuss it on the dev@druid.apache.org list.
Thank you for your contributions.

@github-actions github-actions bot added the stale label Mar 19, 2024
@gianm gianm removed the stale label Mar 20, 2024
Renames JsonLineReaderBenchmark to JsonInputFormatBenchmark, and enhances it to
test various readers (JsonReader, JsonLineReader, JsonNodeReader) as well as
to test with/without field discovery.
@gianm gianm merged commit 58a8a23 into apache:master Mar 26, 2024
86 checks passed
@gianm gianm deleted the json-avoid-string branch March 26, 2024 15:16
@adarshsanjeev adarshsanjeev added this to the 30.0.0 milestone May 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants