-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Avoid conversion to String in JsonReader, JsonNodeReader. #15693
Conversation
These readers were running UTF-8 decode on the provided entity to convert it to a String, then parsing the String as JSON. The patch changes them to parse the provided entity's input stream directly. In order to preserve the nice error messages that include parse errors, the readers now need to open the entity again on the error path, to re-read the data. To make this possible, the InputEntity#open contract is tightened to require the ability to re-open entities, and existing InputEntity implementations are updated to allow re-opening. This patch also renames JsonLineReaderBenchmark to JsonInputFormatBenchmark, updates it to benchmark all three JSON readers, and adds a case that reads fields out of the parsed row (not just creates it).
This patch is running into a problem that is addressed in #15681 by the addition of So, when that other one is merged, I'll merge master into this PR and it should be OK. |
This pull request has been marked as stale due to 60 days of inactivity. |
Renames JsonLineReaderBenchmark to JsonInputFormatBenchmark, and enhances it to test various readers (JsonReader, JsonLineReader, JsonNodeReader) as well as to test with/without field discovery.
These readers were running UTF-8 decode on the provided entity to convert it to a String, then parsing the String as JSON. The patch changes them to parse the provided entity's input stream directly.
In order to preserve the nice error messages that include parse errors, the readers now need to open the entity again on the error path, to re-read the data. To make this possible, the InputEntity#open contract is tightened to require the ability to re-open entities, and existing InputEntity implementations are updated to allow re-opening.
This patch also renames JsonLineReaderBenchmark to JsonInputFormatBenchmark, updates it to benchmark all three JSON readers, and adds a case that reads fields out of the parsed row (not just creates it).
Benchmarks below. Findings:
reader
andnode_reader
(used ifuseJsonNodeReader
is set) readers are ~22% faster onparseAndRead
.reader
is the default for stream ingest;node_reader
is used for stream ingest ifuseJsonNodeReader
is set.patch
branch withnode_reader
+discovery = true
happened because my computer fell asleep while running that one. I didn't re-run it, because I felt the other results were compelling enough.line_reader
wasn't changed in this patch and performance is similar (likely within margin of error). This one is default for batch ingest, and used for streaming ifassumeNewlineDelimited
is set.So, the speedups are mainly for streaming ingest. But #15681 has a similar speedup for
line_reader
if that's the one you care about!