-
Notifications
You must be signed in to change notification settings - Fork 232
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use improved CUDF JSON validation #11464
Conversation
Signed-off-by: Robert (Bobby) Evans <bobby@apache.org>
build |
build |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, I had one minor comment.
sql-plugin/src/main/scala/org/apache/spark/sql/rapids/GpuJsonReadCommon.scala
Outdated
Show resolved
Hide resolved
build |
@abellina please take another look |
build |
build |
@abellina please take another look. My changes to the text buffering accidentally filtered out too much for hive text files and our tests caught it. I made the change so it is now specific to JSON and added a test file to verify that the case when all of the lines are bogus or empty works as expected. |
override def createBufferer(estimatedSize: Long, | ||
lineSeparatorInRead: Array[Byte]): HostLineBufferer = | ||
new HostLineBufferer(estimatedSize, lineSeparatorInRead) | ||
new HostLineBufferer(estimatedSize, lineSeparatorInRead, true) | ||
} | ||
|
||
/** | ||
* Buffer the lines in a single HostMemoryBuffer with the separator inserted inbetween each of | ||
* the lines. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit, could you add a comment that HostLineBufferer is used for JSON and one for HostStringColBufferer that should be used for other formats?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just a nit.
This depends on rapidsai/cudf#15968 that was merged in recently
This fixes #10457
This fixes #10479
This fixes #10534
This fixes #10911
It also fixes some issue with JSON scan where if all of the input is invalid/empty lines before it would error out, but now our work around fixes that. It is not a final solution though.