[FEA] JSON reader: Provide option to treat quoted strings as null values #10283

Closed
andygrove opened this issue Feb 14, 2022 · 8 comments
Labels: cuIO, feature request, Spark

@andygrove
Contributor

andygrove commented Feb 14, 2022

Is your feature request related to a problem? Please describe.
This is part of NVIDIA/spark-rapids#9

In order to be consistent with Spark when reading JSON on the GPU, we would like to ask cuDF to read non-string primitive values as strings and then cast them to the required type. This approach already works well for valid inputs but we do not have a way to treat quoted strings as null to match Spark's behavior.

Here is an example JSON input to demonstrate the problem.

{ "number": true }
{ "number": "true" }

The first entry is a valid JSON boolean value and the second entry is a JSON string. If we ask cuDF to read this attribute as a string then we get the same value in both cases. Spark would treat the second entry as invalid and return null.
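
To make the ambiguity concrete, here is a minimal pure-Python sketch that mimics reading the `number` field as a string. It uses only the standard library and does not call cuDF or Spark; the helper name `read_as_string` is hypothetical and exists only for this illustration.

```python
import json

# The two sample records from the issue, one JSON document per line.
SAMPLE = '{ "number": true }\n{ "number": "true" }'


def read_as_string(lines, field):
    """Mimic a reader that returns every value of `field` as a string."""
    out = []
    for line in lines.splitlines():
        value = json.loads(line)[field]
        # A JSON string is returned as-is; any other primitive is rendered
        # with its JSON spelling (true, false, null, numbers).
        out.append(value if isinstance(value, str) else json.dumps(value))
    return out


as_strings = read_as_string(SAMPLE, "number")
print(as_strings)  # ['true', 'true']
```

Both rows come back as the same string, so a later cast to boolean cannot tell that the second row was originally quoted and should become null.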

Describe the solution you'd like
There are a few possible approaches to this:

  1. Have a way to read the raw value without any parsing, so the resulting column will include the quotes
  2. Add the ability to ask cuDF to read non-string values as strings but to interpret any values that are JSON strings (quoted) as null
  3. Return all the data as strings but also include a bitmask indicating which ones were quoted.
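
The three options can be compared on the sample input with a short pure-Python sketch. This is only an illustration of the shape of each result, not cuDF code: `parse_field` is a hypothetical helper, and re-serializing with `json.dumps` only approximates the raw token text (escapes and number formatting may differ from the original bytes).

```python
import json

SAMPLE = '{ "number": true }\n{ "number": "true" }'


def parse_field(lines, field):
    """Yield (text_without_quotes, was_quoted) for each record."""
    for line in lines.splitlines():
        value = json.loads(line)[field]
        if isinstance(value, str):
            yield value, True
        else:
            yield json.dumps(value), False


pairs = list(parse_field(SAMPLE, "number"))

# Option 1: raw tokens, with the quotes kept on JSON strings.
raw = [json.dumps(text) if quoted else text for text, quoted in pairs]

# Option 2: non-string values as strings, quoted values replaced by null.
null_on_quoted = [None if quoted else text for text, quoted in pairs]

# Option 3: every value as an unquoted string plus a "was quoted" mask.
values = [text for text, _ in pairs]
quoted_mask = [quoted for _, quoted in pairs]

print(raw)             # ['true', '"true"']
print(null_on_quoted)  # ['true', None]
print(values, quoted_mask)  # ['true', 'true'] [False, True]
```

Options 1 and 3 keep enough information for the caller to decide what to do with quoted values, while option 2 bakes the Spark-style decision into the reader.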

Describe alternatives you've considered
None

Additional context
None

@github-actions

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@sameerz added the Spark label on Mar 23, 2022
@revans2
Contributor

revans2 commented May 16, 2022

Moved this to P1 as it is a corner case that is not that common. It would be very nice to be able to support this sooner rather than later.

@vuule
Contributor

vuule commented Jun 7, 2022

> Have a way to read the raw value without any parsing, so the resulting column will include the quotes

In this case, is it okay to keep the quotes on the values in the actual string columns as well?

@vuule added the cuIO label on Jun 8, 2022
@andygrove
Contributor Author

> > Have a way to read the raw value without any parsing, so the resulting column will include the quotes
>
> In this case, is it okay to keep the quotes on the values in the actual string columns as well?

If we keep the quotes, then we will have to perform an additional transformation in the plugin to remove them, so this doesn't seem ideal.

If we can get the raw string value (without quotes) and an indication of whether the value was quoted, then I think we have everything we need.

@GregoryKimball removed the Needs Triage label on Jun 28, 2022
@GregoryKimball added this to the Nested JSON reader milestone on Jun 28, 2022
@elstehle
Contributor

elstehle commented Sep 6, 2022

Once #11574 is merged, the new nested JSON reader (currently available as experimental) will introduce a keep_quotes option that retains the quote characters on string values. Would this sufficiently address this feature request?

Otherwise, I would need to better understand the expected behaviour.

> Here is an example JSON input to demonstrate the problem.
>
> { "number": true }
> { "number": "true" }
>
> The first entry is a valid JSON boolean value and the second entry is a JSON string. If we ask cuDF to read this attribute as a string then we get the same value in both cases. Spark would treat the second entry as invalid and return null.

Is this a mix-up, or would Spark really treat the second value as null?

I am thinking of a mapping from a tuple (target_type, JSON_type) to {valid, invalid}, where JSON_type distinguishes between {string value, non-string value}.
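
Such a mapping could be sketched as a small lookup table. The code below is an assumption-laden illustration in plain Python, not part of cuDF: the names `JsonKind`, `VALIDITY`, and `apply_validity` are hypothetical, and only the two `"bool"` entries follow from the behaviour described in this thread; the `"string"` entry is an assumption.

```python
from enum import Enum


class JsonKind(Enum):
    STRING = "string value"          # the token was quoted in the JSON text
    NON_STRING = "non-string value"  # true, false, null, or a number


# (target_type, JSON kind) -> is the value valid for that target type?
VALIDITY = {
    ("bool", JsonKind.NON_STRING): True,   # { "number": true }
    ("bool", JsonKind.STRING): False,      # { "number": "true" } -> null
    ("string", JsonKind.STRING): True,     # assumption, not stated in the thread
}


def apply_validity(target_type, text, kind):
    """Return the text if the pair is valid, otherwise None (null)."""
    return text if VALIDITY[(target_type, kind)] else None


print(apply_validity("bool", "true", JsonKind.NON_STRING))  # true
print(apply_validity("bool", "true", JsonKind.STRING))      # None
```

Keeping the rules in a table rather than in the parser would let a caller such as the Spark plugin supply its own notion of validity per target type.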

@revans2
Contributor

revans2 commented Sep 8, 2022

Yes keep_quotes would do what we want.
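
As a rough sketch of how a caller might post-process a column that was read with quotes retained: quoted values are nulled when the requested type is not a string, and the quotes are stripped again for real string columns. Plain Python lists stand in for string columns here, the sample values and helper names are made up for illustration, and nothing below calls cuDF itself.

```python
# Stand-ins for string columns returned by a reader that keeps the quote
# characters on JSON strings. The sample values are hypothetical.
number_col = ['true', '"true"', 'false']    # requested type: boolean
name_col = ['"alice"', '"bob"', '"carol"']  # requested type: string


def is_quoted(value):
    """True if the value still carries its surrounding JSON quotes."""
    return (value is not None and len(value) >= 2
            and value[0] == '"' and value[-1] == '"')


def null_quoted(column):
    """Non-string targets: a quoted value is invalid, so replace it with None."""
    return [None if is_quoted(v) else v for v in column]


def strip_quotes(column):
    """String targets: drop the surrounding quotes (escapes are left untouched)."""
    return [v[1:-1] if is_quoted(v) else v for v in column]


def to_bool(column):
    """Cast 'true'/'false' to booleans; anything else, including None, is null."""
    lookup = {"true": True, "false": False}
    return [lookup.get(v) for v in column]


bools = to_bool(null_quoted(number_col))
names = strip_quotes(name_col)
print(bools)  # [True, None, False]
print(names)  # ['alice', 'bob', 'carol']
```

The extra `strip_quotes` pass is the additional transformation mentioned earlier in the thread as the cost of keeping quotes on string columns.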

@vuule
Contributor

vuule commented Sep 27, 2022

@revans2 the keep_quotes option is now merged. Can we close this issue? We can always reopen if the implementation is not sufficient.

@revans2
Contributor

revans2 commented Sep 30, 2022

Sure

@revans2 closed this as completed on Sep 30, 2022