[FEA]JSON reader: support "allowBackslashEscapingAnyCharacter" #4616

Open
Tracked by #9
GaryShen2008 opened this issue Jan 24, 2022 · 4 comments
Labels
task Work required that improves the product but is not user facing

Comments

Collaborator

GaryShen2008 commented Jan 24, 2022

From the Spark JSON options:
allowBackslashEscapingAnyCharacter
Allows accepting quoting of all character using backslash quoting mechanism.

Typically, only a few characters can be escaped under the JSON standard.
CUDF supports only \", \\, \t, \r, and \b. I am not sure whether JSON defines others, or how CUDF behaves compared with Spark when it encounters an escape it does not support.

We need to figure out the set of characters which should be supported.

Because of how CUDA works, it is very rare for CUDF to return an error when it sees an escape it does not understand. We are likely just going to have to document the differences.
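For reference on what "only a few chars in the JSON standard" means in practice, here is a minimal sketch using Python's standard `json` module as a stand-in for a strict parser. It is not Spark or CUDF code; the `STANDARD` list and the `strict_accepts` helper are names made up for this illustration.

```python
import json

# The escapes defined by the JSON standard (RFC 8259). Each literal is a
# complete JSON string whose body is a single escape sequence.
STANDARD = [r'"\""', r'"\\"', r'"\/"', r'"\b"', r'"\f"',
            r'"\n"', r'"\r"', r'"\t"', r'"\u0041"']


def strict_accepts(literal: str) -> bool:
    """Return True if Python's strict JSON parser accepts the literal."""
    try:
        json.loads(literal)
        return True
    except json.JSONDecodeError:
        return False
```

With this helper, every entry in `STANDARD` is accepted, while `strict_accepts(r'"\$10"')` is false because `\$` is not a standard escape. That rejection is the behavior allowBackslashEscapingAnyCharacter relaxes on the Spark side.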

GaryShen2008 added the "feature request" and "? - Needs Triage" labels on Jan 24, 2022
sameerz added the "task" label and removed the "feature request" and "? - Needs Triage" labels on Jan 25, 2022
nartal1 self-assigned this on Jan 27, 2022
Collaborator

nartal1 commented Jan 27, 2022

Spark supports these escape sequences: \", \\, \/, \b, \f, \n, \r, \t, \uXXXX.
CUDF supports these for now: \", \\, \t, \r, and \b.
Beyond these, when the allowBackslashEscapingAnyCharacter option is set to true in Spark, any character can be escaped. For example, \$10 results in $10.

Currently, CUDF doesn't throw an error for escape sequences it does not support; the output is the same as the input.
Example: \nabc results in \nabc.
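The three behaviors described in this comment can be modeled side by side. The following is only a pure-Python sketch of the semantics as stated here, not the Spark (Jackson) or CUDF implementation; `decode`, `SPARK_ESCAPES`, `CUDF_ESCAPES_2022`, and the `on_unknown` modes are hypothetical names, and surrogate pairs in \uXXXX are ignored for brevity.

```python
# Single-character escapes from the JSON standard and what they decode to.
SIMPLE = {'"': '"', '\\': '\\', '/': '/', 'b': '\b',
          'f': '\f', 'n': '\n', 'r': '\r', 't': '\t'}

# Assumed sets, taken from the lists in this comment.
SPARK_ESCAPES = set(SIMPLE) | {'u'}
CUDF_ESCAPES_2022 = {'"', '\\', 't', 'r', 'b'}


def decode(body, supported, on_unknown="error"):
    """Decode backslash escapes in a JSON string body (quotes already removed).

    on_unknown controls escapes outside `supported`:
      "error"          -> raise ValueError (strict JSON)
      "drop_backslash" -> keep the character, drop the backslash
      "keep"           -> leave the sequence untouched
    """
    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch != '\\' or i + 1 == len(body):
            out.append(ch)
            i += 1
            continue
        nxt = body[i + 1]
        if nxt == 'u' and 'u' in supported:
            # \uXXXX: four hex digits naming one code unit.
            out.append(chr(int(body[i + 2:i + 6], 16)))
            i += 6
        elif nxt in supported and nxt in SIMPLE:
            out.append(SIMPLE[nxt])
            i += 2
        elif on_unknown == "drop_backslash":
            out.append(nxt)
            i += 2
        elif on_unknown == "keep":
            out.append('\\' + nxt)
            i += 2
        else:
            raise ValueError(f"unsupported escape: \\{nxt}")
    return ''.join(out)
```

Under these assumptions, `decode(r'\$10', SPARK_ESCAPES, "drop_backslash")` gives `$10`, matching the Spark example, and `decode(r'\nabc', CUDF_ESCAPES_2022, "keep")` returns the input unchanged, matching the CUDF example.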

Collaborator

nartal1 commented Feb 3, 2022

I think falling back to the CPU when this option is set is the right approach for now.
allowBackslashEscapingAnyCharacter is a boolean, so we cannot partially support it for only the escape sequences that CUDF handles.
We can enable it on the GPU only once CUDF supports the remaining standard escapes (\n, \uXXXX, etc.) as well as backslash-escaping of any other character.
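The fallback rule proposed here can be written down as a small decision function. This is a hypothetical sketch in Python, not the plugin's actual code; `gpu_decision`, `gpu_escapes`, and `gpu_unescapes_any` are invented for illustration, and the option is assumed to arrive as a string defaulting to "false".

```python
STANDARD_ESCAPES = frozenset('"\\/bfnrtu')  # ", \, /, b, f, n, r, t, u


def gpu_decision(options, gpu_escapes, gpu_unescapes_any=False):
    """Return (run_on_gpu, reason) for a JSON read with the given options.

    options:           reader options as a dict of strings (assumed shape)
    gpu_escapes:       escape letters the GPU reader is assumed to decode
    gpu_unescapes_any: whether the GPU reader can drop the backslash
                       in front of an arbitrary character
    """
    raw = options.get("allowBackslashEscapingAnyCharacter", "false")
    allow_any = str(raw).lower() == "true"
    if not allow_any:
        # This check covers only the one option; other differences in
        # escape handling are outside its scope.
        return True, "option not set"
    missing = STANDARD_ESCAPES - set(gpu_escapes)
    if missing or not gpu_unescapes_any:
        # The option is all-or-nothing, so partial support means CPU.
        return False, "allowBackslashEscapingAnyCharacter needs full escape support"
    return True, "GPU reader covers all escapes"
```

With the CUDF escape set listed earlier in this thread, `gpu_decision({"allowBackslashEscapingAnyCharacter": "true"}, set('"\\trb'))` falls back to the CPU, while an empty options dict stays on the GPU.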

nartal1 removed their assignment on Feb 3, 2022
Collaborator

revans2 commented Mar 14, 2024

Note that this is related to #10596

Collaborator

revans2 commented Mar 14, 2024

Also, I just tested, and \uXXXX appears to work properly out of the box with CUDF.
