[FEA] Add ability to read JSON structs as strings, or specify struct schema #14830

Closed
Tracked by #9458
andygrove opened this issue Jan 22, 2024 · 2 comments
Labels
feature request New feature or request Spark Functionality that helps Spark RAPIDS

Comments

@andygrove
Contributor

andygrove commented Jan 22, 2024

Is your feature request related to a problem? Please describe.
In the Spark RAPIDS plugin, we typically want to read JSON primitives as strings and then cast to the required type in the plugin, to ensure compatibility with Spark.

This works for top-level primitives in a JSON file. However, there doesn't seem to be a way to specify the data types of fields within a struct.

Here is an example input file where I would like to read fields b and c as strings.

{ "a": { "b": 123 }, "c": 321 }
{ "a": { "b": 456 }, "c": 654 }

Here is some Java code for reading this file.

    JSONOptions opts = JSONOptions.builder()
            .withLines(true)
            .build();

    Schema schema = Schema.builder()
            .column(DType.STRUCT, "a")
            .column(DType.STRING, "c")
            .build();

    Table table = Table.readJSON(schema, opts, TEST_NESTED_JSON);

    ColumnVector a = table.getColumn(0);
    ColumnView b = a.getChildColumnView(0);
    ColumnVector c = table.getColumn(1);

    System.out.println("a = " + a.type);
    System.out.println("b = " + b.type);
    System.out.println("c = " + c.type);

The output is:

a = STRUCT
b = INT64
c = STRING

cuDF has inferred the type of column b, and there seems to be no way to specify that it should be read as a string instead of INT64.

Describe the solution you'd like
There are two possible solutions:

  • Add the ability to specify struct types fully.
  • Add an option for reading structs as unparsed strings and then parse the JSON string in the plugin. This would be similar to the recently added support for reading mixed types as string. The API for this could be one of the following:
    • Specify the type STRING rather than STRUCT for the column
    • Add a new JSON reader option structs_as_strings

Describe alternatives you've considered

Additional context

@karthikeyann
Contributor

karthikeyann commented Jan 23, 2024

Specifying data types for nested columns is already available in libcudf's json_reader_options. In JNI, it is exposed as an array of dtypes (jintArray j_types).
The interface should be updated to allow nested specification of columns (cudf::io::schema_element).
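
To make this suggestion concrete, here is a minimal, self-contained sketch of how a nested schema could be described on the Java side and flattened depth-first into parallel lists, which are simple to hand across a JNI boundary. Everything in it is an assumption for illustration: `NestedSchemaSketch`, `Field`, `Flattened`, and `flatten` are hypothetical names, not the cuDF Java API, and plain strings stand in for DType ids.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, NOT the cuDF Java API: a nested schema tree that is
// flattened in pre-order into parallel lists (name, type, child count).
final class NestedSchemaSketch {

    // One schema node. Leaf columns have no children; STRUCT columns list their fields.
    static final class Field {
        final String name;
        final String type; // stand-in for a DType id such as STRING or STRUCT
        final List<Field> children;

        Field(String name, String type, Field... children) {
            this.name = name;
            this.type = type;
            this.children = Arrays.asList(children);
        }
    }

    // Flattened representation: index i in each list describes the same node.
    static final class Flattened {
        final List<String> names = new ArrayList<>();
        final List<String> types = new ArrayList<>();
        final List<Integer> numChildren = new ArrayList<>();
    }

    // Flattens the top-level columns, visiting each parent before its children.
    static Flattened flatten(Field... topLevel) {
        Flattened out = new Flattened();
        for (Field f : topLevel) {
            visit(f, out);
        }
        return out;
    }

    private static void visit(Field f, Flattened out) {
        out.names.add(f.name);
        out.types.add(f.type);
        out.numChildren.add(f.children.size());
        for (Field child : f.children) {
            visit(child, out);
        }
    }

    public static void main(String[] args) {
        // Schema for the example file: a is a struct whose field b is read as a string.
        Flattened flat = flatten(
                new Field("a", "STRUCT", new Field("b", "STRING")),
                new Field("c", "STRING"));
        System.out.println(flat.names);       // [a, b, c]
        System.out.println(flat.types);       // [STRUCT, STRING, STRING]
        System.out.println(flat.numChildren); // [1, 0, 0]
    }
}
```

The child counts let the receiving side rebuild the tree from the flat lists, so the native code could in principle map each node onto a nested cudf::io::schema_element. How the real binding encodes this is not stated in this thread.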

@revans2
Copy link
Contributor

revans2 commented Feb 8, 2024

Not sure why the linking didn't work, but #14954 fixed this issue.

@revans2 revans2 closed this as completed Feb 8, 2024
@bdice bdice removed the Needs Triage Need team to review and classify label Mar 4, 2024
4 participants