
Improve support for reading CSV and JSON floating-point values #4637

Merged (13 commits) Feb 2, 2022

Conversation

@andygrove (Contributor) commented Jan 26, 2022

Signed-off-by: Andy Grove andygrove@nvidia.com

Closes #124 and partially addresses #125, #126, and #1986 (only for floating-point types)

This PR affects both CSV and JSON readers and reads floating-point columns as strings and then casts them to a floating-point type. This means that we can now support special values such as NaN and Inf more consistently with Spark.
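For illustration, here is roughly what the read-as-string-then-cast idea looks like in plain Spark terms. This is a minimal sketch under assumptions, not the plugin's actual GPU code: the file path, column name, and object name are invented, and the plugin performs the equivalent cast on the GPU rather than through the CPU expressions shown here.

```scala
// Minimal sketch: ask the CSV reader for a string column, then cast it afterwards,
// so special literals such as "NaN" and "Inf" go through cast semantics rather than
// the reader's own number parsing. (Illustrative only; not code from this PR.)
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}

object ReadFloatsAsStringsThenCast {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("float-as-string").getOrCreate()

    // Declare the column as a string so the reader does not parse the number itself.
    val stringSchema = StructType(Seq(StructField("value", StringType)))
    val raw = spark.read.schema(stringSchema).csv("nan_and_inf.csv") // hypothetical path

    // Cast afterwards; Spark's string-to-double cast accepts "NaN", "Inf", "-Inf", etc.
    val parsed = raw.select(col("value").cast(DoubleType).alias("value"))
    parsed.show()

    spark.stop()
  }
}
```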

If this approach is acceptable then I will follow up with PRs to do this for other data types.

Note that this isn't perfect and there are still some follow-on issues (some existing and some new); out of these, issue #4647 bothers me the most.

Status:

  • Implement basic functionality
  • Enable related tests that were previously XFAIL
  • Fix resource leaks
  • Add new tests for better coverage of edge cases
  • Add support (and tests) for JSON option allowNonNumericNumbers (see the sketch after this list)
  • Add tests with ANSI mode enabled for JSON and CSV with invalid inputs
  • Update compatibility guide
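
As a rough illustration of the allowNonNumericNumbers item above, the following sketch uses the plain Spark JSON reader rather than the plugin or its Python integration tests; the sample records and schema are invented. When the option is enabled, unquoted NaN and Infinity tokens are accepted as floating-point values.

```scala
// Hedged sketch of the JSON option in plain Spark (standalone example, not taken from this PR).
import org.apache.spark.sql.{Encoders, SparkSession}

object AllowNonNumericNumbersSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("json-nan-inf").getOrCreate()

    // Unquoted NaN/Infinity are only legal JSON input when allowNonNumericNumbers is enabled.
    val records = Seq("""{"v": NaN}""", """{"v": Infinity}""", """{"v": -Infinity}""", """{"v": 1.5}""")
    val ds = spark.createDataset(records)(Encoders.STRING)

    val df = spark.read
      .option("allowNonNumericNumbers", "true")
      .schema("v DOUBLE")
      .json(ds)

    df.show() // expected values: NaN, Infinity, -Infinity, 1.5

    spark.stop()
  }
}
```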

@revans2 (Collaborator) left a comment

This is a really great step. Ultimately I would like to use this same trick to support other types, like decimal and date/time. But I am not convinced that we want to make it common everywhere, because JSON has some odd configs/requirements for how it parses some numbers, which will be different from CSV and from casting. But these are corner cases and it might not matter that much anyway.

On a side note, @GaryShen2008, we need to make sure that we take this into account when we are gathering requirements for JSON parsing in cuDF. We already know that they are not going to want to provide all of the parsing options that Spark supports, so we probably want to instead work on supporting the parsing ourselves as much as we can.

Signed-off-by: Andy Grove <andygrove@nvidia.com>
@revans2 (Collaborator) left a comment

Sorry, we should also update the compatibility doc so it explains that the parsing code has the same limitations as casting.

https://github.com/NVIDIA/spark-rapids/blob/branch-22.04/docs/compatibility.md#csv-floating-point

But reading it again, I think we want to do this for almost all types. Even integers, to get the proper overflow checks.

Also, can you add a test for ANSI mode with invalid float values? I want to be sure that we are doing the same thing as Spark: either both throw, or neither does.
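
To make the "either both throw or not" expectation concrete, here is a minimal sketch of the CPU-side reference behavior; it is an invented standalone example, not a test from this PR. With spark.sql.ansi.enabled set to true, casting an invalid string to a float fails at runtime, and the GPU CSV/JSON read of the same data would be expected to fail in the same way.

```scala
// Minimal sketch of the ANSI-mode reference behavior on the CPU (illustrative only).
import org.apache.spark.sql.{Encoders, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.FloatType

object AnsiInvalidFloatSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("ansi-float").getOrCreate()
    spark.conf.set("spark.sql.ansi.enabled", "true")

    val df = spark.createDataset(Seq("1.5", "not-a-number"))(Encoders.STRING).toDF("v")

    // With ANSI enabled this throws on the invalid row; with ANSI disabled it returns null.
    // The GPU read/cast of the same data should behave the same way in both modes.
    try {
      df.select(col("v").cast(FloatType)).collect()
    } catch {
      case e: Exception => println(s"cast failed as expected: ${e.getMessage}")
    }

    spark.stop()
  }
}
```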

Signed-off-by: Andy Grove <andygrove@nvidia.com>
docs/compatibility.md: review comment (outdated, resolved)
integration_tests/src/main/python/csv_test.py: review comment (resolved)
@andygrove changed the title from "WIP: Improve support for reading CSV and JSON floating-point values" to "Improve support for reading CSV and JSON floating-point values" on Jan 28, 2022
@andygrove marked this pull request as ready for review on January 28, 2022 22:37
Review thread on the test values -Inf, INF, and -INF:

@andygrove (Contributor, Author): NaN and Inf values are already covered in nan_and_inf.csv

@revans2 previously approved these changes Jan 31, 2022
@andygrove (Contributor, Author): build

@jlowe previously approved these changes Jan 31, 2022
@andygrove (Contributor, Author): build

@sameerz added the "task" label (Work required that improves the product but is not user facing) Jan 31, 2022
@andygrove (Contributor, Author):
There are test failures in HashAggregatesSuite when running against Spark 3.1.1 - I am investigating.

@andygrove (Contributor, Author): build

@andygrove (Contributor, Author): build

Labels: task (Work required that improves the product but is not user facing)

Successfully merging this pull request may close these issues:
[BUG] CSV/JSON Parsing some float values results in overflow

4 participants