Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add fuzzing test for JSON reader #5001

Merged
merged 24 commits into from
Apr 18, 2022
Merged

Conversation

HaoYang670
Copy link
Collaborator

@HaoYang670 HaoYang670 commented Mar 22, 2022

Add fuzz tests for JSON reading.
close #4138

This PR contains 3 parts:

  1. Schema generator: randomly generate a JSON schema described in PySpark's DataType
  2. JSON generator: randomly generator a JSON string based on the given schema
  3. test_json_read_fuzz: test the behavior of CPU and GPU on reading a random json file (marked Xfail, so it will not impact the CI job)

Data Type supported so far:

  1. primitive types: int, long, float, double, string
  2. nested types: struct and array

Plan to support in the future:

  1. primitive types: date, timestamp
  2. nested types: map

Signed-off-by: remzi <13716567376yh@gmail.com>
Signed-off-by: remzi <13716567376yh@gmail.com>
Signed-off-by: remzi <13716567376yh@gmail.com>
Signed-off-by: remzi <13716567376yh@gmail.com>
Signed-off-by: remzi <13716567376yh@gmail.com>
Signed-off-by: remzi <13716567376yh@gmail.com>
Signed-off-by: remzi <13716567376yh@gmail.com>
Signed-off-by: remzi <13716567376yh@gmail.com>
Signed-off-by: remzi <13716567376yh@gmail.com>
Signed-off-by: remzi <13716567376yh@gmail.com>
Signed-off-by: remzi <13716567376yh@gmail.com>
Signed-off-by: remzi <13716567376yh@gmail.com>
@HaoYang670
Copy link
Collaborator Author

build

'spark.rapids.sql.format.json.read.enabled': 'true'}

@approximate_float
@pytest.mark.xfail(reason = "fuzz test")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should there be at least a second test that is always expected to succeed?

Naively I am also not sure why this test would be expected to fail. I guess the other point to make, would be to add a github issue link on what is failing. Perhaps that's well known to others, but it isn't obvious from this change to me.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The aim of this fuzzy test is to help find more corner cases in JSON reading. Link: #4821

Initially, my thought was that it would spoil our CI tests if a corner case was found. So I marked it as xfail. However, after thinking carefully about your feedback, I remove the xfail. Maybe we should let the CI fail when we find a new corner case, although the possibility is low.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@sameerz sameerz added the test Only impacts tests label Mar 22, 2022
@sameerz sameerz added this to the Mar 21 - Apr 1 milestone Mar 22, 2022
Signed-off-by: remzi <13716567376yh@gmail.com>
@HaoYang670
Copy link
Collaborator Author

build

_name_gen = StringGen(pattern= "[a-z]{1,30}",nullable= False)
_name_gen.start(random)
_string_gen = StringGen(pattern= "[a-z]{1,30}",nullable= False)
_string_gen.start(random)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these tests deterministic, using a fixed seed?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Signed-off-by: remzi <13716567376yh@gmail.com>
@HaoYang670
Copy link
Collaborator Author

build

1 similar comment
@HaoYang670
Copy link
Collaborator Author

build

Signed-off-by: remzi <13716567376yh@gmail.com>
@HaoYang670
Copy link
Collaborator Author

build

1 similar comment
@HaoYang670
Copy link
Collaborator Author

build

@pxLi
Copy link
Collaborator

pxLi commented Mar 25, 2022

please check the failed log before re-trigger,

Error: 3-25T06:40:13.649Z] [ERROR] Failed to execute goal org.apache.rat:apache-rat-plugin:0.13:check (default) on project rapids-4-spark-parent: Too many files with unapproved license: 1 See RAT report in: /home/jenkins/agent/workspace/jenkins-rapids_premerge-github-4260/target/spark24X/rat.txt -> [Help 1]

exclude related json filed at https://github.com/NVIDIA/spark-rapids/blob/branch-22.06/pom.xml#L1154-L1178 to pass the license test

Signed-off-by: remzi <13716567376yh@gmail.com>
@HaoYang670
Copy link
Collaborator Author

build

Signed-off-by: remzi <13716567376yh@gmail.com>
@HaoYang670
Copy link
Collaborator Author

build


@approximate_float
@allow_non_gpu('FileSourceScanExec')
@pytest.mark.xfail(reason="fuzz test may randomly fail")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be clear fuzz testing is intended to find issues and should not be running by default as a part of integration tests. It should be something that we can enable and run for a configurable number of iterations to find problems. Once we find the problem we need to have a way to debug/reproduce this issue. Then we can file a follow on issue/bug to fix it.

I would much rather have us skip this test by default unless a flag is set to enable fuzz testing, preferably with a number of iterations. With xfail we want to get to the point where if a test passes when we expect it to fail then it will be marked as a failure, xfail_strict=true. Because this is totally random I really don't want to mark it as xfail.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Fuzz tests are skipped by defualt.
We can enable fuzz tests and store the tested data by running

> bash run_pyspark_from_build.sh -k test_json_read_fuzz --fuzz_test --debug_tmp_path

@allow_non_gpu('FileSourceScanExec')
@pytest.mark.xfail(reason="fuzz test may randomly fail")
def test_json_read_fuzz(spark_tmp_path):
depth = random.randint(1, 5)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do we reproduce/debug an error when it happens? Everything is random and based off of a regular random. Typically in python we would want to create an instance of Random with a seed, and then pass it everywhere so the results can be reproduced. We do this in data_gen already.

I am fine if we randomly generate a seed based off of a timestamp or something like that. But at a minimum we need a way to log what that seed was so if/when it does fail we can debug things. With xdist this is not super simple, so I would suggest that we catch all Exceptions and then wrap them with something new that includes the seed that triggered the problem.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add the flag --debug_tmp_path can save the test data

else:
yield chr(random.randint(0x61, 0x66))

def gen_number():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spark also supports special cases such as +Infinity and NaN for floating-point numbers, even though these are not valid in the JSON spec.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @andygrove . Really nice reminder! I will add these and other types supported by Spark in following PRs.

Signed-off-by: remzi <13716567376yh@gmail.com>
Signed-off-by: remzi <13716567376yh@gmail.com>
Signed-off-by: remzi <13716567376yh@gmail.com>
@HaoYang670
Copy link
Collaborator Author

build

### Enabling fuzz tests

Fuzz tests are intended to find more corner cases in testing. We disable them by default because they might randomly fail.
The tests can be enabled by appending th option `--fuzz_test` to the command.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: You have "th" instead of "the"

The tests can be enabled by appending the option --fuzz_test to the command.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!


* `--fuzz_test` (enable the fuzz tests when provided, and remove this option if you want to disable the tests)

To reproduce an error appearing in the fuzz tests, you also need to add the flag `--debug_tmp_path` to save the test data.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we combine these two together? So if you pass in --fuzz_test_debug_path or something like that the tests are enabled? To be clear I am fine with keeping the debug_tmp_path as it is. If this is really hard you don't need to do it.


# A JSON generator built based on the context free grammar from https://www.json.org/json-en.html

from cgi import test
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need this? We are not doing CGI are we?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, the IDE added this line automatically. Have removed!

fix spelling mistake

Signed-off-by: remzi <13716567376yh@gmail.com>
@sameerz sameerz modified the milestones: Mar 21 - Apr 1, Apr 4 - Apr 15 Apr 4, 2022
Co-authored-by: Niranjan Artal <50492963+nartal1@users.noreply.github.com>
Copy link
Collaborator

@nartal1 nartal1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM but would be good to get review from others who had reviewed earlier.

@HaoYang670
Copy link
Collaborator Author

build

@sameerz
Copy link
Collaborator

sameerz commented Apr 11, 2022

Does this close issue #4138 ?

@HaoYang670
Copy link
Collaborator Author

Does this close issue #4138 ?

Yes. Have updated.

@HaoYang670
Copy link
Collaborator Author

Hi @revans2. Do you think we could merge this PR or not?

@revans2
Copy link
Collaborator

revans2 commented Apr 18, 2022

@HaoYang670 I am very busy with other things. Please don't let me block this from going in. If others have approved it then we can merge it in and if there are issues we can iterate on them as we find them.

@andygrove andygrove merged commit 79e9793 into NVIDIA:branch-22.06 Apr 18, 2022
@HaoYang670 HaoYang670 deleted the json_fuzz branch April 19, 2022 01:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
test Only impacts tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] JSON: fuzz testing
7 participants