Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix test_unsupported_fallback_regexp_replace #10093

Merged
Merged
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion integration_tests/src/main/python/regexp_test.py
Original file line number Diff line number Diff line change
Expand Up @@ -30,7 +30,7 @@
_regexp_conf = { 'spark.rapids.sql.regexp.enabled': True }

def mk_str_gen(pattern):
return StringGen(pattern).with_special_case('').with_special_pattern('.{0,10}')
return StringGen(pattern).with_special_case('').with_special_pattern(r'[^\\$]{0,10}')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So backslashes and dollar signs can cause issues for the replacement string in regexp_replace on the CPU, but we fall back to the CPU properly. So the problem is really just with the data that we generated for this one test. Why then are we getting rid of backslashes and dollar signs for mk_str_gen by default instead of targeting it specifically to just the replacement string for regexp_replace?

i.e. on line 942 we have the following instead.

gen = StringGen('[abcdef]{0,2}').with_special_case('').withSpecialPattern(r'[^\\$]{0,10}')

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When mk_str_gen is used to generate regular expressions, the .with_special_pattern('.{0,10}') in it sometimes generates valid or invalid regexps and fails some tests.

For this issue, if adding some special cases (which can be generated by '.{0,10}') in it:

def mk_str_gen(pattern):
    return StringGen(pattern).with_special_case('').with_special_pattern('.{0,10}').with_special_case(('ab$cd', 100)).with_special_case(('ab\$cd', 100))

and run:

./integration_tests/run_pyspark_from_build.sh -k regexp

we will get:

===== 45 failed, 25 passed, 9 warnings in 67.17s (0:01:07) ======

So I think we can leave it by default in regexp_test for now, to at least avoid some random IT failures.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And I think mk_str_gen is not very ideal when generating regex data, we might need a better way to generate random literal strings / valid regexps/ invalid regexps as test data.
filed #10098.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @thirtiseven

There could be a confusion here. Bobby's comment refers to the local var definition inside the test function test_unsupported_fallback_regexp_replace

the diff should be something like

diff --git a/integration_tests/src/main/python/regexp_test.py b/integration_tests/src/main/python/regexp_test.py
index 018736f9c9..f015d05502 100644
--- a/integration_tests/src/main/python/regexp_test.py
+++ b/integration_tests/src/main/python/regexp_test.py
@@ -939,7 +939,7 @@ def test_unsupported_fallback_regexp_extract_all():
 
 @allow_non_gpu('ProjectExec', 'RegExpReplace')
 def test_unsupported_fallback_regexp_replace():
-    gen = mk_str_gen('[abcdef]{0,2}')
+    gen = StringGen('[abcdef]{0,2}').with_special_case('').with_special_pattern(r'[^\\$]{0,10}')
     regex_gen = StringGen(r'\[a-z\]\+')
     def assert_gpu_did_fallback(sql_text):
         assert_gpu_fallback_collect(lambda spark:

and I see 0 failures with this change for -k regexp

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @gerashegalov

Making the datagen inside the test function makes sense to me and will fix the issue with no failures. I'm trying to explain why I think it might be a general issue when generating regexps with mk_str_gen.

and I see 0 failures with this change for -k regexp.

My step is modifying the mk_str_gen inside the regexp_test.py file to add more special cases and run the -k regexp.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But we don't generate regexps with mk_str_gen. Looking through regexp_test.py, every regexp I see is hard coded, except for test_unsupported_fallback_regexp_extract, test_unsupported_fallback_regexp_extract_all, and test_unsupported_fallback_regexp_replace which do not use mk_str_gen for the regexp. I don't see any gen_scalar* in there that would allow us to make a random scalar regular expression either. I think we should probably update those StringGens used to create the regular expressions so that they don't have any special cases in them. We don't need them to test what we are testing.

Copy link
Collaborator Author

@thirtiseven thirtiseven Dec 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry my bad, You are right, we don't generate regexps with mk_str_gen.

I tested with with_special_pattern in a wrong way.

Closed the issue and updated this fix inside the case.


def test_split_re_negative_limit():
data_gen = mk_str_gen('([bf]o{0,2}:){1,7}') \
Expand Down
Loading