Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize Regex match check #3779

Merged
merged 5 commits into from
Apr 2, 2024
Merged

Optimize Regex match check #3779

merged 5 commits into from
Apr 2, 2024

Conversation

HalidOdat
Copy link
Member

No description provided.

@HalidOdat HalidOdat added the performance Performance related changes and issues label Apr 2, 2024
@jedel1043
Copy link
Member

@raskad This is a pretty big optimization because it removes O(n^2) complexity from our match implementation, but we are not sure how to adapt this to be able to work on unicode matching. Can you give this a look?

@jedel1043 jedel1043 requested a review from raskad April 2, 2024 04:09
Copy link

github-actions bot commented Apr 2, 2024

Test262 conformance changes

Test result main count PR count difference
Total 50,268 50,268 0
Passed 42,773 42,773 0
Ignored 1,391 1,391 0
Failed 6,104 6,104 0
Panics 18 18 0
Conformance 85.09% 85.09% 0.00%

@HalidOdat
Copy link
Member Author

Reduced the failing tests from 48 to 4 unicode ones.

@HalidOdat
Copy link
Member Author

After some more debugging I think it's an issue with regresses InputUtf16 iterator, managed to get a minimal testcase. We are trying to match against the lower part of the unicode 𝌆 this should not be possible in unicode mode since it's unicode aware. (In non-unicode mode it matches as expected).

/// 262 test/built-ins/RegExp/prototype/exec/u-lastindex-adv.js
///
/// Test case:
///
/// ```JavaScript
/// assert.sameValue(/\udf06/u.exec('\ud834\udf06'), null);
/// ```
#[test]
fn utf16_correct_unicode_scan() {
    // '𝌆' This is "Tetragram For Centre"
    // See: https://www.compart.com/en/unicode/U+1D306
    const INPUT: &[u16] = &[0xd834, 0xdf06];
    const MATCHER: &[u16] = &[0xdf06];

    let regex = Regex::from_unicode(MATCHER.iter().copied().map(u32::from), Flags::from("u"))
        .expect("valid regex");
    let m = regex.find_from_utf16(INPUT, 0).next();

    println!("{m:#?}");

    assert!(m.is_none());
}

There is a match:

Some(
    Match {
        range: 1..2,
        captures: [],
        named_captures: {},
    },
)

core/engine/src/builtins/regexp/mod.rs Outdated Show resolved Hide resolved
core/engine/src/builtins/regexp/mod.rs Outdated Show resolved Hide resolved
@raskad
Copy link
Member

raskad commented Apr 2, 2024

The tests seem fine with the new regress version so I think the only thing missing is the todo comment

@HalidOdat
Copy link
Member Author

I put the comments this is ready for review/merge :)

@HalidOdat HalidOdat requested a review from a team April 2, 2024 20:13
@HalidOdat HalidOdat added this to the v0.18.1 milestone Apr 2, 2024
Copy link
Member

@raskad raskad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great find and optimization!

@raskad raskad added this pull request to the merge queue Apr 2, 2024
Merged via the queue into main with commit b1f0780 Apr 2, 2024
13 checks passed
@jedel1043 jedel1043 deleted the optimize-regex branch April 2, 2024 22:29
@raskad raskad modified the milestones: v0.18.1, v0.19.0 Jul 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Performance related changes and issues
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants