Blacklist sub-rules of Transforms #211

Open
ghnp5 opened this issue Sep 4, 2020 · 5 comments

@ghnp5
ghnp5 commented Sep 4, 2020

Hello,

Is there a way to disable sub-rules of transforms, without disabling the whole transform?

For example, I don't want to disable all "charEscapeUnescape", but only the sub-rule for "]".

/\[quote\]/

For readability, I don't want to optimize to /\[quote]/.
But for all other unnecessary escapes, I do want the optimization.

Similarly for "charClassToMeta": is there a way to disable only the conversion from [0-9a-z] to [\da-z]?

Thank you!

EDIT - this might be related to #208

@DmitrySoshnikov
Owner

@ghnp5, indeed, it might be a part of #208.

I think the optimize method could accept a perTransformOptions map, so that each individual transform can work with its own options.

For example, the call might look like:

regexpTree.optimize(/\[quote\]/, {
  perTransformOptions: {
    charEscapeUnescape: {
      // Avoid rewriting /\[quote\]/ as /\[quote]/
      excludedChars: /[\[\]]/,
    },

    charClassToMeta: {
      // Avoid rewriting [0-9a-z] as [\da-z]
      excludedClasses: [/[0-9]/],
    },
  },
});

The corresponding transforms would then have to be updated to accept and handle those options. One caveat: such granular checks and exclusions may slow the transforms down.
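
To make the proposal more concrete, here is a minimal sketch of how a transform might honor an excludedChars option. It works on the regex source string rather than on regexp-tree's AST, and unescapeChars and UNESCAPABLE_OUTSIDE_CLASS are hypothetical names invented for illustration. Only the excludedChars option name comes from the proposal above; this is not the library's charEscapeUnescape implementation.

```javascript
// Hypothetical sketch, NOT regexp-tree code: drop a few unnecessary escapes
// outside character classes, unless the escaped char matches excludedChars.
const UNESCAPABLE_OUTSIDE_CLASS = new Set([']', '-']);

function unescapeChars(source, {excludedChars = null} = {}) {
  let out = '';
  let inClass = false;

  for (let i = 0; i < source.length; i++) {
    const ch = source[i];

    if (ch === '\\' && i + 1 < source.length) {
      const next = source[i + 1];
      // The per-transform exclusion is checked once per escaped character.
      const canDrop =
        !inClass &&
        UNESCAPABLE_OUTSIDE_CLASS.has(next) &&
        !(excludedChars && excludedChars.test(next));
      out += canDrop ? next : ch + next;
      i++; // Skip the escaped character as well.
      continue;
    }

    // Track whether we are inside [...] so class contents stay untouched.
    if (ch === '[' && !inClass) inClass = true;
    else if (ch === ']' && inClass) inClass = false;
    out += ch;
  }

  return out;
}

console.log(unescapeChars('\\[quote\\]'));
console.log(unescapeChars('\\[quote\\]', {excludedChars: /[\[\]]/}));
```

The exclusion test runs for every escaped character, which is where the slowdown mentioned above would come from. Passing a regex with the g or y flag as excludedChars would make test() stateful, so this sketch assumes a flagless pattern.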

I may look into this, and would appreciate a PR too if you get to it before I do.

In addition, to unblock yourself faster, you can write your own extra transform which translates \d back to [0-9].
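
As a rough illustration of that workaround, the sketch below rewrites \d back to an explicit 0-9 range. It is a plain string pass rather than a real regexp-tree transform, and digitMetaToRange is a hypothetical helper name. An AST-based transform would be more robust, but this shows the idea without depending on the library's API.

```javascript
// Hypothetical helper, NOT part of regexp-tree: expands \d in a regex source
// string to 0-9 (inside a character class) or [0-9] (outside one).
function digitMetaToRange(source) {
  let out = '';
  let inClass = false;

  for (let i = 0; i < source.length; i++) {
    const ch = source[i];

    if (ch === '\\' && i + 1 < source.length) {
      const next = source[i + 1];
      if (next === 'd') {
        out += inClass ? '0-9' : '[0-9]';
      } else {
        // Any other escape (including an escaped backslash) is copied as-is.
        out += ch + next;
      }
      i++; // Skip the escaped character as well.
      continue;
    }

    if (ch === '[' && !inClass) inClass = true;
    else if (ch === ']' && inClass) inClass = false;
    out += ch;
  }

  return out;
}

console.log(digitMetaToRange('[\\da-z]'));
console.log(digitMetaToRange('\\d+'));
```

It would be applied to the optimizer's output, for example to the source of the resulting RegExp, before constructing the final pattern.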

@b-fett
b-fett commented Sep 5, 2023

Also, the transformation from [0-9] to [\d] is not safe, as they aren't equivalent.

@DmitrySoshnikov
Owner

@b-fett is it Perl-specific or universal? We might need to introduce an --unsafe or --safe parameter which will take care of specific regexp rules.

@b-fett
b-fett commented Sep 7, 2023

There are many cases.

GPT says the following:

In most programming languages, the regular expression \d is equivalent to [0-9] and matches any single digit from 0 to 9. However, the behavior can change when dealing with Unicode characters in some languages. Here's a brief overview:

1. JavaScript: As mentioned earlier, when the Unicode flag (u) is used, \d can match any character that's considered a digit in the Unicode standard, which includes digit characters from other languages. [0-9] will only match ASCII digits.
2. Python: Python's re module has a UNICODE flag. When this flag is set, \d will match any Unicode digit from any script. Without the flag, \d is equivalent to [0-9].
3. Java: In Java, \d matches any digit from any script (not just ASCII), because Java regular expressions are Unicode-based by default. [0-9] will only match ASCII digits.
4. Ruby: Ruby's regular expressions are also Unicode-based by default, so \d will match any Unicode digit from any script. [0-9] will only match ASCII digits.
5. Perl: Perl's behavior is similar to Python's. \d will match any Unicode digit from any script when the use utf8; directive is in effect. Without it, \d is equivalent to [0-9].
6. PHP: PHP's preg functions are Unicode-aware. \d will match any Unicode digit from any script when the u modifier is used. Without the modifier, \d is equivalent to [0-9].

In general, if you're dealing with a programming language or regular expression engine that supports Unicode, and you want to match only ASCII digits, you should use [0-9]. If you want to match any digit character, including digit characters from other languages, you should use \d.
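
Since regexp-tree works on JavaScript regexes, the JavaScript item may be worth checking directly before relying on it. As far as the ECMAScript specification goes, \d is defined as exactly the ten ASCII digits whether or not the u flag is present, which differs from item 1 above. The sketch below (variable names are illustrative) compares \d, [0-9], and the Unicode property escape \p{Nd} on a non-ASCII digit.

```javascript
// U+0663 ARABIC-INDIC DIGIT THREE: a decimal digit outside the ASCII range.
const nonAsciiDigit = '\u0663';

const results = {
  metaWithU: /\d/u.test(nonAsciiDigit),          // false
  rangeWithU: /[0-9]/u.test(nonAsciiDigit),      // false
  metaNoFlag: /\d/.test(nonAsciiDigit),          // false
  propertyEscape: /\p{Nd}/u.test(nonAsciiDigit), // true
};

console.log(results);
```

If this holds in the target runtime, \d and [0-9] match the same characters in JavaScript, and only \p{Nd} (which requires the u flag) covers digits from other scripts. The other engines in the list would need their own checks.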

@DmitrySoshnikov
Owner

@b-fett thanks, I think we can disable this specific transform if the /u flag is set. Feel free to submit a PR.
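
A guard along those lines could be as small as the sketch below. SKIP_WHEN_UNICODE, isTransformAllowed, and selectTransforms are hypothetical names and do not exist in regexp-tree. The sketch only assumes that the optimizer knows the transform names and the flags string of the regex being optimized.

```javascript
// Hypothetical sketch, NOT regexp-tree code: transforms to skip when the
// regex carries the u flag.
const SKIP_WHEN_UNICODE = new Set(['charClassToMeta']);

function isTransformAllowed(transformName, flags = '') {
  return !(flags.includes('u') && SKIP_WHEN_UNICODE.has(transformName));
}

// Filters a list of transform names down to those allowed for these flags.
function selectTransforms(transformNames, flags = '') {
  return transformNames.filter(name => isTransformAllowed(name, flags));
}

console.log(selectTransforms(['charClassToMeta', 'charEscapeUnescape'], 'gu'));
console.log(selectTransforms(['charClassToMeta', 'charEscapeUnescape'], 'g'));
```

Skipping the whole transform is coarser than disabling only the [0-9] to \d rule, so a per-rule option like the one discussed earlier in this thread might still be preferable.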
