improved BDD Unicode table representation in NonBacktracking engine #61142

veanes · 2021-11-03T09:23:37Z

Main updates:

Changed BDD table serialization to use byte[] instead of long[], reducing the space these serialized arrays take. Overall this cuts space requirements by at least half.
Removed the table for \w; it is now derived from the 8 Unicode categories 0, 1, 2, 3, 4, 5, 8, 18.
Made the generation algorithm for the ignore-case BDD tables at least 2x faster, in case it is ever used dynamically. Further optimizations are probably possible; this change comes directly from better use of BDD operations.
Limited CharSetSolver._charPredTable to ASCII only: it is almost never used for non-ASCII characters, yet it took up 128 kB to cover all Unicode chars for essentially no good reason.
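
For reference, the eight category numbers in the second item are the numeric values of `System.Globalization.UnicodeCategory`. The sketch below is only an illustration of that set as a plain per-character predicate; the PR itself combines per-category BDDs instead, and `WordCharSketch` is a hypothetical name that does not exist in the engine.

```csharp
using System.Globalization;

// Hypothetical helper, for illustration only: membership in the eight
// Unicode categories listed above, checked one character at a time.
static class WordCharSketch
{
    public static bool IsWordChar(char c)
    {
        switch (char.GetUnicodeCategory(c))
        {
            case UnicodeCategory.UppercaseLetter:       // 0
            case UnicodeCategory.LowercaseLetter:       // 1
            case UnicodeCategory.TitlecaseLetter:       // 2
            case UnicodeCategory.ModifierLetter:        // 3
            case UnicodeCategory.OtherLetter:           // 4
            case UnicodeCategory.NonSpacingMark:        // 5
            case UnicodeCategory.DecimalDigitNumber:    // 8
            case UnicodeCategory.ConnectorPunctuation:  // 18
                return true;
            default:
                return false;
        }
    }
}
```

With this predicate, `WordCharSketch.IsWordChar('_')` is true because the underscore is connector punctuation, while `'-'` and `' '` fall outside the eight categories.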

ghost · 2021-11-03T09:23:45Z

Tagging subscribers to this area: @eerhardt, @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

Author:	veanes
Assignees:	-
Labels:	`area-System.Text.RegularExpressions`
Milestone:	-

danmoseley · 2021-11-03T13:55:29Z

The test failure is a crash in JSON; it is unrelated to this change.
https://helixre8s23ayyeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-61142-merge-76dd86c53f534a6f9c/System.Text.Json.Tests/1/console.bbc35fce.log?sv=2019-07-07&se=2021-11-23T10%3A07%3A57Z&sr=c&sp=rl&sig=NHWXxAo4YJUDwUFtnUwhMSdlCDFSVB6OdRUeITh7XAc%3D

...s/System.Text.RegularExpressions/src/System/Text/RegularExpressions/Symbolic/Algebras/BDD.cs

danmoseley · 2021-11-03T13:58:39Z

#58828

stephentoub · 2021-11-03T14:01:38Z

...ext.RegularExpressions/src/System/Text/RegularExpressions/Symbolic/Algebras/CharSetSolver.cs

+            BDD bdd = BDD.True;
+            for (int k = 0; k < 16; k++)
+            {
+                bdd = (c & (1 << k)) == 0 ? GetOrCreateBDD(k, BDD.False, bdd) : GetOrCreateBDD(k, bdd, BDD.False);


Are there cheaper ways to build up a BDD? Maybe the caching involved helps, but it seems like otherwise this is going to incrementally build up the BDD by creating 15 intermediate ones that are then thrown away?

veanes · 2021-11-03

There are better ways, I think, but they would involve using e.g. a designated array and a non-object-based representation with its own memory management over that array.
However, this incremental build happens only once per ASCII character, so I think the cost is negligible.
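
For readers following this thread, here is a minimal self-contained sketch of the incremental construction being discussed. `Node` and `NodeCache` are hypothetical stand-ins, not the engine's `BDD` and `CharSetSolver`; only the loop mirrors the quoted diff, and the 128-entry array is an assumption about what an ASCII-only table could look like.

```csharp
#nullable enable
using System.Collections.Generic;

// Hypothetical, simplified stand-in for the engine's BDD class.
sealed class Node
{
    public static readonly Node True = new Node(-1, null, null);
    public static readonly Node False = new Node(-2, null, null);

    public readonly int Ordinal;  // index of the bit this node tests
    public readonly Node? One;    // branch taken when that bit is 1
    public readonly Node? Zero;   // branch taken when that bit is 0

    private Node(int ordinal, Node? one, Node? zero)
    {
        Ordinal = ordinal;
        One = one;
        Zero = zero;
    }

    public static Node Create(int ordinal, Node one, Node zero) => new Node(ordinal, one, zero);

    public bool IsLeaf => One is null;
}

// Hypothetical stand-in for the solver: a node cache plus an ASCII-only table.
sealed class NodeCache
{
    private readonly Dictionary<(int, Node, Node), Node> _nodes =
        new Dictionary<(int, Node, Node), Node>();
    private readonly Node?[] _ascii = new Node?[128];

    public int NodeCount => _nodes.Count;

    // Returns the existing node for (ordinal, one, zero) or creates and caches it.
    public Node GetOrCreate(int ordinal, Node one, Node zero)
    {
        var key = (ordinal, one, zero);
        if (!_nodes.TryGetValue(key, out Node? node))
        {
            node = Node.Create(ordinal, one, zero);
            _nodes.Add(key, node);
        }
        return node;
    }

    // Builds the node chain accepting exactly the character c, one bit at a time.
    public Node CharConstraint(char c)
    {
        if (c < 128 && _ascii[c] is Node cached)
            return cached;

        Node node = Node.True;
        for (int k = 0; k < 16; k++)
        {
            node = (c & (1 << k)) == 0
                ? GetOrCreate(k, Node.False, node)
                : GetOrCreate(k, node, Node.False);
        }

        if (c < 128)
            _ascii[c] = node;
        return node;
    }

    // Follows the branches selected by the bits of c until a leaf is reached.
    public static bool Matches(Node node, char c)
    {
        while (!node.IsLeaf)
            node = (c & (1 << node.Ordinal)) != 0 ? node.One! : node.Zero!;
        return ReferenceEquals(node, Node.True);
    }
}
```

In this sketch the 16 nodes created for a character form the final chain itself, and chains for characters that agree on their low bits share the lower nodes through the cache. For example, building `'a'` and then `'A'` creates 27 nodes rather than 32, because bits 0 through 4 are identical.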

ghost added the community-contribution Indicates that the PR has been added by a community member label Nov 3, 2021

dotnet-issue-labeler bot added area-System.Text.RegularExpressions and removed community-contribution Indicates that the PR has been added by a community member labels Nov 3, 2021

stephentoub reviewed Nov 3, 2021

...s/System.Text.RegularExpressions/src/System/Text/RegularExpressions/Symbolic/Algebras/BDD.cs

danmoseley reviewed Nov 3, 2021

...s/System.Text.RegularExpressions/src/System/Text/RegularExpressions/Symbolic/Algebras/BDD.cs Outdated

stephentoub reviewed Nov 3, 2021

...ext.RegularExpressions/src/System/Text/RegularExpressions/Symbolic/Algebras/CharSetSolver.cs Outdated

stephentoub reviewed Nov 3, 2021

stephentoub approved these changes Nov 3, 2021

danmoseley reviewed Nov 3, 2021

...s/System.Text.RegularExpressions/src/System/Text/RegularExpressions/Symbolic/Algebras/BDD.cs Outdated

veanes force-pushed the updateUnicodeBDDs branch from edd1b5a to f67d79c Compare November 3, 2021 18:56

veanes and others added 5 commits November 4, 2021 13:20

improved BDD Unicode table representation in NonBacktracking engine

22fa30f

remove line

f8263d1

Co-authored-by: Dan Moseley <danmose@microsoft.com>

improved bounds-check elimination

14f18ad

Co-authored-by: Stephen Toub <stoub@microsoft.com>

clearer notation of numbers

5e06758

Co-authored-by: Dan Moseley <danmose@microsoft.com>

fixed typo

a4c1e04

veanes force-pushed the updateUnicodeBDDs branch from 6d154a4 to a4c1e04 Compare November 4, 2021 20:21

veanes merged commit b67f978 into dotnet:main Nov 4, 2021

veanes deleted the updateUnicodeBDDs branch November 4, 2021 23:02

ghost locked as resolved and limited conversation to collaborators Dec 5, 2021
