Skip to content
This repository has been archived by the owner on Jan 23, 2023. It is now read-only.
/ corefx Public archive

Optimize FindFirstCharToEncode for JavaScriptEncoder.Default using Ssse3 intrinsics #42073

Merged
merged 36 commits into from
Nov 3, 2019
Merged

Conversation

gfoidl
Copy link
Member

@gfoidl gfoidl commented Oct 23, 2019

Description

Follow up to #41933

These optimization are applied to JavaScriptEncoder.Default (= DefaultJavaScriptEncoderBasicLatin), as suggested by @ahsonkhan in #41845 (comment).

For the other two mentioned encoders, the bit-mask won't work (in a performant way), as some values above 0x7F are valid. So to handle them properly, lots of overhead have to be introduced, that's why I left them out (in this PR, maybe later when I can think more about it).

Benchmarks

Benchmarks are based on the ones from #41845

Machine info

A linux-vm was used.

BenchmarkDotNet=v0.11.5, OS=ubuntu 18.04
Intel Xeon CPU 2.00GHz, 1 CPU, 2 logical cores and 1 physical core
Test_EscapingBenchmark
|   Method |        TestStringData |      Mean |     Error |    StdDev | Ratio | RatioSD |
|--------- |---------------------- |----------:|----------:|----------:|------:|--------:|
|   Master |            (1, -1, r) |  5.660 ns | 0.1006 ns | 0.0941 ns |  1.00 |    0.00 |
| PR_42073 |            (1, -1, r) |  5.318 ns | 0.0420 ns | 0.0393 ns |  0.94 |    0.02 |
|          |                       |           |           |           |       |         |
|   Master |          (1, 0, <) |  5.072 ns | 0.0081 ns | 0.0071 ns |  1.00 |    0.00 |
| PR_42073 |          (1, 0, <) |  5.113 ns | 0.0132 ns | 0.0123 ns |  1.01 |    0.00 |
|          |                       |           |           |           |       |         |
|   Master | (100,(...)kuwu) [111] | 48.072 ns | 0.4432 ns | 0.3929 ns |  1.00 |    0.00 |
| PR_42073 | (100,(...)kuwu) [111] | 24.007 ns | 0.0502 ns | 0.0419 ns |  0.50 |    0.00 |
|          |                       |           |           |           |       |         |
|   Master |  (15, (...)dabu) [25] | 21.191 ns | 0.0343 ns | 0.0268 ns |  1.00 |    0.00 |
| PR_42073 |  (15, (...)dabu) [25] | 13.933 ns | 0.0442 ns | 0.0391 ns |  0.66 |    0.00 |
|          |                       |           |           |           |       |         |
|   Master |  (16, (...)sqfa) [26] | 14.327 ns | 0.0210 ns | 0.0186 ns |  1.00 |    0.00 |
| PR_42073 |  (16, (...)sqfa) [26] |  8.877 ns | 0.0458 ns | 0.0428 ns |  0.62 |    0.00 |
|          |                       |           |           |           |       |         |
|   Master |  (17, (...)aabr) [27] | 16.068 ns | 0.0447 ns | 0.0373 ns |  1.00 |    0.00 |
| PR_42073 |  (17, (...)aabr) [27] | 10.433 ns | 0.0337 ns | 0.0299 ns |  0.65 |    0.00 |
|          |                       |           |           |           |       |         |
|   Master |           (2, -1, dd) |  6.631 ns | 0.0198 ns | 0.0175 ns |  1.00 |    0.00 |
| PR_42073 |           (2, -1, dd) |  6.622 ns | 0.0323 ns | 0.0302 ns |  1.00 |    0.01 |
|          |                       |           |           |           |       |         |
|   Master |  (32, (...)dfpn) [42] | 20.296 ns | 0.2466 ns | 0.2186 ns |  1.00 |    0.00 |
| PR_42073 |  (32, (...)dfpn) [42] | 10.287 ns | 0.2252 ns | 0.2107 ns |  0.51 |    0.01 |
|          |                       |           |           |           |       |         |
|   Master |         (4, -1, negs) |  9.686 ns | 0.0922 ns | 0.0863 ns |  1.00 |    0.00 |
| PR_42073 |         (4, -1, negs) |  9.128 ns | 0.0523 ns | 0.0463 ns |  0.94 |    0.01 |
|          |                       |           |           |           |       |         |
|   Master |      (7, -1, netggni) | 14.083 ns | 0.0461 ns | 0.0385 ns |  1.00 |    0.00 |
| PR_42073 |      (7, -1, netggni) | 14.761 ns | 0.0426 ns | 0.0356 ns |  1.05 |    0.00 |
|          |                       |           |           |           |       |         |
|   Master |    (7, 0, <etggni) |  5.290 ns | 0.0194 ns | 0.0172 ns |  1.00 |    0.00 |
| PR_42073 |    (7, 0, <etggni) |  5.101 ns | 0.0280 ns | 0.0262 ns |  0.96 |    0.01 |
|          |                       |           |           |           |       |         |
|   Master |    (7, 1, n<tggni) |  6.439 ns | 0.0443 ns | 0.0414 ns |  1.00 |    0.00 |
| PR_42073 |    (7, 1, n<tggni) |  6.427 ns | 0.0451 ns | 0.0400 ns |  1.00 |    0.01 |
|          |                       |           |           |           |       |         |
|   Master |    (7, 2, ne<ggni) |  8.023 ns | 0.0340 ns | 0.0301 ns |  1.00 |    0.00 |
| PR_42073 |    (7, 2, ne<ggni) |  7.958 ns | 0.1538 ns | 0.1363 ns |  0.99 |    0.02 |
|          |                       |           |           |           |       |         |
|   Master |    (7, 3, net<gni) |  9.177 ns | 0.0694 ns | 0.0615 ns |  1.00 |    0.00 |
| PR_42073 |    (7, 3, net<gni) |  8.808 ns | 0.0230 ns | 0.0192 ns |  0.96 |    0.01 |
|          |                       |           |           |           |       |         |
|   Master |    (7, 4, netg<ni) | 10.564 ns | 0.0143 ns | 0.0120 ns |  1.00 |    0.00 |
| PR_42073 |    (7, 4, netg<ni) | 10.590 ns | 0.0143 ns | 0.0127 ns |  1.00 |    0.00 |
|          |                       |           |           |           |       |         |
|   Master |    (7, 5, netgg<i) | 12.577 ns | 0.0185 ns | 0.0154 ns |  1.00 |    0.00 |
| PR_42073 |    (7, 5, netgg<i) | 11.106 ns | 0.0306 ns | 0.0271 ns |  0.88 |    0.00 |
|          |                       |           |           |           |       |         |
|   Master |    (7, 6, netggn<) | 14.034 ns | 0.0385 ns | 0.0341 ns |  1.00 |    0.00 |
| PR_42073 |    (7, 6, netggn<) | 12.251 ns | 0.0163 ns | 0.0145 ns |  0.87 |    0.00 |
|          |                       |           |           |           |       |         |
|   Master |     (8, -1, jgnavpkd) | 10.966 ns | 0.0180 ns | 0.0150 ns |  1.00 |    0.00 |
| PR_42073 |     (8, -1, jgnavpkd) |  9.062 ns | 0.0167 ns | 0.0148 ns |  0.83 |    0.00 |
|          |                       |           |           |           |       |         |
|   Master |   (8, 0, <gnavpkd) | 11.446 ns | 0.0321 ns | 0.0301 ns |  1.00 |    0.00 |
| PR_42073 |   (8, 0, <gnavpkd) |  8.655 ns | 0.0157 ns | 0.0140 ns |  0.76 |    0.00 |
|          |                       |           |           |           |       |         |
|   Master |   (8, 1, j<navpkd) | 10.585 ns | 0.0197 ns | 0.0164 ns |  1.00 |    0.00 |
| PR_42073 |   (8, 1, j<navpkd) |  8.674 ns | 0.0202 ns | 0.0189 ns |  0.82 |    0.00 |
|          |                       |           |           |           |       |         |
|   Master |   (8, 2, jg<avpkd) | 10.568 ns | 0.0133 ns | 0.0118 ns |  1.00 |    0.00 |
| PR_42073 |   (8, 2, jg<avpkd) |  8.663 ns | 0.0235 ns | 0.0220 ns |  0.82 |    0.00 |
|          |                       |           |           |           |       |         |
|   Master |   (8, 3, jgn<vpkd) | 10.583 ns | 0.0154 ns | 0.0144 ns |  1.00 |    0.00 |
| PR_42073 |   (8, 3, jgn<vpkd) |  8.654 ns | 0.0115 ns | 0.0102 ns |  0.82 |    0.00 |
|          |                       |           |           |           |       |         |
|   Master |   (8, 4, jgna<pkd) | 10.986 ns | 0.0200 ns | 0.0187 ns |  1.00 |    0.00 |
| PR_42073 |   (8, 4, jgna<pkd) |  8.651 ns | 0.0104 ns | 0.0097 ns |  0.79 |    0.00 |
|          |                       |           |           |           |       |         |
|   Master |   (8, 5, jgnav<kd) | 10.579 ns | 0.0145 ns | 0.0121 ns |  1.00 |    0.00 |
| PR_42073 |   (8, 5, jgnav<kd) |  8.652 ns | 0.0099 ns | 0.0093 ns |  0.82 |    0.00 |
|          |                       |           |           |           |       |         |
|   Master |   (8, 6, jgnavp<d) | 10.986 ns | 0.0350 ns | 0.0311 ns |  1.00 |    0.00 |
| PR_42073 |   (8, 6, jgnavp<d) |  8.656 ns | 0.0127 ns | 0.0112 ns |  0.79 |    0.00 |
|          |                       |           |           |           |       |         |
|   Master |   (8, 7, jgnavpk<) | 10.963 ns | 0.0139 ns | 0.0116 ns |  1.00 |    0.00 |
| PR_42073 |   (8, 7, jgnavpk<) |  8.660 ns | 0.0096 ns | 0.0085 ns |  0.79 |    0.00 |
|          |                       |           |           |           |       |         |
|   Master |    (9, -1, csvobsdxs) | 11.934 ns | 0.0233 ns | 0.0218 ns |  1.00 |    0.00 |
| PR_42073 |    (9, -1, csvobsdxs) | 10.200 ns | 0.0113 ns | 0.0100 ns |  0.86 |    0.00 |
Perf.Basic

Master

|          Method | Formatted | SkipValidation | DataSize |     Mean |     Error |    StdDev |   Median |      Min |      Max |  Gen 0 | Gen 1 | Gen 2 | Allocated |
|---------------- |---------- |--------------- |--------- |---------:|----------:|----------:|---------:|---------:|---------:|-------:|------:|------:|----------:|
| WriteBasicUtf16 |     False |           True |       10 | 1.391 us | 0.0156 us | 0.0146 us | 1.388 us | 1.367 us | 1.419 us | 0.0056 |     - |     - |     120 B |

PR_42073 (this one)

|          Method | Formatted | SkipValidation | DataSize |     Mean |     Error |    StdDev |   Median |      Min |      Max |  Gen 0 | Gen 1 | Gen 2 | Allocated |
|---------------- |---------- |--------------- |--------- |---------:|----------:|----------:|---------:|---------:|---------:|-------:|------:|------:|----------:|
| WriteBasicUtf16 |     False |           True |       10 | 1.316 us | 0.0034 us | 0.0026 us | 1.315 us | 1.311 us | 1.320 us | 0.0052 |     - |     - |     120 B |

EscapingWriter

Master

|               Method | DataLength | NegativeIndex |      Mean |    Error |   StdDev |    Median |       Min |       Max | Gen 0 | Gen 1 | Gen 2 | Allocated |
|--------------------- |----------- |-------------- |----------:|---------:|---------:|----------:|----------:|----------:|------:|------:|------:|----------:|
| NeedsEscapingCurrent |         32 |            -1 |  89.13 us | 1.886 us | 1.764 us |  88.46 us |  87.40 us |  92.88 us |     - |     - |     - |         - |
| NeedsEscapingCurrent |         32 |            12 | 208.18 us | 0.407 us | 0.340 us | 208.17 us | 207.67 us | 208.94 us |     - |     - |     - |         - |

PR_42073 (this one)

|               Method | DataLength | NegativeIndex |      Mean |    Error |   StdDev |    Median |       Min |       Max | Gen 0 | Gen 1 | Gen 2 | Allocated |
|--------------------- |----------- |-------------- |----------:|---------:|---------:|----------:|----------:|----------:|------:|------:|------:|----------:|
| NeedsEscapingCurrent |         32 |            -1 |  74.71 us | 0.337 us | 0.263 us |  74.66 us |  74.29 us |  75.16 us |     - |     - |     - |         - |
| NeedsEscapingCurrent |         32 |            12 | 203.55 us | 1.189 us | 1.054 us | 203.18 us | 202.71 us | 206.46 us |     - |     - |     - |         - |
WriteJson

Master

|                    Type |            Method |         Mean |        Error |       StdDev |       Median |          Min |          Max |   Gen 0 |   Gen 1 |   Gen 2 | Allocated |
|------------------------ |------------------ |-------------:|-------------:|-------------:|-------------:|-------------:|-------------:|--------:|--------:|--------:|----------:|
|          IndexViewModel | SerializeToString |  42,653.8 ns |    365.08 ns |    323.63 ns |  42,589.4 ns |  42,135.8 ns |  43,221.9 ns |  1.2052 |       - |       - |   25568 B |
|                Location | SerializeToString |   1,795.5 ns |     14.36 ns |     11.99 ns |   1,793.5 ns |   1,780.8 ns |   1,825.0 ns |  0.0290 |       - |       - |     584 B |
|          LoginViewModel | SerializeToString |     802.8 ns |     12.37 ns |     11.57 ns |     803.8 ns |     784.7 ns |     824.8 ns |  0.0161 |       - |       - |     344 B |
| MyEventsListerViewModel | SerializeToString | 921,083.1 ns | 11,547.56 ns | 10,801.59 ns | 918,623.4 ns | 898,673.1 ns | 941,143.2 ns | 44.1176 | 44.1176 | 44.1176 |  324336 B |

PR_42073 (this one)

|                    Type |            Method |         Mean |       Error |      StdDev |       Median |          Min |          Max |   Gen 0 |   Gen 1 |   Gen 2 | Allocated |
|------------------------ |------------------ |-------------:|------------:|------------:|-------------:|-------------:|-------------:|--------:|--------:|--------:|----------:|
|          IndexViewModel | SerializeToString |  39,136.4 ns |   176.95 ns |   147.76 ns |  39,128.9 ns |  38,954.2 ns |  39,407.3 ns |  1.2500 |       - |       - |   25608 B |
|                Location | SerializeToString |   1,787.5 ns |    11.78 ns |    10.44 ns |   1,784.1 ns |   1,775.6 ns |   1,815.9 ns |  0.0286 |       - |       - |     584 B |
|          LoginViewModel | SerializeToString |     754.7 ns |     8.18 ns |     6.83 ns |     752.6 ns |     748.7 ns |     771.6 ns |  0.0152 |       - |       - |     344 B |
| MyEventsListerViewModel | SerializeToString | 892,401.1 ns | 4,694.61 ns | 4,161.65 ns | 891,426.9 ns | 886,616.1 ns | 901,364.4 ns | 45.1389 | 45.1389 | 45.1389 |  324824 B |
SerializingNuget

Master

|            Method |    Mean |    Error |   StdDev |  Median |     Min |     Max | Gen 0 | Gen 1 | Gen 2 | Allocated |
|------------------ |--------:|---------:|---------:|--------:|--------:|--------:|------:|------:|------:|----------:|
| SerializeToString | 1.043 s | 0.0049 s | 0.0044 s | 1.043 s | 1.035 s | 1.050 s |     - |     - |     - |  787.3 MB |

PR_42073 (this one)

|            Method |     Mean |   Error |  StdDev |   Median |      Min |      Max | Gen 0 | Gen 1 | Gen 2 | Allocated |
|------------------ |---------:|--------:|--------:|---------:|---------:|---------:|------:|------:|------:|----------:|
| SerializeToString | 955.2 ms | 5.17 ms | 4.32 ms | 955.1 ms | 946.8 ms | 961.5 ms |     - |     - |     - |  787.3 MB |

ahsonkhan and others added 22 commits October 16, 2019 18:11
…ith CreateEscapingMaskSse2 (renamed from CreateEscapingMask)
Somehow this reference was left there, it's not needed anymore.
@gfoidl
Copy link
Member Author

gfoidl commented Oct 24, 2019

Messed something up while copying from one vm (local build) to the repo. Will fix that build failure.

System\Text\Encodings\Web\Ssse3Helper.cs(101,13): error SA1115: (NETCORE_ENGINEERING_TELEMETRY=Build) The parameter should begin on the line after the previous parameter.
@gfoidl
Copy link
Member Author

gfoidl commented Oct 24, 2019

@ahsonkhan please help me figure out what causes the build-failure

System\Text\Encodings\Web\DefaultJavaScriptEncoderBasicLatin.cs(383,42): error CS0246:
(NETCORE_ENGINEERING_TELEMETRY=Build)
The type or namespace name 'Vector128<>' could not be found (are you missing a using directive or an assembly reference?)

It's based on your latest changes in this project (which builds), a local build runs fine (./build.sh). I'm stuck...

@ahsonkhan
Copy link
Member

PR41845
PR_gfoidl

Could you compare against master and this PR instead, specifically for Test_EscapingBenchmark?
Also, are you explicitly setting JavaScriptEncoder to default in those benchmarks?

@gfoidl
Copy link
Member Author

gfoidl commented Oct 25, 2019

Now I got a trace why the numbers for small inputs are so bad, compared to current master.

(Which I couldn't understand so far, as the asm for the sequential path looks quite similar, this PR produces even less instructions therefore, so a difference of 30% felt definitely too much.)

Test_Escaping-benchmark-code is

// setup...not shown

[Benchmark(Baseline = true)]
public unsafe int Master()
{
    fixed (char* ptr = TestStringData.JsonString)
    {
        return DefaultJavaScriptEncoderBasicLatin_master.s_singleton.FindFirstCharacterToEncode(ptr, TestStringData.JsonString.Length);
    }
}

[Benchmark]
public int PR_42073()
{
    fixed (char* ptr = TestStringData.JsonString)
    {
        return DefaultJavaScriptEncoderBasicLatin.s_singleton.FindFirstCharacterToEncode(ptr, TestStringData.JsonString.Length);
    }
}

where DefaultJavaScriptEncoderBasicLatin_master is the copy from master, so it can be compared in a direct way.

FindFirstCharacterToEncode is attributed with AggressiveInlining (I'm pretty sure we don't want to put this attribute on this large method anyway).

When I look at the Jit-dumps for a simple program like

using System;
using System.Text.Encodings.Web;

namespace ConsoleApp1
{
    class Program
    {
        unsafe static int Main()
        {
            string text = "vabvxgyvhoqqfhea¢";

            fixed (char* ptr = text)
            {
                int idx = JavaScriptEncoder.Default.FindFirstCharacterToEncode(ptr, text.Length);
                return idx;
            }
        }
    }
}

I can see that JavaScriptEncoder.Default gets devirtualized:

Querying runtime about current class of field DefaultJavaScriptEncoderBasicLatin.s_singleton (declared as DefaultJavaScriptEncoderBasicLatin)
Field's current class not available

impDevirtualizeCall: Trying to devirtualize virtual call:
    class for 'this' is DefaultJavaScriptEncoderBasicLatin [final] (attrib 21000010)
    base method is TextEncoder::FindFirstCharacterToEncode
    devirt to DefaultJavaScriptEncoderBasicLatin::FindFirstCharacterToEncode -- final class

but not inlined

[FAILED: target not direct] TextEncoder:FindFirstCharacterToEncode(long,int):int:this

(which is ok, as the code to inline is huge).
It's the same for master and this PR.

For the Test_Escaping-benchmark this means (just showing length := 1)

With AggressiveInlining

|   Method | TestStringData |     Mean |     Error |    StdDev | Ratio |
|--------- |--------------- |---------:|----------:|----------:|------:|
|   Master |     (1, -1, r) | 3.928 ns | 0.0074 ns | 0.0058 ns |  1.00 |
| PR_42073 |     (1, -1, r) | 4.849 ns | 0.0117 ns | 0.0104 ns |  1.23 |
|          |                |          |           |           |       |
|   Master |   (1, 0, &lt;) | 3.815 ns | 0.0131 ns | 0.0109 ns |  1.00 |
| PR_42073 |   (1, 0, &lt;) | 4.583 ns | 0.0107 ns | 0.0094 ns |  1.20 |

Without AggressiveInlining

|   Method | TestStringData |     Mean |     Error |    StdDev | Ratio |
|--------- |--------------- |---------:|----------:|----------:|------:|
|   Master |     (1, -1, r) | 5.320 ns | 0.0132 ns | 0.0124 ns |  1.00 |
| PR_42073 |     (1, -1, r) | 5.337 ns | 0.0083 ns | 0.0074 ns |  1.00 |
|          |                |          |           |           |       |
|   Master |   (1, 0, &lt;) | 5.277 ns | 0.0119 ns | 0.0105 ns |  1.00 |
| PR_42073 |   (1, 0, &lt;) | 5.093 ns | 0.0081 ns | 0.0068 ns |  0.97 |

As said before FindFirstCharacterToEncode won't be inlined, so the previous benchmark-results (in the comments above) are not correct, and without AggressiveInlining the numbers look quite satifying.
I'll re-run the benchmarks, to get the correct numbers.


Aside: when FindFirstCharacterToEncode won't be inlined, I don't unterstand why there's such a difference in the benchmarks. BDN calls the benchmark from a delegate in order to prevent inlining-issues (at least is this what I know), so it feels strange here...or the is something really obvious which I don't see right now.

@gfoidl
Copy link
Member Author

gfoidl commented Oct 25, 2019

I merged the branch from #42064 into this PR (#42064 (comment)), so these changes are also shown right now in the diff.

return mask;
}

private static readonly Vector128<sbyte> s_nibbleMaskSByte = Vector128.Create((sbyte)0xF);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a better name for the mask 0xF?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have been using the exact unicode character name, but maybe a context/usage specific name would be better:
https://www.fileformat.info/info/unicode/char/000f/index.htm

@gfoidl
Copy link
Member Author

gfoidl commented Oct 25, 2019

Current benchmark-numbers are updated in the top-post.
Now it goes in the right direction 😄

@gfoidl
Copy link
Member Author

gfoidl commented Oct 26, 2019

I'll also run benchmarks with SSSE3 disabled, to see if the SSE2 code-path didn't regress.

Edit: results

Test_Escaping with SSSE3 disabled
|   Method |        TestStringData |      Mean |     Error |    StdDev | Ratio | RatioSD |
|--------- |---------------------- |----------:|----------:|----------:|------:|--------:|
|   Master |            (1, -1, r) |  5.346 ns | 0.0242 ns | 0.0215 ns |  1.00 |    0.00 |
| PR_42073 |            (1, -1, r) |  5.348 ns | 0.0337 ns | 0.0315 ns |  1.00 |    0.01 |
|          |                       |           |           |           |       |         |
|   Master |          (1, 0, &lt;) |  5.595 ns | 0.0137 ns | 0.0122 ns |  1.00 |    0.00 |
| PR_42073 |          (1, 0, &lt;) |  5.061 ns | 0.0391 ns | 0.0366 ns |  0.90 |    0.01 |
|          |                       |           |           |           |       |         |
|   Master | (100,(...)kuwu) [111] | 47.783 ns | 0.1844 ns | 0.1540 ns |  1.00 |    0.00 |
| PR_42073 | (100,(...)kuwu) [111] | 34.408 ns | 0.4041 ns | 0.3780 ns |  0.72 |    0.01 |
|          |                       |           |           |           |       |         |
|   Master |  (15, (...)dabu) [25] | 21.280 ns | 0.1031 ns | 0.0965 ns |  1.00 |    0.00 |
| PR_42073 |  (15, (...)dabu) [25] | 19.720 ns | 0.2443 ns | 0.2166 ns |  0.93 |    0.01 |
|          |                       |           |           |           |       |         |
|   Master |  (16, (...)sqfa) [26] | 14.448 ns | 0.1326 ns | 0.1240 ns |  1.00 |    0.00 |
| PR_42073 |  (16, (...)sqfa) [26] | 12.157 ns | 0.0338 ns | 0.0300 ns |  0.84 |    0.01 |
|          |                       |           |           |           |       |         |
|   Master |  (17, (...)aabr) [27] | 15.509 ns | 0.0712 ns | 0.0595 ns |  1.00 |    0.00 |
| PR_42073 |  (17, (...)aabr) [27] | 13.372 ns | 0.0273 ns | 0.0228 ns |  0.86 |    0.00 |
|          |                       |           |           |           |       |         |
|   Master |           (2, -1, dd) |  6.633 ns | 0.0183 ns | 0.0171 ns |  1.00 |    0.00 |
| PR_42073 |           (2, -1, dd) |  6.660 ns | 0.0268 ns | 0.0238 ns |  1.00 |    0.00 |
|          |                       |           |           |           |       |         |
|   Master |  (32, (...)dfpn) [42] | 20.854 ns | 0.8212 ns | 0.9457 ns |  1.00 |    0.00 |
| PR_42073 |  (32, (...)dfpn) [42] | 15.177 ns | 0.0671 ns | 0.0595 ns |  0.72 |    0.03 |
|          |                       |           |           |           |       |         |
|   Master |         (4, -1, negs) |  9.697 ns | 0.0529 ns | 0.0469 ns |  1.00 |    0.00 |
| PR_42073 |         (4, -1, negs) |  9.590 ns | 0.0342 ns | 0.0320 ns |  0.99 |    0.01 |
|          |                       |           |           |           |       |         |
|   Master |      (7, -1, netggni) | 14.180 ns | 0.1087 ns | 0.1017 ns |  1.00 |    0.00 |
| PR_42073 |      (7, -1, netggni) | 13.194 ns | 0.0765 ns | 0.0678 ns |  0.93 |    0.01 |
|          |                       |           |           |           |       |         |
|   Master |    (7, 0, &lt;etggni) |  5.086 ns | 0.0187 ns | 0.0175 ns |  1.00 |    0.00 |
| PR_42073 |    (7, 0, &lt;etggni) |  5.072 ns | 0.0172 ns | 0.0161 ns |  1.00 |    0.00 |
|          |                       |           |           |           |       |         |
|   Master |    (7, 1, n&lt;tggni) |  6.545 ns | 0.0314 ns | 0.0278 ns |  1.00 |    0.00 |
| PR_42073 |    (7, 1, n&lt;tggni) |  6.225 ns | 0.0476 ns | 0.0422 ns |  0.95 |    0.01 |
|          |                       |           |           |           |       |         |
|   Master |    (7, 2, ne&lt;ggni) |  8.075 ns | 0.0699 ns | 0.0619 ns |  1.00 |    0.00 |
| PR_42073 |    (7, 2, ne&lt;ggni) |  7.630 ns | 0.0257 ns | 0.0241 ns |  0.95 |    0.01 |
|          |                       |           |           |           |       |         |
|   Master |    (7, 3, net&lt;gni) |  9.520 ns | 0.0486 ns | 0.0405 ns |  1.00 |    0.00 |
| PR_42073 |    (7, 3, net&lt;gni) |  8.887 ns | 0.0484 ns | 0.0429 ns |  0.93 |    0.00 |
|          |                       |           |           |           |       |         |
|   Master |    (7, 4, netg&lt;ni) | 11.062 ns | 0.0419 ns | 0.0371 ns |  1.00 |    0.00 |
| PR_42073 |    (7, 4, netg&lt;ni) | 10.023 ns | 0.0461 ns | 0.0431 ns |  0.91 |    0.01 |
|          |                       |           |           |           |       |         |
|   Master |    (7, 5, netgg&lt;i) | 12.103 ns | 0.0438 ns | 0.0410 ns |  1.00 |    0.00 |
| PR_42073 |    (7, 5, netgg&lt;i) | 11.176 ns | 0.0405 ns | 0.0379 ns |  0.92 |    0.00 |
|          |                       |           |           |           |       |         |
|   Master |    (7, 6, netggn&lt;) | 13.662 ns | 0.0551 ns | 0.0488 ns |  1.00 |    0.00 |
| PR_42073 |    (7, 6, netggn&lt;) | 14.112 ns | 0.0671 ns | 0.0595 ns |  1.03 |    0.01 |
|          |                       |           |           |           |       |         |
|   Master |     (8, -1, jgnavpkd) | 10.759 ns | 0.0354 ns | 0.0332 ns |  1.00 |    0.00 |
| PR_42073 |     (8, -1, jgnavpkd) | 12.192 ns | 0.1083 ns | 0.1013 ns |  1.13 |    0.01 |
|          |                       |           |           |           |       |         |
|   Master |   (8, 0, &lt;gnavpkd) | 10.771 ns | 0.0463 ns | 0.0387 ns |  1.00 |    0.00 |
| PR_42073 |   (8, 0, &lt;gnavpkd) | 11.500 ns | 0.0812 ns | 0.0719 ns |  1.07 |    0.01 |
|          |                       |           |           |           |       |         |
|   Master |   (8, 1, j&lt;navpkd) | 11.599 ns | 1.0037 ns | 1.0307 ns |  1.00 |    0.00 |
| PR_42073 |   (8, 1, j&lt;navpkd) | 11.356 ns | 0.0300 ns | 0.0281 ns |  0.98 |    0.08 |
|          |                       |           |           |           |       |         |
|   Master |   (8, 2, jg&lt;avpkd) | 10.636 ns | 0.0440 ns | 0.0390 ns |  1.00 |    0.00 |
| PR_42073 |   (8, 2, jg&lt;avpkd) | 13.819 ns | 0.0388 ns | 0.0344 ns |  1.30 |    0.00 |
|          |                       |           |           |           |       |         |
|   Master |   (8, 3, jgn&lt;vpkd) | 10.970 ns | 0.0311 ns | 0.0291 ns |  1.00 |    0.00 |
| PR_42073 |   (8, 3, jgn&lt;vpkd) | 11.487 ns | 0.0464 ns | 0.0434 ns |  1.05 |    0.01 |
|          |                       |           |           |           |       |         |
|   Master |   (8, 4, jgna&lt;pkd) | 11.017 ns | 0.1310 ns | 0.1023 ns |  1.00 |    0.00 |
| PR_42073 |   (8, 4, jgna&lt;pkd) | 11.514 ns | 0.0632 ns | 0.0591 ns |  1.05 |    0.01 |
|          |                       |           |           |           |       |         |
|   Master |   (8, 5, jgnav&lt;kd) | 11.011 ns | 0.0353 ns | 0.0313 ns |  1.00 |    0.00 |
| PR_42073 |   (8, 5, jgnav&lt;kd) | 11.497 ns | 0.0333 ns | 0.0295 ns |  1.04 |    0.00 |
|          |                       |           |           |           |       |         |
|   Master |   (8, 6, jgnavp&lt;d) | 10.996 ns | 0.0220 ns | 0.0172 ns |  1.00 |    0.00 |
| PR_42073 |   (8, 6, jgnavp&lt;d) | 11.485 ns | 0.0477 ns | 0.0446 ns |  1.04 |    0.00 |
|          |                       |           |           |           |       |         |
|   Master |   (8, 7, jgnavpk&lt;) | 10.981 ns | 0.0584 ns | 0.0547 ns |  1.00 |    0.00 |
| PR_42073 |   (8, 7, jgnavpk&lt;) | 11.461 ns | 0.0264 ns | 0.0247 ns |  1.04 |    0.01 |
|          |                       |           |           |           |       |         |
|   Master |    (9, -1, csvobsdxs) | 11.978 ns | 0.0310 ns | 0.0275 ns |  1.00 |    0.00 |
| PR_42073 |    (9, -1, csvobsdxs) | 13.429 ns | 0.0560 ns | 0.0467 ns |  1.12 |    0.01 |

Discussion

  • where perf improves it is from "combining" two char-vectors to one byte-vetor, but the bitmask-approach (SSSE3) is still faster
  • where perf decreaes it is due higher overhead -- only small inputs are affected
  • I assume it is quite uncommon that SSSE3 isn't available (when SSE2 is available, so I would not like to special case the code further (e.g. don't process the remainder of the vectorized-codepath again vectorized only in SSE2)

Thoughts (especially about the last point)?


Vector128<sbyte> mask = Sse2.And(bitPositions, bitMask);

Vector128<sbyte> ne0Gt = Sse2.CompareGreaterThan(mask, s_nullMaskSByte);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you're looking for the equivalent of "not equal to zero", I believe the following will produce more efficient codegen:

return Sse2.CompareEqual(Sse2.CompareEqual(mask, default), default);

(And it'd allow getting rid of the static field.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! 👍
It safes the oring of the tmp-comparisons.

So far I didn't know that the JIT treats default here the same way as Vector128<sbyte>.Zero.

(And it'd allow getting rid of the static field.)

The static field is intentional, as the JIT hoists it outside the loop.

w/o static ro field
; ...
       C4E25900DB           vpshufb  xmm3, xmm4, xmm3
       C4E25100E7           vpshufb  xmm4, xmm5, xmm7
       C5D9DBDB             vpand    xmm3, xmm4, xmm3
       C5D857E4             vxorps   xmm4, xmm4
       C5D057ED             vxorps   xmm5, xmm5
       C5D174DB             vpcmpeqb xmm3, xmm5, xmm3
       C5D974DB             vpcmpeqb xmm3, xmm4, xmm3
       C5F9D7C3             vpmovmskb eax, xmm3
; ...
with static ro field
; ...
       C5F9105808           vmovupd  xmm3, xmmword ptr [rax+8]

G_M60692_IG11:      ;; bbWeight=4   ; loop start
; ...
       C4E25100E4           vpshufb  xmm4, xmm5, xmm4
       C4C24900E9           vpshufb  xmm5, xmm6, xmm9
       C5D1DBE4             vpand    xmm4, xmm5, xmm4
       C5B974E4             vpcmpeqb xmm4, xmm8, xmm4
       C5B974E4             vpcmpeqb xmm4, xmm8, xmm4
       C5F9D7C4             vpmovmskb eax, xmm4
; ...

@gfoidl
Copy link
Member Author

gfoidl commented Nov 2, 2019

Resolved the merge-conflict with master (src/System.Text.Encodings.Web/src/System/Text/Encodings/Web/Sse2Helper.cs).

@GrabYourPitchforks
Copy link
Member

Test failure is unrelated. I can merge if you have no further changes you wish to make.

@gfoidl
Copy link
Member Author

gfoidl commented Nov 2, 2019

I have no further changes to this PR.

@GrabYourPitchforks GrabYourPitchforks merged commit ba320d4 into dotnet:master Nov 3, 2019
@GrabYourPitchforks
Copy link
Member

It's in! Thanks so much for the contribution. :D

@gfoidl gfoidl deleted the ImproveEscapingCheck branch November 3, 2019 00:32
picenka21 pushed a commit to picenka21/runtime that referenced this pull request Feb 18, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants