Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Switch from simd to packed_simd #23

Closed
jrmuizel opened this issue Oct 27, 2017 · 62 comments
Closed

Switch from simd to packed_simd #23

jrmuizel opened this issue Oct 27, 2017 · 62 comments

Comments

@jrmuizel
Copy link

stdsimd is the replacement for simd

@hsivonen
Copy link
Owner

hsivonen commented Nov 1, 2017

Thanks. I hadn't noticed that stdsimd had gained NEON support.

To my surprise, there are no longer distinct boolean vector types in stdsimd. They seem to be gone from WebAssembly SIMD, too!

Yet, I don't see analogs for the all() and any() methods that the boolean vectors had in the simd crate.

I guess I should find out the rationale for the change and what the design intent is for replacements for all() and any().

@BurntSushi
Copy link
Contributor

@hsivonen The current plan is to be as conservative as possible. I don't think the OP, "stdsimd is the replacement for simd," is accurate as of present. My suspicion is that the next phase is probably going to be something like, "stable, but inconvenient portability story."

@hsivonen
Copy link
Owner

hsivonen commented Mar 8, 2018

Portable SIMD got merged in to stdsimd, so I'm going to see if encoding_rs can now be ported over.

@hsivonen
Copy link
Owner

hsivonen commented Mar 8, 2018

This is happening on a branch for now.

Swapping out the dependency was pretty easy for the portable part. (I didn't switch the intrinsics over yet.) I added a Simd trait with the associated lane type on the encoding_rs side in order to declare the signature for shuffles.

On x86_64, performance regressed for everything the depends heavily on any() and all(). From prior stdsimd source inspection, this was to be expected, since stdsimd computes unnecessary bits if the LLVM optimizer isn't smart enough not to compute them. I haven't looked at the disassembly yet, but it seems to me that either stdsimd needs to allow for architecture-specific manually-written any()/all()/none() or LLVM needs to get more complex pattern matching in the optimizer.

@hsivonen
Copy link
Owner

hsivonen commented Mar 9, 2018

Yesterday, I failed to notice that in stdsimd comparing two u16x8s yields b8x8 and there is no b16x8. Commented on the stdsimd issue.

@hsivonen
Copy link
Owner

hsivonen commented Apr 7, 2018

stdsimd now has mask vectors which are like simd crate's boolean vectors. The stdsimd branch of encoding_rs compiles again and the x86_64 asm for any() and all() is now a bit better than on March 8, but it's still worse than what the simd crate produces.

stdsimd issue. LLVM bug.

@ignatenkobrain
Copy link

any news here?

@hsivonen
Copy link
Owner

hsivonen commented Aug 7, 2018

No news yet. I was away for a bit and will need to catch up on the situation regarding any()/all() inlining with stdsimd, especially in the ARMv7+NEON case.

@gnzlbg
Copy link

gnzlbg commented Aug 14, 2018

@hsivonen packed_simd fixed all of those issues a while ago, and provides many many new features (shuffles with run-time indices, portable vector gather/scatters, ...).

Let me know if that isn't the case, I think the simd crate can pretty much be removed. All its features are now available in packed_simd, all examples have been ported, etc.

@hsivonen
Copy link
Owner

@gnzlbg, can I now get fast boolean reductions on ARMv7+NEON without having the standard library compiled with NEON enabled? I.e. do I need to revive that PR and debug the Android Docker issue?

@gnzlbg
Copy link

gnzlbg commented Aug 28, 2018

You can if you enable the coresimd feature of the packed_simd crate (it will use coresimd instead of core::arch for the intrinsics, and coresimd can be recompiled by user code without issues, workaround having to recompile the std library).

Note, however, that this uses a particular released version of coresimd in crates.io, which might break at some point due to changes on nightly. If that ever happens, you can just report it in the stdsimd repo so that we do a new release, and in the mean time, patch it to use the stdsimd git repo directly.

@gnzlbg
Copy link

gnzlbg commented Aug 28, 2018

I just released version 0.2 of packed_simd, you should prefer that over the 0.1 version, which did not allow doing that (because it relied on coresimd as a git dependency, which had to be disabled for crates.io).

I.e. do I need to revive that PR and debug the Android Docker issue?

Yeah, you should (its almost there, and if you have access to a linux workstation debugging locally shouldn't be too hard). This is a short term workaround that will never work on stable.

@hsivonen
Copy link
Owner

I've ported encoding_rs to packed_simd on a branch.

At least on x86_64 (Haswell desktop i7), there are repeatable performance differences. Here's the output of per-benchmark best results from 4 simd runs vs. per-benchmark best results from 4 packed_simd runs with threshold of inclusion at 2 percentage points or larger difference:

 name                                          simd.txt ns/iter     packed.txt ns/iter   diff ns/iter   diff %  speedup 
 bench_decode_to_string_tr                     79,661 (3677 MB/s)   77,881 (3761 MB/s)         -1,780   -2.23%   x 1.02 
 bench_decode_to_utf16_en_windows_1252         68,420 (10391 MB/s)  66,839 (10637 MB/s)        -1,581   -2.31%   x 1.02 
 bench_decode_to_utf16_zh_cn_utf_16be          36,308 (15063 MB/s)  37,074 (14751 MB/s)           766    2.11%   x 0.98 
 bench_mem_convert_latin1_to_str_1             3,642 (137 MB/s)     3,510 (142 MB/s)             -132   -3.62%   x 1.04 
 bench_mem_convert_latin1_to_utf8_1            1,823 (274 MB/s)     1,697 (294 MB/s)             -126   -6.91%   x 1.07 
 bench_mem_convert_latin1_to_utf8_3            2,705 (554 MB/s)     2,538 (591 MB/s)             -167   -6.17%   x 1.07 
 bench_mem_convert_str_to_utf16_ascii_1        1,730 (289 MB/s)     1,821 (274 MB/s)               91    5.26%   x 0.95 
 bench_mem_convert_str_to_utf16_ascii_16       2,067 (3870 MB/s)    2,130 (3755 MB/s)              63    3.05%   x 0.97 
 bench_mem_convert_utf16_to_latin1_lossy_1     846 (1182 MB/s)      989 (1011 MB/s)               143   16.90%   x 0.86 
 bench_mem_convert_utf16_to_latin1_lossy_1000  15,605 (64082 MB/s)  16,904 (59157 MB/s)         1,299    8.32%   x 0.92 
 bench_mem_convert_utf16_to_latin1_lossy_15    3,910 (3836 MB/s)    4,001 (3749 MB/s)              91    2.33%   x 0.98 
 bench_mem_convert_utf16_to_latin1_lossy_16    1,823 (8776 MB/s)    1,949 (8209 MB/s)             126    6.91%   x 0.94 
 bench_mem_convert_utf16_to_latin1_lossy_3     1,330 (2255 MB/s)    1,371 (2188 MB/s)              41    3.08%   x 0.97 
 bench_mem_convert_utf16_to_str_ascii_16       4,858 (3293 MB/s)    4,671 (3425 MB/s)            -187   -3.85%   x 1.04 
 bench_mem_convert_utf16_to_utf8_ascii_1       2,928 (341 MB/s)     2,987 (334 MB/s)               59    2.02%   x 0.98 
 bench_mem_convert_utf16_to_utf8_ascii_16      3,353 (4771 MB/s)    3,443 (4647 MB/s)              90    2.68%   x 0.97 
 bench_mem_convert_utf8_to_latin1_lossy_30     6,051 (2478 MB/s)    6,189 (2423 MB/s)             138    2.28%   x 0.98 
 bench_mem_convert_utf8_to_utf16_ascii_1       5,670 (88 MB/s)      5,791 (86 MB/s)               121    2.13%   x 0.98 
 bench_mem_copy_ascii_to_ascii_1               923 (541 MB/s)       1,049 (476 MB/s)              126   13.65%   x 0.88 
 bench_mem_copy_ascii_to_ascii_1000            16,087 (31080 MB/s)  20,463 (24434 MB/s)         4,376   27.20%   x 0.79 
 bench_mem_copy_ascii_to_ascii_15              4,893 (1532 MB/s)    4,544 (1650 MB/s)            -349   -7.13%   x 1.08 
 bench_mem_copy_ascii_to_ascii_16              945 (8465 MB/s)      973 (8221 MB/s)                28    2.96%   x 0.97 
 bench_mem_copy_ascii_to_ascii_3               1,254 (1196 MB/s)    1,286 (1166 MB/s)              32    2.55%   x 0.98 
 bench_mem_copy_ascii_to_ascii_30              5,151 (2912 MB/s)    4,536 (3306 MB/s)            -615  -11.94%   x 1.14 
 bench_mem_copy_basic_latin_to_ascii_1         940 (1063 MB/s)      988 (1012 MB/s)                48    5.11%   x 0.95 
 bench_mem_copy_basic_latin_to_ascii_16        1,340 (11940 MB/s)   1,407 (11371 MB/s)             67    5.00%   x 0.95 
 bench_mem_copy_basic_latin_to_ascii_3         1,422 (2109 MB/s)    1,458 (2057 MB/s)              36    2.53%   x 0.98 
 bench_mem_ensure_utf16_validity_ascii_16      2,962 (5401 MB/s)    3,090 (5177 MB/s)             128    4.32%   x 0.96 
 bench_mem_ensure_utf16_validity_ascii_30      5,429 (5525 MB/s)    5,277 (5685 MB/s)            -152   -2.80%   x 1.03 
 bench_mem_ensure_utf16_validity_bmp_16        2,962 (5401 MB/s)    3,090 (5177 MB/s)             128    4.32%   x 0.96 
 bench_mem_ensure_utf16_validity_bmp_30        5,426 (5528 MB/s)    5,276 (5686 MB/s)            -150   -2.76%   x 1.03 
 bench_mem_ensure_utf16_validity_latin1_16     2,962 (5401 MB/s)    2,838 (5637 MB/s)            -124   -4.19%   x 1.04 
 bench_mem_ensure_utf16_validity_latin1_30     5,406 (5549 MB/s)    5,277 (5685 MB/s)            -129   -2.39%   x 1.02 
 bench_mem_is_basic_latin_false_16             1,446 (11065 MB/s)   1,323 (12093 MB/s)           -123   -8.51%   x 1.09 
 bench_mem_is_basic_latin_false_30             1,810 (16574 MB/s)   1,934 (15511 MB/s)            124    6.85%   x 0.94 
 bench_mem_is_basic_latin_true_1000            14,091 (70967 MB/s)  16,297 (61360 MB/s)         2,206   15.66%   x 0.86 
 bench_mem_is_str_latin1_true_1                643 (777 MB/s)       709 (705 MB/s)                 66   10.26%   x 0.91 
 bench_mem_is_str_latin1_true_1000             28,798 (17362 MB/s)  21,069 (23731 MB/s)        -7,729  -26.84%   x 1.37 
 bench_mem_is_str_latin1_true_15               3,124 (2400 MB/s)    3,772 (1988 MB/s)             648   20.74%   x 0.83 
 bench_mem_is_str_latin1_true_3                1,422 (1054 MB/s)    917 (1635 MB/s)              -505  -35.51%   x 1.55 
 bench_mem_is_str_latin1_true_30               3,623 (4140 MB/s)    3,550 (4225 MB/s)             -73   -2.01%   x 1.02 
 bench_mem_is_utf16_latin1_true_1000           13,932 (71777 MB/s)  14,512 (68908 MB/s)           580    4.16%   x 0.96 
 bench_mem_is_utf8_latin1_false_1000           1,041 (480307 MB/s)  975 (512820 MB/s)             -66   -6.34%   x 1.07 
 bench_mem_is_utf8_latin1_false_16             1,041 (7684 MB/s)    974 (8213 MB/s)               -67   -6.44%   x 1.07 
 bench_mem_is_utf8_latin1_false_30             1,041 (14409 MB/s)   974 (15400 MB/s)              -67   -6.44%   x 1.07 
 bench_mem_is_utf8_latin1_true_15              4,761 (1575 MB/s)    5,403 (1388 MB/s)             642   13.48%   x 0.88 
 bench_mem_is_utf8_latin1_true_16              4,488 (1782 MB/s)    4,890 (1635 MB/s)             402    8.96%   x 0.92 
 bench_mem_is_utf8_latin1_true_3               915 (1639 MB/s)      1,806 (830 MB/s)              891   97.38%   x 0.51 
 bench_mem_utf16_valid_up_to_ascii_3           2,644 (1134 MB/s)    2,579 (1163 MB/s)             -65   -2.46%   x 1.03 

Many of the changes are with buffer sizes that are so short that SIMD code isn't used, which suggests flips in branch prediction heuristics or something like that.

The actually worrying cases are these:

 bench_mem_copy_ascii_to_ascii_1000            16,087 (31080 MB/s)  20,463 (24434 MB/s)         4,376   27.20%   x 0.79 
 bench_mem_is_basic_latin_true_1000            14,091 (70967 MB/s)  16,297 (61360 MB/s)         2,206   15.66%   x 0.86 
 bench_mem_is_utf16_latin1_true_1000           13,932 (71777 MB/s)  14,512 (68908 MB/s)           580    4.16%   x 0.96 

@hsivonen
Copy link
Owner

Oh, and those at at opt_level=2, because that's what Firefox builds with.

@gnzlbg
Copy link

gnzlbg commented Sep 12, 2018

Which packed_simd features are you enabling ? Which target? Which RUSTFLAGS ?

Could you extract the code of those three benchmarks into its own crate ?

@gnzlbg
Copy link

gnzlbg commented Sep 12, 2018

Also the worst benchmark appears to be bench_mem_is_utf8_latin1_true_3

@gnzlbg
Copy link

gnzlbg commented Sep 12, 2018

I've left a couple of comments in your branch, there is one suspicious function there that might be causing the regression, we should take a closer look at that comparing the output of the simd and packed_simd crates for the version with and without select (without select, both crates should generate exactly the same code).

@hsivonen
Copy link
Owner

Which packed_simd features are you enabling ? Which target? Which RUSTFLAGS ?

The above numbers were with the into_bits feature enabled, x86_64-unknown-linux-gnu target, -C opt_level=2 in RUSTFLAGS on Ubuntu 18.04 on Intel Core i7-4770 @ 3.40 GHz.

Could you extract the code of those three benchmarks into its own crate ?

I'll try to minimize the test case.

Also the worst benchmark appears to be bench_mem_is_utf8_latin1_true_3

Indeed. That benchmark shouldn't even run SIMD code, because 3 < 16, which is why I speculated about prediction heuristics flipping or something.

I've left a couple of comments in your branch

Thanks!

there is one suspicious function there that might be causing the regression

The suspicious function affects only x-user-defined to UTF-16 decoding and is performance-neutral. (The change is based on previous advice from you. :-)

@gnzlbg
Copy link

gnzlbg commented Sep 13, 2018

I'll try to minimize the test case.

Thank you! Feel free to open an issue in packed_simd and we'll look at it.

@hsivonen
Copy link
Owner

On aarch64 (ThunderX), the fluctuation of the numbers is a bit larger, so there's are more changes in both directions with the 2-percentage-point threshold. I'm not too worried about that.

Instead of bench_mem_is_utf8_latin1_N with small N regressing, the closely related bench_mem_is_str_latin1_true_N with small N regresses.

(These are with the horizontal max intrinsics instead of the portable horizontal max operations.)

@gnzlbg
Copy link

gnzlbg commented Sep 13, 2018

@hsivonen could you try again with packed_simd master branch for both x86 and aarch64?

@gnzlbg
Copy link

gnzlbg commented Sep 13, 2018

Maybe there is some inlining issue / target-feature issue popping up, since the simd crate uses platform-intrinsics and the packed_simd crate uses core::arch, so maybe RUSTFLAGS have something to do with this. As in, maybe the simd crate is using a target feature that it shouldn't be allow to use for a particular CPU and that just happens to work because the CPUs that you are trying do support the feature? You could also try with RUSTFLAGS="-C target-cpu=native" and see how they compare, and maybe then try nailing things down with RUSTFLAGS="-C target-feature=+sse4.2" and see what happens.

@hsivonen
Copy link
Owner

On x86_64 the regressions go away when building with the coresimd feature and without RUSTFLAGS. Up next: Figuring out which one of those makes the difference.

@hsivonen
Copy link
Owner

It seems that opt_level=2 is the problem. With or without coresimd feature flag affects which tests regress, though.

@hsivonen
Copy link
Owner

hsivonen commented Oct 15, 2018

Sorry about the delay. After updating packed_simd (but staying on rustc from the turn of the month, because packaging for clippy blocks updates), I still see perf regressions at opt_level=2 with encoding_bench (and now also a regression of bench_mem_utf16_valid_up_to_ascii_1000 at opt_level=3). However, when trying to create a smaller test case for the opt_level=2 regressions by extracting the code to a single-file crate, the regressions go away.

Still more to investigate.

Meanwhile: It appears that cargo is getting the capability to set per-crate opt_level, so maybe this is going to be solved by building encoding_rs at opt_level=3 in Firefox. OTOH, a build system peer considered packed_simd's compatibilily policy with only the latest nightly as a blocker, so it seems that migration from simd to packed_simd has to wait until packed_simd makes it into the standard library.

@BurntSushi
Copy link
Contributor

OTOH, a build system peer considered packed_simd's compatibilily policy with only the latest nightly as a blocker, so it seems that migration from simd to packed_simd has to wait until packed_simd makes it into the standard library.

I thought that was simd's policy also? At least, in the past, IIRC, if a nightly came out that broke simd, we would fix it without necessarily keeping compatibility with older nightlies.

It may be that packed_simd makes this more of a problem in practice since it is being actively developed.

@gnzlbg
Copy link

gnzlbg commented Oct 15, 2018

I thought that was simd's policy also?

It is simd's crate policy and core::arch::{arm, aarch64} policy as well.

@hsivonen
Copy link
Owner

However, when trying to create a smaller test case for the opt_level=2 regressions by extracting the code to a single-file crate, the regressions go away.

Specifically, the code using simd and the code using packed_simd compile to identical assembly except for label naming.

I thought that was simd's policy also?

In practice, simd isn't being developed further, so it doesn't adopt new Rust features that would require a new compiler version and, OTOH, rustc hasn't broken the relevant parts of simd in a long while.

What's the practical breakage situation with packed_simd?

@gnzlbg
Copy link

gnzlbg commented Oct 15, 2018

What's the practical breakage situation with packed_simd?

The library cannot build on stable Rust, only on nightly. It is still being developed, with new features, bug fixes, and performance fixes being added every now and then. Sometimes these require patching rustc, and we tend to then start using the features as soon as the next nightly is released, and until now the only stakeholders that complain are the rand developers, but we tend to let them know when breaking changes happen and we help with the PR required to upgrade. This isn't too hard because rand unstable follows the latest nightly anyways.

If a bigger stake holder (e.g. servo / firefox) with stricter "unstable stability requirements" starts depending on packed_simd, then we'll have to adapt our workflow around those requirements.

For example, we could add a CI build bot to test the nightly version used by stakeholders, and add any new features that would break that behind a feature gate. Once all stakeholders upgrade, then we could remove those feature gates.

@gnzlbg
Copy link

gnzlbg commented Jan 21, 2019

FWIW the error message could have been a lot clearer, you might want to fill in an issue in rust-lang/rust upstream about it.

@hsivonen
Copy link
Owner

OK. With this problem fixed, it compiles for thumbv7neon but compiling for aarch64 fails, because the compiler tries to make a 1 GB memory allocation when compiling packed_simd 0.3.1.

Is this a previously known problem?

memory allocation of 1073741824 bytes failed
error: Could not compile packed_simd.

Caused by:
process didn't exit successfully: rustc --edition=2018 --crate-name packed_simd /home/hsivonen/.cargo/registry/src/github.com-1ecc6299db9ec823/packed_simd-0.3.1/src/lib.rs --color always --crate-type lib --emit=dep-info,link -C debuginfo=2 --cfg 'feature="default"' --cfg 'feature="into_bits"' -C metadata=2b4eed31d1b1027b -C extra-filename=-2b4eed31d1b1027b --out-dir /home/hsivonen/Projects/encoding_rs/target/debug/deps -L dependency=/home/hsivonen/Projects/encoding_rs/target/debug/deps --extern cfg_if=/home/hsivonen/Projects/encoding_rs/target/debug/deps/libcfg_if-e6de5e9a7e5221ad.rlib --cap-lints allow (signal: 6, SIGABRT: process abort signal)

@gnzlbg
Copy link

gnzlbg commented Jan 21, 2019

Is this a previously known problem?

I haven't seen that one before, which target are you using? aarch64-unknown-linux-android ? It does not surprise me much, rustc consumes a lot of memory when compiling packed_simd, but compilation passes on targets without memory overcommit like windows on constrained platforms like appveyor so, 1Gb does look like a lot here.

@hsivonen
Copy link
Owner

I haven't seen that one before, which target are you using?

Self-hosted aarch64-unknown-linux-gnu.

@hsivonen
Copy link
Owner

Hmm. Trying to compile on an x86_64 host both for aarch64 and x86_64 shows that rustc's memory usage climbs to a bit over 2 GB when compiling packed_simd 0.3.1.

I'm pretty sure that 0.3.0 with earlier rustc didn't have this problem, but given the earlier comments today, I no longer trust myself having actually built this code on aarch64 previously.

I'll see if I can buy more RAM for my aarch64 VM, since it's annoying to run cargo bench with cross-compile.

@gnzlbg
Copy link

gnzlbg commented Jan 21, 2019

Sadly it is not easy to only build a "subset" of packed_simd (say just 128-bit wide vector support).

I've been thinking about making packed_simd "lazy", such that code is monomorphized when a user uses a type instead of always at the build time of packed_simd, but that's a trade-off.

@hsivonen
Copy link
Owner

Sadly it is not easy to only build a "subset" of packed_simd (say just 128-bit wide vector support).

Maybe I should try commenting out code to get some results.

I'll see if I can buy more RAM for my aarch64 VM

The provider represents their aarch64 capacity as "out of stock", so can't get more RAM today.

@gnzlbg
Copy link

gnzlbg commented Jan 21, 2019

Maybe I should try commenting out code to get some results.

You can try commenting out certain APIs, e.g., in src/api.rs but commenting out vector widths won't easily work (some features like vectors of pointers couple vectors of different lengths).

@hsivonen
Copy link
Owner

The build-time RAM requirement can be addressed with swap space plus waiting for the resulting IO, so it seems that building packed_simd at present doesn't impose a hard requirement on physical RAM.

@hsivonen
Copy link
Owner

FWIW the error message could have been a lot clearer, you might want to fill in an issue in rust-lang/rust upstream about it.

Filed

@hsivonen
Copy link
Owner

Filed an issue about rustc memory usage when compiling packed_simd.

@hsivonen
Copy link
Owner

In case ripgrep users find their way here:

This Firefox bug has "Depends on" and "See also" fields that point to relevant bugs/issues that are blocking migration at present.

@hsivonen hsivonen changed the title Switch from simd to stdsimd Switch from simd to packed_simd Jan 28, 2019
@hsivonen
Copy link
Owner

thumbv7neon-unknown-linux-gnueabihf is now available but I'm still seeing perf regressions on ARMv7 compared to the simd crate. I'll try to run the benchmarks again to be sure. This was with the crates.io release 0.3.1. I'll try with a git clone.

So far, I don't see a simple characterization of the regression other than it seems to generally affect things that I expect to correlate with horizontal boolean reduction and does not seem to really affect workloads that are mostly about shuffles.

@gnzlbg
Copy link

gnzlbg commented Jan 30, 2019

correlate with horizontal boolean reduction

From which types (integers, floats, etc.) are the masks created ?


Unrelated update, some stdsimd refactorings have landed in nightly, and packed_simd should start to build again properly soon.

@hsivonen
Copy link
Owner

Discovered so far: The presence of thumb trampolines is the very first thing that stands out in the assembly. I need to go back and run the simd crate baseline in thumb mode.

From which types (integers, floats, etc.) are the masks created ?

From u8x16 and u16x8.

Unrelated update, some stdsimd refactorings have landed in nightly, and packed_simd should start to build again properly soon.

It seems to be in the present nightly already. Thanks!

@hsivonen
Copy link
Owner

The presence of thumb trampolines is the very first thing that stands out in the assembly.

Once we get past the trampolines on the crate boundary, inlining from core::arch and packed_simd appears to have worked.

@hsivonen
Copy link
Owner

Thumb-to-Thumb comparison still shows a regression.

@hsivonen
Copy link
Owner

With the simd crate, building encoding_rs with --release and --emit asm emits one .s file. With packed_simd 31 .rcgu.s files are emitted. https://doc.rust-lang.org/rustc/codegen-options/index.html suggests that multiple codegen-units can lead to slower code. RUSTFLAGS='-C codegen-units=1' does not appear to change things.

@gnzlbg
Copy link

gnzlbg commented Jan 30, 2019

@hsivonen can you fill a rust-lang/rust about the multiple codegen-units issue? cc @mw

@hsivonen
Copy link
Owner

can you fill a rust-lang/rust about the multiple codegen-units issue?

Filed

@hsivonen
Copy link
Owner

hsivonen commented Feb 5, 2019

encoding_rs::mem::copy_ascii_to_ascii regresses significantly. To start with, the inlining situation differs. With manual always/never choices, the results are counter-intuitive (never faster than always with simd), but simd is still faster:

simd, inline(never)
test bench_mem_copy_ascii_to_ascii_1000 ... bench: 120,045 ns/iter (+/- 686) = 4165 MB/s

simd, inline(always)
test bench_mem_copy_ascii_to_ascii_1000 ... bench: 129,785 ns/iter (+/- 5,024) = 3852 MB/s

packed_simd, inline(never)
test bench_mem_copy_ascii_to_ascii_1000 ... bench: 164,637 ns/iter (+/- 3,623) = 3036 MB/s

packed_simd, inline(always)
test bench_mem_copy_ascii_to_ascii_1000 ... bench: 160,739 ns/iter (+/- 9,820) = 3110 MB/s

For the never cases, here's the assembly from objdump of the benching binary.

simd:

0006c160 <_ZN11encoding_rs3mem19copy_ascii_to_ascii17he774493b8d7dcf7bE>:
   6c160:	b580      	push	{r7, lr}
   6c162:	428b      	cmp	r3, r1
   6c164:	d36a      	bcc.n	6c23c <_ZN11encoding_rs3mem19copy_ascii_to_ascii17he774493b8d7dcf7bE+0xdc>
   6c166:	468e      	mov	lr, r1
   6c168:	2910      	cmp	r1, #16
   6c16a:	d31b      	bcc.n	6c1a4 <_ZN11encoding_rs3mem19copy_ascii_to_ascii17he774493b8d7dcf7bE+0x44>
   6c16c:	f002 030f 	and.w	r3, r2, #15
   6c170:	f1ae 0c10 	sub.w	ip, lr, #16
   6c174:	0701      	lsls	r1, r0, #28
   6c176:	d01a      	beq.n	6c1ae <_ZN11encoding_rs3mem19copy_ascii_to_ascii17he774493b8d7dcf7bE+0x4e>
   6c178:	b373      	cbz	r3, 6c1d8 <_ZN11encoding_rs3mem19copy_ascii_to_ascii17he774493b8d7dcf7bE+0x78>
   6c17a:	2300      	movs	r3, #0
   6c17c:	18c1      	adds	r1, r0, r3
   6c17e:	f921 0a0f 	vld1.8	{d0-d1}, [r1]
   6c182:	ef89 2050 	vshr.s8	q1, q0, #7
   6c186:	ff02 2a03 	vpmax.u8	d2, d2, d3
   6c18a:	ff02 2a00 	vpmax.u8	d2, d2, d0
   6c18e:	ee12 1b10 	vmov.32	r1, d2[0]
   6c192:	2900      	cmp	r1, #0
   6c194:	d14a      	bne.n	6c22c <_ZN11encoding_rs3mem19copy_ascii_to_ascii17he774493b8d7dcf7bE+0xcc>
   6c196:	18d1      	adds	r1, r2, r3
   6c198:	3310      	adds	r3, #16
   6c19a:	4563      	cmp	r3, ip
   6c19c:	f901 0a0f 	vst1.8	{d0-d1}, [r1]
   6c1a0:	d9ec      	bls.n	6c17c <_ZN11encoding_rs3mem19copy_ascii_to_ascii17he774493b8d7dcf7bE+0x1c>
   6c1a2:	e043      	b.n	6c22c <_ZN11encoding_rs3mem19copy_ascii_to_ascii17he774493b8d7dcf7bE+0xcc>
   6c1a4:	2300      	movs	r3, #0
   6c1a6:	4573      	cmp	r3, lr
   6c1a8:	d342      	bcc.n	6c230 <_ZN11encoding_rs3mem19copy_ascii_to_ascii17he774493b8d7dcf7bE+0xd0>
   6c1aa:	4670      	mov	r0, lr
   6c1ac:	bd80      	pop	{r7, pc}
   6c1ae:	b33b      	cbz	r3, 6c200 <_ZN11encoding_rs3mem19copy_ascii_to_ascii17he774493b8d7dcf7bE+0xa0>
   6c1b0:	2300      	movs	r3, #0
   6c1b2:	18c1      	adds	r1, r0, r3
   6c1b4:	f921 0acf 	vld1.64	{d0-d1}, [r1]
   6c1b8:	ef89 2050 	vshr.s8	q1, q0, #7
   6c1bc:	ff02 2a03 	vpmax.u8	d2, d2, d3
   6c1c0:	ff02 2a00 	vpmax.u8	d2, d2, d0
   6c1c4:	ee12 1b10 	vmov.32	r1, d2[0]
   6c1c8:	bb81      	cbnz	r1, 6c22c <_ZN11encoding_rs3mem19copy_ascii_to_ascii17he774493b8d7dcf7bE+0xcc>
   6c1ca:	18d1      	adds	r1, r2, r3
   6c1cc:	3310      	adds	r3, #16
   6c1ce:	4563      	cmp	r3, ip
   6c1d0:	f901 0a0f 	vst1.8	{d0-d1}, [r1]
   6c1d4:	d9ed      	bls.n	6c1b2 <_ZN11encoding_rs3mem19copy_ascii_to_ascii17he774493b8d7dcf7bE+0x52>
   6c1d6:	e029      	b.n	6c22c <_ZN11encoding_rs3mem19copy_ascii_to_ascii17he774493b8d7dcf7bE+0xcc>
   6c1d8:	2300      	movs	r3, #0
   6c1da:	18c1      	adds	r1, r0, r3
   6c1dc:	f921 0a0f 	vld1.8	{d0-d1}, [r1]
   6c1e0:	ef89 2050 	vshr.s8	q1, q0, #7
   6c1e4:	ff02 2a03 	vpmax.u8	d2, d2, d3
   6c1e8:	ff02 2a00 	vpmax.u8	d2, d2, d0
   6c1ec:	ee12 1b10 	vmov.32	r1, d2[0]
   6c1f0:	b9e1      	cbnz	r1, 6c22c <_ZN11encoding_rs3mem19copy_ascii_to_ascii17he774493b8d7dcf7bE+0xcc>
   6c1f2:	18d1      	adds	r1, r2, r3
   6c1f4:	3310      	adds	r3, #16
   6c1f6:	4563      	cmp	r3, ip
   6c1f8:	f901 0acf 	vst1.64	{d0-d1}, [r1]
   6c1fc:	d9ed      	bls.n	6c1da <_ZN11encoding_rs3mem19copy_ascii_to_ascii17he774493b8d7dcf7bE+0x7a>
   6c1fe:	e015      	b.n	6c22c <_ZN11encoding_rs3mem19copy_ascii_to_ascii17he774493b8d7dcf7bE+0xcc>
   6c200:	2300      	movs	r3, #0
   6c202:	18c1      	adds	r1, r0, r3
   6c204:	f921 0acf 	vld1.64	{d0-d1}, [r1]
   6c208:	ef89 2050 	vshr.s8	q1, q0, #7
   6c20c:	ff02 2a03 	vpmax.u8	d2, d2, d3
   6c210:	ff02 2a00 	vpmax.u8	d2, d2, d0
   6c214:	ee12 1b10 	vmov.32	r1, d2[0]
   6c218:	b941      	cbnz	r1, 6c22c <_ZN11encoding_rs3mem19copy_ascii_to_ascii17he774493b8d7dcf7bE+0xcc>
   6c21a:	18d1      	adds	r1, r2, r3
   6c21c:	3310      	adds	r3, #16
   6c21e:	4563      	cmp	r3, ip
   6c220:	f901 0acf 	vst1.64	{d0-d1}, [r1]
   6c224:	d9ed      	bls.n	6c202 <_ZN11encoding_rs3mem19copy_ascii_to_ascii17he774493b8d7dcf7bE+0xa2>
   6c226:	e001      	b.n	6c22c <_ZN11encoding_rs3mem19copy_ascii_to_ascii17he774493b8d7dcf7bE+0xcc>
   6c228:	54d1      	strb	r1, [r2, r3]
   6c22a:	3301      	adds	r3, #1
   6c22c:	4573      	cmp	r3, lr
   6c22e:	d2bc      	bcs.n	6c1aa <_ZN11encoding_rs3mem19copy_ascii_to_ascii17he774493b8d7dcf7bE+0x4a>
   6c230:	56c1      	ldrsb	r1, [r0, r3]
   6c232:	2900      	cmp	r1, #0
   6c234:	daf8      	bge.n	6c228 <_ZN11encoding_rs3mem19copy_ascii_to_ascii17he774493b8d7dcf7bE+0xc8>
   6c236:	469e      	mov	lr, r3
   6c238:	4670      	mov	r0, lr
   6c23a:	bd80      	pop	{r7, pc}
   6c23c:	4803      	ldr	r0, [pc, #12]	; (6c24c <_ZN11encoding_rs3mem19copy_ascii_to_ascii17he774493b8d7dcf7bE+0xec>)
   6c23e:	2130      	movs	r1, #48	; 0x30
   6c240:	4a03      	ldr	r2, [pc, #12]	; (6c250 <_ZN11encoding_rs3mem19copy_ascii_to_ascii17he774493b8d7dcf7bE+0xf0>)
   6c242:	4478      	add	r0, pc
   6c244:	447a      	add	r2, pc
   6c246:	f7ff fefb 	bl	6c040 <_ZN3std9panicking11begin_panic17hb6db914fa10d35c1E>
   6c24a:	defe      	udf	#254	; 0xfe
   6c24c:	009c918c 	.word	0x009c918c
   6c250:	009f3988 	.word	0x009f3988

packed_simd:

00056314 <_ZN11encoding_rs3mem19copy_ascii_to_ascii17h490246fe632404a1E>:
   56314:	b570      	push	{r4, r5, r6, lr}
   56316:	428b      	cmp	r3, r1
   56318:	f0c0 8082 	bcc.w	56420 <_ZN11encoding_rs3mem19copy_ascii_to_ascii17h490246fe632404a1E+0x10c>
   5631c:	468e      	mov	lr, r1
   5631e:	2910      	cmp	r1, #16
   56320:	d320      	bcc.n	56364 <_ZN11encoding_rs3mem19copy_ascii_to_ascii17h490246fe632404a1E+0x50>
   56322:	f002 030f 	and.w	r3, r2, #15
   56326:	f1ae 0c10 	sub.w	ip, lr, #16
   5632a:	0701      	lsls	r1, r0, #28
   5632c:	d01f      	beq.n	5636e <_ZN11encoding_rs3mem19copy_ascii_to_ascii17h490246fe632404a1E+0x5a>
   5632e:	b3cb      	cbz	r3, 563a4 <_ZN11encoding_rs3mem19copy_ascii_to_ascii17h490246fe632404a1E+0x90>
   56330:	2300      	movs	r3, #0
   56332:	18c1      	adds	r1, r0, r3
   56334:	f961 0a0f 	vld1.8	{d16-d17}, [r1]
   56338:	efc9 2070 	vshr.s8	q9, q8, #7
   5633c:	ee33 1b90 	vmov.32	r1, d19[1]
   56340:	ee32 4b90 	vmov.32	r4, d18[1]
   56344:	ee13 5b90 	vmov.32	r5, d19[0]
   56348:	ee12 6b90 	vmov.32	r6, d18[0]
   5634c:	4321      	orrs	r1, r4
   5634e:	ea46 0405 	orr.w	r4, r6, r5
   56352:	4321      	orrs	r1, r4
   56354:	d15c      	bne.n	56410 <_ZN11encoding_rs3mem19copy_ascii_to_ascii17h490246fe632404a1E+0xfc>
   56356:	18d1      	adds	r1, r2, r3
   56358:	3310      	adds	r3, #16
   5635a:	4563      	cmp	r3, ip
   5635c:	f941 0a0f 	vst1.8	{d16-d17}, [r1]
   56360:	d9e7      	bls.n	56332 <_ZN11encoding_rs3mem19copy_ascii_to_ascii17h490246fe632404a1E+0x1e>
   56362:	e055      	b.n	56410 <_ZN11encoding_rs3mem19copy_ascii_to_ascii17h490246fe632404a1E+0xfc>
   56364:	2300      	movs	r3, #0
   56366:	4573      	cmp	r3, lr
   56368:	d354      	bcc.n	56414 <_ZN11encoding_rs3mem19copy_ascii_to_ascii17h490246fe632404a1E+0x100>
   5636a:	4670      	mov	r0, lr
   5636c:	bd70      	pop	{r4, r5, r6, pc}
   5636e:	b39b      	cbz	r3, 563d8 <_ZN11encoding_rs3mem19copy_ascii_to_ascii17h490246fe632404a1E+0xc4>
   56370:	2300      	movs	r3, #0
   56372:	18c1      	adds	r1, r0, r3
   56374:	f961 0acf 	vld1.64	{d16-d17}, [r1]
   56378:	efc9 2070 	vshr.s8	q9, q8, #7
   5637c:	ee33 1b90 	vmov.32	r1, d19[1]
   56380:	ee32 4b90 	vmov.32	r4, d18[1]
   56384:	ee13 5b90 	vmov.32	r5, d19[0]
   56388:	ee12 6b90 	vmov.32	r6, d18[0]
   5638c:	4321      	orrs	r1, r4
   5638e:	ea46 0405 	orr.w	r4, r6, r5
   56392:	4321      	orrs	r1, r4
   56394:	d13c      	bne.n	56410 <_ZN11encoding_rs3mem19copy_ascii_to_ascii17h490246fe632404a1E+0xfc>
   56396:	18d1      	adds	r1, r2, r3
   56398:	3310      	adds	r3, #16
   5639a:	4563      	cmp	r3, ip
   5639c:	f941 0a0f 	vst1.8	{d16-d17}, [r1]
   563a0:	d9e7      	bls.n	56372 <_ZN11encoding_rs3mem19copy_ascii_to_ascii17h490246fe632404a1E+0x5e>
   563a2:	e035      	b.n	56410 <_ZN11encoding_rs3mem19copy_ascii_to_ascii17h490246fe632404a1E+0xfc>
   563a4:	2300      	movs	r3, #0
   563a6:	18c1      	adds	r1, r0, r3
   563a8:	f961 0a0f 	vld1.8	{d16-d17}, [r1]
   563ac:	efc9 2070 	vshr.s8	q9, q8, #7
   563b0:	ee33 1b90 	vmov.32	r1, d19[1]
   563b4:	ee32 4b90 	vmov.32	r4, d18[1]
   563b8:	ee13 5b90 	vmov.32	r5, d19[0]
   563bc:	ee12 6b90 	vmov.32	r6, d18[0]
   563c0:	4321      	orrs	r1, r4
   563c2:	ea46 0405 	orr.w	r4, r6, r5
   563c6:	4321      	orrs	r1, r4
   563c8:	d122      	bne.n	56410 <_ZN11encoding_rs3mem19copy_ascii_to_ascii17h490246fe632404a1E+0xfc>
   563ca:	18d1      	adds	r1, r2, r3
   563cc:	3310      	adds	r3, #16
   563ce:	4563      	cmp	r3, ip
   563d0:	f941 0acf 	vst1.64	{d16-d17}, [r1]
   563d4:	d9e7      	bls.n	563a6 <_ZN11encoding_rs3mem19copy_ascii_to_ascii17h490246fe632404a1E+0x92>
   563d6:	e01b      	b.n	56410 <_ZN11encoding_rs3mem19copy_ascii_to_ascii17h490246fe632404a1E+0xfc>
   563d8:	2300      	movs	r3, #0
   563da:	18c1      	adds	r1, r0, r3
   563dc:	f961 0acf 	vld1.64	{d16-d17}, [r1]
   563e0:	efc9 2070 	vshr.s8	q9, q8, #7
   563e4:	ee33 1b90 	vmov.32	r1, d19[1]
   563e8:	ee32 4b90 	vmov.32	r4, d18[1]
   563ec:	ee13 5b90 	vmov.32	r5, d19[0]
   563f0:	ee12 6b90 	vmov.32	r6, d18[0]
   563f4:	4321      	orrs	r1, r4
   563f6:	ea46 0405 	orr.w	r4, r6, r5
   563fa:	4321      	orrs	r1, r4
   563fc:	d108      	bne.n	56410 <_ZN11encoding_rs3mem19copy_ascii_to_ascii17h490246fe632404a1E+0xfc>
   563fe:	18d1      	adds	r1, r2, r3
   56400:	3310      	adds	r3, #16
   56402:	4563      	cmp	r3, ip
   56404:	f941 0acf 	vst1.64	{d16-d17}, [r1]
   56408:	d9e7      	bls.n	563da <_ZN11encoding_rs3mem19copy_ascii_to_ascii17h490246fe632404a1E+0xc6>
   5640a:	e001      	b.n	56410 <_ZN11encoding_rs3mem19copy_ascii_to_ascii17h490246fe632404a1E+0xfc>
   5640c:	54d1      	strb	r1, [r2, r3]
   5640e:	3301      	adds	r3, #1
   56410:	4573      	cmp	r3, lr
   56412:	d2aa      	bcs.n	5636a <_ZN11encoding_rs3mem19copy_ascii_to_ascii17h490246fe632404a1E+0x56>
   56414:	56c1      	ldrsb	r1, [r0, r3]
   56416:	2900      	cmp	r1, #0
   56418:	daf8      	bge.n	5640c <_ZN11encoding_rs3mem19copy_ascii_to_ascii17h490246fe632404a1E+0xf8>
   5641a:	469e      	mov	lr, r3
   5641c:	4670      	mov	r0, lr
   5641e:	bd70      	pop	{r4, r5, r6, pc}
   56420:	4803      	ldr	r0, [pc, #12]	; (56430 <_ZN11encoding_rs3mem19copy_ascii_to_ascii17h490246fe632404a1E+0x11c>)
   56422:	2130      	movs	r1, #48	; 0x30
   56424:	4a03      	ldr	r2, [pc, #12]	; (56434 <_ZN11encoding_rs3mem19copy_ascii_to_ascii17h490246fe632404a1E+0x120>)
   56426:	4478      	add	r0, pc
   56428:	447a      	add	r2, pc
   5642a:	f000 f9bb 	bl	567a4 <_ZN3std9panicking11begin_panic17hd61fceca69156f6cE>
   5642e:	defe      	udf	#254	; 0xfe
   56430:	009a3b71 	.word	0x009a3b71
   56434:	009ebd14 	.word	0x009ebd14

The very first observation: packed_simd generates more instructions.

@hsivonen
Copy link
Owner

hsivonen commented Feb 5, 2019

OK, so the horizontal reductions generate worse code under packed_simd.

simd:

   6c17e:	f921 0a0f 	vld1.8	{d0-d1}, [r1]
   6c182:	ef89 2050 	vshr.s8	q1, q0, #7
   6c186:	ff02 2a03 	vpmax.u8	d2, d2, d3
   6c18a:	ff02 2a00 	vpmax.u8	d2, d2, d0
   6c18e:	ee12 1b10 	vmov.32	r1, d2[0]
   6c192:	2900      	cmp	r1, #0

packed_simd:

   56334:	f961 0a0f 	vld1.8	{d16-d17}, [r1]
   56338:	efc9 2070 	vshr.s8	q9, q8, #7
   5633c:	ee33 1b90 	vmov.32	r1, d19[1]
   56340:	ee32 4b90 	vmov.32	r4, d18[1]
   56344:	ee13 5b90 	vmov.32	r5, d19[0]
   56348:	ee12 6b90 	vmov.32	r6, d18[0]
   5634c:	4321      	orrs	r1, r4
   5634e:	ea46 0405 	orr.w	r4, r6, r5
   56352:	4321      	orrs	r1, r4

@hsivonen
Copy link
Owner

hsivonen commented Feb 5, 2019

Filed as a packed_simd issue.

@gnzlbg
Copy link

gnzlbg commented Feb 5, 2019

Are you using the exact same rustc version for the comparisons?

@hsivonen
Copy link
Owner

hsivonen commented Feb 5, 2019

Are you using the exact same rustc version for the comparisons?

No, because there isn't a single Rust version that both 1) compiles simd and 2) has a NEON-enabled stdlib.

@hsivonen
Copy link
Owner

hsivonen commented Feb 7, 2019

This is now fixed. Thank you for your help and patience.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants