
Performance regression in tight loop since rust 1.25 #53340

Open · pedrocr opened this issue Aug 14, 2018 · 34 comments · Fixed by #86823

Labels:
I-slow (Issue: Problems and improvements with respect to performance of generated code.)
regression-from-stable-to-stable (Performance or correctness regression from one stable version to another.)
T-libs (Relevant to the library team, which will review and decide on the PR/issue.)

Comments

pedrocr commented Aug 14, 2018

I've finally gotten around to doing some proper benchmarking of rust versions for my crate:

http://chimper.org/rawloader-rustc-benchmarks/

As can be seen in the graph on that page, there's a general performance improvement over time, but there are some very negative outliers. Most (maybe all) of them seem to be very simple loops that decode packed formats. Since rust 1.25 those have seen 30-40% degradations in performance. I've extracted a minimal test case that shows the issue:

// Decode 12-bit little-endian packed data: every 3 input bytes
// hold two 12-bit output samples.
fn decode_12le(buf: &[u8], width: usize, height: usize) -> Vec<u16> {
  let mut out: Vec<u16> = vec![0; width*height];

  for (row, line) in out.chunks_mut(width).enumerate() {
    let inb = &buf[(row*width*12/8)..];

    for (o, i) in line.chunks_mut(2).zip(inb.chunks(3)) {
      let g1: u16 = i[0] as u16;
      let g2: u16 = i[1] as u16;
      let g3: u16 = i[2] as u16;

      o[0] = ((g2 & 0x0f) << 8) | g1;
      o[1] = (g3 << 4) | (g2 >> 4);
    }
  }
  out
}

fn main() {
  let width = 5000;
  let height = 4000;

  let buffer: Vec<u8> = vec![0; width*height*12/8];
  
  for _ in 0..100 {
    decode_12le(&buffer, width, height);
  }
}
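
For reference, here's a quick sanity check of the bit layout (a sketch, not part of the benchmark): every 3 input bytes hold two 12-bit little-endian samples.

fn check_packing() {
  let i = [0xABu8, 0xCD, 0xEF];
  let (g1, g2, g3) = (i[0] as u16, i[1] as u16, i[2] as u16);
  assert_eq!(((g2 & 0x0f) << 8) | g1, 0x0DAB); // low nibble of i[1] + i[0]
  assert_eq!((g3 << 4) | (g2 >> 4), 0x0EFC);   // i[2] + high nibble of i[1]
}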

Here's a test run on my machine:

$ rustc +1.24.0 -C opt-level=3 bench_decode.rs 
$ time ./bench_decode 

real	0m4.817s
user	0m3.581s
sys	0m1.236s
$ rustc +1.25.0 -C opt-level=3 bench_decode.rs 
$ time ./bench_decode 

real	0m6.263s
user	0m5.067s
sys	0m1.196s
pedrocr (Author) commented Aug 14, 2018

godbolt shows quite a big diff between 1.24 and 1.25:

https://godbolt.org/g/fbpuHp

sanxiyn added the I-slow label on Aug 17, 2018
pedrocr (Author) commented Oct 27, 2018

I've scripted the checking of this across versions and the regression is present all the way through to nightly:

packed12le : 1.24.1  BASE 3.53
packed12le : 1.25.0  FAIL 4.84 (+37%)
packed12le : 1.26.2  FAIL 4.80 (+35%)
packed12le : 1.27.2  FAIL 4.81 (+36%)
packed12le : 1.28.0  FAIL 4.87 (+37%)
packed12le : 1.29.2  FAIL 4.77 (+35%)
packed12le : 1.30.0  FAIL 4.83 (+36%)
packed12le : beta    FAIL 4.83 (+36%)
packed12le : nightly FAIL 4.95 (+40%)

The 35-40% increase in runtime is very consistent.

pedrocr (Author) commented Dec 15, 2018

The regression is still present in 1.31 and all the way to current nightly:

packed12le : 1.24.1  BASE 3.51
packed12le : 1.25.0  FAIL 4.84 (+37%)
packed12le : 1.26.2  FAIL 4.83 (+37%)
packed12le : 1.27.2  FAIL 4.80 (+36%)
packed12le : 1.28.0  FAIL 4.86 (+38%)
packed12le : 1.29.2  FAIL 4.98 (+41%)
packed12le : 1.30.1  FAIL 5.00 (+42%)
packed12le : 1.31.0  FAIL 4.90 (+39%)
packed12le : beta    FAIL 4.88 (+39%)
packed12le : nightly FAIL 4.92 (+40%)

The same ~40% regression seen in the minimal test case is also seen in the full benchmark:

http://chimper.org/rawloader-rustc-benchmarks/version-1.31.0.html
(see the bottom of the page)

bluss (Member) commented Dec 19, 2018

@pedrocr Thanks for the careful benchmarks. I'd suspect that the zip specialization for chunks_mut is causing this; it was introduced between 1.24 and 1.25 in PR #47142.

It's not far-fetched that it's not actually an optimization for these iterators, and that the implementation should be revisited.

bluss (Member) commented Dec 19, 2018

What's the performance if you compare this version with something based on the newer chunks_exact/_mut?

pedrocr (Author) commented Dec 19, 2018

Using chunks_exact_mut, the regression is completely reversed and turns into a ~10% improvement:

packed12le : 1.24.1  BASE 3.53
packed12le : 1.25.0  FAIL 4.93 (+39%)
packed12le : 1.26.2  FAIL 4.82 (+36%)
packed12le : 1.27.2  FAIL 4.85 (+37%)
packed12le : 1.28.0  FAIL 4.93 (+39%)
packed12le : 1.29.2  FAIL 4.94 (+39%)
packed12le : 1.30.1  FAIL 5.03 (+42%)
packed12le : 1.31.0  OK   3.19 (-9%)
packed12le : beta    OK   3.18 (-9%)
packed12le : nightly OK   3.08 (-12%)

Starting with 1.31 (when chunks_exact_mut was stabilized), the benchmark uses it instead of chunks_mut. At some point I'll probably make 1.31 (and Rust 2018) the new minimum required version so I can use this. But it's probably still worth fixing this regression, since the exact versions are not always usable.

nikic added the T-libs-api label on Dec 19, 2018
bluss (Member) commented Dec 21, 2018

@pedrocr just to clarify, did you update all occurrences of chunks/_mut to be the exact version?

pedrocr (Author) commented Dec 21, 2018

@bluss Both the inner and outer loop. Here's the full code:

fn decode_12le(buf: &[u8], width: usize, height: usize) -> Vec<u16> {
  let mut out: Vec<u16> = vec![0; width*height];

  for (row, line) in out.chunks_exact_mut(width).enumerate() {
    let inb = &buf[(row*width*12/8)..];

    for (o, i) in line.chunks_exact_mut(2).zip(inb.chunks(3)) {
      let g1: u16 = i[0] as u16;
      let g2: u16 = i[1] as u16;
      let g3: u16 = i[2] as u16;

      o[0] = ((g2 & 0x0f) << 8) | g1;
      o[1] = (g3 << 4) | (g2 >> 4);
    }
  }
  out
}

fn main() {
  let width = 5000;
  let height = 4000;

  let mut buffer: Vec<u8> = vec![0; width*height*12/8];
  // Make sure we don't get optimized out by writing some data into the buffer
  for (i, val) in buffer.chunks_mut(1).enumerate() {
    val[0] = i as u8;
  }
  
  for _ in 0..100 {
    decode_12le(&buffer, width, height);
  }
}

I've also initialized the buffer with some data so the work doesn't get optimized out. I think that became necessary after one of the LLVM upgrades since 1.25.

bluss (Member) commented Dec 21, 2018

There's still a chunks in there; why not try the exact version for that one too?

pedrocr (Author) commented Dec 21, 2018

I did, and it doesn't make a difference; that's just the initialization, which doesn't take much time. I left it alone because it's not in the code that's actually being benchmarked.

bluss (Member) commented Dec 21, 2018

Maybe change the chunks to chunks_exact on this line; I'm just curious:

for (o, i) in line.chunks_exact_mut(2).zip(inb.chunks(3)) {
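
That is, presumably:

for (o, i) in line.chunks_exact_mut(2).zip(inb.chunks_exact(3)) {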

pedrocr (Author) commented Dec 21, 2018

Ah, right, missed that one. I'll check.

pedrocr (Author) commented Dec 21, 2018

It either works extremely well or somehow LLVM figured out how to optimize away too much:

packed12le : 1.24.1  BASE 3.46
packed12le : 1.25.0  FAIL 5.00 (+44%)
packed12le : 1.26.2  FAIL 4.87 (+40%)
packed12le : 1.27.2  FAIL 4.91 (+41%)
packed12le : 1.28.0  FAIL 4.83 (+39%)
packed12le : 1.29.2  FAIL 4.99 (+44%)
packed12le : 1.30.1  FAIL 5.01 (+44%)
packed12le : 1.31.0  OK   1.80 (-47%)
packed12le : beta    OK   1.80 (-47%)
packed12le : nightly OK   1.83 (-47%)

bluss (Member) commented Dec 21, 2018

That's exactly what we want :)

bluss (Member) commented Dec 21, 2018

The exact-chunks code looks OK in godbolt. Nothing spectacular, just a clean inner loop with no redundant bounds checks and a single loop-exit conditional.

A minor, boring trick for the old code is to change the order to this:

      let g3: u16 = i[2] as u16;
      let g2: u16 = i[1] as u16;
      let g1: u16 = i[0] as u16;

      o[1] = (g3 << 4) | (g2 >> 4);
      o[0] = ((g2 & 0x0f) << 8) | g1;

With the bounds check at i[2] done first, the other bounds checks on i become redundant. But the exact-chunks versions are much better: no bounds checks at all. I'm not sure the bounds checks are the biggest drag, though: the old loop also has a jumble of conditional moves that compute each slice's length.

pedrocr (Author) commented Dec 21, 2018

The 10% improvement was already interesting, but the 40% one definitely makes me want to use this. I have almost 50 chunks()/chunks_mut() calls like these in rawloader, so I'll definitely be benchmarking that. If I remember correctly the original C++ code did have better performance on these kinds of very simple formats, so this probably closes one of the few performance gaps I saw when writing the rust code. I just need to figure out what to do about older versions of rust; maybe have a dummy implementation that falls back to the non-exact versions (see the sketch at the end of this comment).

The bounds-check trick is interesting, but it's a little disappointing that the compiler doesn't figure that out itself. I've intentionally kept the code clean instead of trying to make it fast by being clever, and that has paid off well in terms of productivity and maintainability.

I don't think this closes the regression itself though, or does it? I assume chunks_exact() doesn't always fit, so there's still performance to be gained in the other cases.
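
For the record, a minimal sketch of that fallback idea (hypothetical: it assumes a build script, e.g. using the rustc_version crate, that sets a has_chunks_exact cfg on Rust >= 1.31):

// Hypothetical shim: forward to chunks_exact_mut when available,
// fall back to plain chunks_mut on older compilers.
#[cfg(has_chunks_exact)]
fn pairs_mut<'a>(line: &'a mut [u16]) -> std::slice::ChunksExactMut<'a, u16> {
  line.chunks_exact_mut(2)
}

#[cfg(not(has_chunks_exact))]
fn pairs_mut<'a>(line: &'a mut [u16]) -> std::slice::ChunksMut<'a, u16> {
  line.chunks_mut(2)
}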

bluss (Member) commented Dec 21, 2018

I agree. It looks like we should fix the performance of chunks().zip(chunks()) for any combination of chunks/chunks_mut in order to close this regression.

sfackler (Member) commented:

> The bounds check trick is interesting but a little disappointing that the compiler doesn't figure that out itself.

It can't make that change itself since it would change the visible behavior of the program: the panic message includes the index.
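
To spell that out (an illustrative sketch, not from the thread): with a too-short slice the two orderings panic at different indices, and the message names the index.

fn read(i: &[u16]) -> (u16, u16) {
  // Original order: a 1-element slice panics here, reporting
  // "index out of bounds: the len is 1 but the index is 1".
  let g2 = i[1];
  // Hoisting this check above would make the same slice panic
  // naming index 2 instead, an observable behavior change.
  let g3 = i[2];
  (g2, g3)
}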

pedrocr (Author) commented Dec 21, 2018

@sfackler ah, that's annoying but makes perfect sense, thanks.

pedrocr (Author) commented Jan 16, 2019

1.32 maintains the same regression:

packed12le : 1.24.1  BASE 3.48
packed12le : 1.25.0  FAIL 4.93 (+41%)
packed12le : 1.26.2  FAIL 4.87 (+39%)
packed12le : 1.27.2  FAIL 4.87 (+39%)
packed12le : 1.28.0  FAIL 4.89 (+40%)
packed12le : 1.29.2  FAIL 4.99 (+43%)
packed12le : 1.30.1  FAIL 5.00 (+43%)
packed12le : 1.31.1  FAIL 4.95 (+42%)
packed12le : 1.32.0  FAIL 4.90 (+40%)
packed12le : beta    FAIL 4.91 (+41%)
packed12le : nightly FAIL 4.88 (+40%)

The chunks_exact version is still very fast.

pedrocr (Author) commented Feb 25, 2019

1.33 maintains the same regression:

packed12le : 1.24.1  BASE 3.77
packed12le : 1.25.0  FAIL 5.21 (+38%)
packed12le : 1.26.2  FAIL 5.15 (+36%)
packed12le : 1.27.2  FAIL 5.13 (+36%)
packed12le : 1.28.0  FAIL 5.11 (+35%)
packed12le : 1.29.2  FAIL 5.25 (+39%)
packed12le : 1.30.1  FAIL 5.25 (+39%)
packed12le : 1.31.1  FAIL 5.16 (+36%)
packed12le : 1.32.0  FAIL 5.13 (+36%)
packed12le : 1.33.0  FAIL 5.12 (+35%)
packed12le : beta    FAIL 5.17 (+37%)
packed12le : nightly FAIL 5.19 (+37%)

jonas-schievink added the regression-from-stable-to-stable label on Mar 28, 2019
pedrocr (Author) commented May 21, 2019

I don't know if this is useful, but 1.34/1.35 and the current beta/nightly still have the same regression:

packed12le : 1.24.1  BASE 3.59
packed12le : 1.25.0  FAIL 4.97 (+38%)
packed12le : 1.26.2  FAIL 4.89 (+36%)
packed12le : 1.27.2  FAIL 4.98 (+38%)
packed12le : 1.28.0  FAIL 4.99 (+38%)
packed12le : 1.29.2  FAIL 5.15 (+43%)
packed12le : 1.30.1  FAIL 4.95 (+37%)
packed12le : 1.31.1  FAIL 5.03 (+40%)
packed12le : 1.32.0  FAIL 5.02 (+39%)
packed12le : 1.33.0  FAIL 5.10 (+42%)
packed12le : 1.34.2  FAIL 5.08 (+41%)
packed12le : 1.35.0  FAIL 5.01 (+39%)
packed12le : beta    FAIL 5.06 (+40%)
packed12le : nightly FAIL 5.16 (+43%)

pedrocr (Author) commented Sep 27, 2019

1.38 recovers roughly half this regression:

packed12le : 1.24.1  BASE 3.67
packed12le : 1.25.0  FAIL 5.09 (+38%)
packed12le : 1.26.2  FAIL 5.19 (+41%)
packed12le : 1.27.2  FAIL 5.10 (+38%)
packed12le : 1.28.0  FAIL 5.15 (+40%)
packed12le : 1.29.2  FAIL 5.24 (+42%)
packed12le : 1.30.1  FAIL 5.10 (+38%)
packed12le : 1.31.1  FAIL 5.27 (+43%)
packed12le : 1.32.0  FAIL 5.16 (+40%)
packed12le : 1.33.0  FAIL 5.19 (+41%)
packed12le : 1.34.2  FAIL 5.35 (+45%)
packed12le : 1.35.0  FAIL 5.18 (+41%)
packed12le : 1.36.0  FAIL 5.20 (+41%)
packed12le : 1.37.0  FAIL 5.11 (+39%)
packed12le : 1.38.0  FAIL 4.50 (+22%)
packed12le : beta    FAIL 4.50 (+22%)
packed12le : nightly FAIL 4.48 (+22%)

The chunks_exact version is unchanged, so the improvement doesn't seem to come from some unrelated change that makes everything faster.

pedrocr (Author) commented Jun 3, 2021

As an update, the latest rust versions seem to have gotten this down to only a ~4% penalty:

packed12le : 1.24.1  BASE 4.69
packed12le : 1.25.0  FAIL 6.36 (+35%)
packed12le : 1.26.2  FAIL 6.34 (+35%)
packed12le : 1.27.2  FAIL 6.07 (+29%)
packed12le : 1.28.0  FAIL 6.03 (+28%)
packed12le : 1.29.2  FAIL 6.08 (+29%)
packed12le : 1.30.1  FAIL 6.12 (+30%)
packed12le : 1.31.1  FAIL 6.03 (+28%)
packed12le : 1.32.0  FAIL 6.09 (+29%)
packed12le : 1.33.0  FAIL 6.06 (+29%)
packed12le : 1.34.2  FAIL 6.01 (+28%)
packed12le : 1.35.0  FAIL 6.04 (+28%)
packed12le : 1.36.0  FAIL 6.01 (+28%)
packed12le : 1.37.0  FAIL 6.12 (+30%)
packed12le : 1.38.0  FAIL 5.32 (+13%)
packed12le : 1.39.0  FAIL 5.39 (+14%)
packed12le : 1.40.0  FAIL 5.68 (+21%)
packed12le : 1.41.1  FAIL 5.75 (+22%)
packed12le : 1.42.0  FAIL 5.27 (+12%)
packed12le : 1.43.1  FAIL 5.34 (+13%)
packed12le : 1.44.1  FAIL 5.32 (+13%)
packed12le : 1.45.2  FAIL 5.31 (+13%)
packed12le : 1.46.0  FAIL 5.32 (+13%)
packed12le : 1.47.0  FAIL 5.28 (+12%)
packed12le : 1.48.0  FAIL 4.95 (+5%)
packed12le : 1.49.0  FAIL 5.35 (+14%)
packed12le : 1.50.0  FAIL 4.93 (+5%)
packed12le : 1.51.0  FAIL 4.93 (+5%)
packed12le : 1.52.1  FAIL 4.90 (+4%)
packed12le : beta    FAIL 4.92 (+4%)
packed12le : nightly FAIL 4.88 (+4%)

m-ou-se added the T-libs label and removed the T-libs-api label on Jun 23, 2021
bstrie (Contributor) commented Jul 3, 2021

You say the modern versions have almost closed the gap in performance; can you look at the modern assembly and see what still might be worse than 1.24, and what has improved since 1.25?

pedrocr (Author) commented Jul 3, 2021

My knowledge of x86 assembly is rudimentary, sorry.

the8472 (Member) commented Jul 3, 2021

I have a potential fix in #86823. If you want to benchmark it, you can grab the try build for 5c392fe307a7b9c6ca1d328ad7dbed69fb03897d.

pedrocr (Author) commented Jul 5, 2021

Does the build system save its artifacts anywhere? I don't think I can currently build rustc on this machine.

the8472 (Member) commented Jul 5, 2021

You can use https://github.com/kennytm/rustup-toolchain-install-master to install CI builds as rustup toolchains.
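
For reference, usage is roughly this (from memory; check the tool's README for the exact invocation):

$ cargo install rustup-toolchain-install-master
$ rustup-toolchain-install-master 5c392fe307a7b9c6ca1d328ad7dbed69fb03897d
$ rustc +5c392fe307a7b9c6ca1d328ad7dbed69fb03897d -C opt-level=3 bench_decode.rs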

pedrocr (Author) commented Jul 6, 2021

That tool worked really well: it gave me an installed toolchain, and then the benchmarking automation just worked. Here's the result of running that branch compared to the other recent toolchains:

packed12le : 1.24.1  BASE 4.59
packed12le : 1.52.1  FAIL 5.00 (+8%)
packed12le : 1.53.0  FAIL 4.91 (+6%)
packed12le : beta    FAIL 4.91 (+6%)
packed12le : nightly FAIL 5.06 (+10%)
packed12le : 5c392fe OK   4.21 (-8%)

Hopefully the patch doesn't have any soundness issues, because it seems to fix things completely: the same code now becomes ~8% faster than the 1.24 baseline.

pedrocr (Author) commented Dec 8, 2022

Just to confirm this, here are the full results:

packed12le : 1.24.1  BASE 4.50
packed12le : 1.25.0  FAIL 6.39 (+41%)
packed12le : 1.26.2  FAIL 6.31 (+40%)
packed12le : 1.27.2  FAIL 6.01 (+33%)
packed12le : 1.28.0  FAIL 5.95 (+32%)
packed12le : 1.29.2  FAIL 5.95 (+32%)
packed12le : 1.30.1  FAIL 6.07 (+34%)
packed12le : 1.31.1  FAIL 6.01 (+33%)
packed12le : 1.32.0  FAIL 5.99 (+33%)
packed12le : 1.33.0  FAIL 5.97 (+32%)
packed12le : 1.34.2  FAIL 6.02 (+33%)
packed12le : 1.35.0  FAIL 5.98 (+32%)
packed12le : 1.36.0  FAIL 6.05 (+34%)
packed12le : 1.37.0  FAIL 6.01 (+33%)
packed12le : 1.38.0  FAIL 5.22 (+15%)
packed12le : 1.39.0  FAIL 5.30 (+17%)
packed12le : 1.40.0  FAIL 5.60 (+24%)
packed12le : 1.41.1  FAIL 5.69 (+26%)
packed12le : 1.42.0  FAIL 5.29 (+17%)
packed12le : 1.43.1  FAIL 5.22 (+15%)
packed12le : 1.44.1  FAIL 5.31 (+17%)
packed12le : 1.45.2  FAIL 5.23 (+16%)
packed12le : 1.46.0  FAIL 5.22 (+15%)
packed12le : 1.47.0  FAIL 5.21 (+15%)
packed12le : 1.48.0  FAIL 4.91 (+9%)
packed12le : 1.49.0  FAIL 5.18 (+15%)
packed12le : 1.50.0  FAIL 4.87 (+8%)
packed12le : 1.51.0  FAIL 4.86 (+8%)
packed12le : 1.52.1  FAIL 4.84 (+7%)
packed12le : 1.53.0  FAIL 4.85 (+7%)
packed12le : 1.54.0  FAIL 4.88 (+8%)
packed12le : 1.55.0  OK   4.14 (-8%)
packed12le : 1.56.1  OK   4.15 (-7%)
packed12le : 1.57.0  OK   3.47 (-22%)
packed12le : 1.58.1  OK   4.16 (-7%)
packed12le : 1.59.0  OK   4.05 (-10%)
packed12le : 1.60.0  OK   3.42 (-24%)
packed12le : 1.61.0  OK   4.16 (-7%)
packed12le : 1.62.1  OK   3.42 (-24%)
packed12le : 1.63.0  OK   4.16 (-7%)
packed12le : 1.64.0  OK   4.09 (-9%)
packed12le : 1.65.0  OK   4.49 (-0%)
packed12le : beta    OK   4.50 (+0%)
packed12le : nightly OK   4.50 (+0%)

Some recent versions actually reached 20%+ performance improvements, but now we're back to 0. Possibly there's a separate performance improvement/regression going on there.

the8472 reopened this on Dec 8, 2022
the8472 self-assigned this on Dec 8, 2022
scottmcm (Member) commented Jan 5, 2023

This might be related to whether it auto-vectorizes.

This version gets a nice vectorized loop: https://rust.godbolt.org/z/TMKvzozjv

  %58 = zext <4 x i8> %57 to <4 x i16>, !dbg !283
  %59 = shl nuw <4 x i16> %45, <i16 8, i16 8, i16 8, i16 8>, !dbg !314
  %60 = and <4 x i16> %59, <i16 3840, i16 3840, i16 3840, i16 3840>, !dbg !314
  %61 = or <4 x i16> %60, %32, !dbg !316
  %62 = shl nuw nsw <4 x i16> %58, <i16 4, i16 4, i16 4, i16 4>, !dbg !317
  %63 = lshr <4 x i16> %45, <i16 4, i16 4, i16 4, i16 4>, !dbg !318
  %64 = or <4 x i16> %62, %63, !dbg !319

So that would plausibly be a reason for the 40% improvement mentioned in #53340 (comment).

But peeling the last iteration from the non-exact version is probably hard, so the auto-vectorization is lost when the exact variants aren't used.

pedrocr (Author) commented Mar 29, 2024

This can probably be closed. The regression seems definitively solved since 1.55, and performance now hovers around a 10-20% improvement over the 1.24 baseline, depending on the version:

packed12le : 1.24.1  BASE 1.58
packed12le : 1.25.0  FAIL 2.12 (+34%)
packed12le : 1.26.2  FAIL 2.06 (+30%)
packed12le : 1.27.2  FAIL 2.07 (+31%)
packed12le : 1.28.0  FAIL 2.08 (+31%)
packed12le : 1.29.2  FAIL 2.08 (+31%)
packed12le : 1.30.1  FAIL 2.13 (+34%)
packed12le : 1.31.1  FAIL 2.17 (+37%)
packed12le : 1.32.0  FAIL 2.14 (+35%)
packed12le : 1.33.0  FAIL 2.11 (+33%)
packed12le : 1.34.2  FAIL 2.13 (+34%)
packed12le : 1.35.0  FAIL 2.11 (+33%)
packed12le : 1.36.0  FAIL 2.12 (+34%)
packed12le : 1.37.0  FAIL 2.10 (+32%)
packed12le : 1.38.0  FAIL 1.81 (+14%)
packed12le : 1.39.0  FAIL 1.79 (+13%)
packed12le : 1.40.0  FAIL 1.81 (+14%)
packed12le : 1.41.1  FAIL 1.86 (+17%)
packed12le : 1.42.0  FAIL 1.81 (+14%)
packed12le : 1.43.1  FAIL 1.81 (+14%)
packed12le : 1.44.1  FAIL 1.83 (+15%)
packed12le : 1.45.2  FAIL 1.80 (+13%)
packed12le : 1.46.0  FAIL 1.77 (+12%)
packed12le : 1.47.0  FAIL 1.72 (+8%)
packed12le : 1.48.0  FAIL 1.69 (+6%)
packed12le : 1.49.0  FAIL 1.68 (+6%)
packed12le : 1.50.0  FAIL 1.69 (+6%)
packed12le : 1.51.0  FAIL 1.73 (+9%)
packed12le : 1.52.1  FAIL 1.67 (+5%)
packed12le : 1.53.0  FAIL 1.66 (+5%)
packed12le : 1.54.0  FAIL 1.63 (+3%)
packed12le : 1.55.0  OK   1.43 (-9%)
packed12le : 1.56.1  OK   1.38 (-12%)
packed12le : 1.57.0  OK   1.37 (-13%)
packed12le : 1.58.1  OK   1.32 (-16%)
packed12le : 1.59.0  OK   1.30 (-17%)
packed12le : 1.60.0  OK   1.40 (-11%)
packed12le : 1.61.0  OK   1.40 (-11%)
packed12le : 1.62.1  OK   1.37 (-13%)
packed12le : 1.63.0  OK   1.34 (-15%)
packed12le : 1.64.0  OK   1.42 (-10%)
packed12le : 1.65.0  OK   1.42 (-10%)
packed12le : 1.66.0  OK   1.46 (-7%)
packed12le : 1.67.1  OK   1.41 (-10%)
packed12le : 1.68.2  OK   1.42 (-10%)
packed12le : 1.69.0  OK   1.43 (-9%)
packed12le : 1.70.0  OK   1.46 (-7%)
packed12le : 1.71.1  OK   1.45 (-8%)
packed12le : 1.72.1  OK   1.46 (-7%)
packed12le : 1.73.0  OK   1.41 (-10%)
packed12le : 1.74.1  OK   1.42 (-10%)
packed12le : 1.75.0  OK   1.44 (-8%)
packed12le : 1.76.0  OK   1.44 (-8%)
packed12le : 1.77.0  OK   1.42 (-10%)
packed12le : beta    OK   1.39 (-12%)
packed12le : nightly OK   1.24 (-21%)

the8472 (Member) commented Mar 29, 2024

Nice. Adding a codegen test might be useful, though, since this seems like a fickle optimization.
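
A rough sketch of what such a test could look like (hypothetical file name and CHECK patterns; they'd need tuning against the real LLVM output):

// tests/codegen/issue-53340-chunks-zip.rs (hypothetical)
// compile-flags: -O

#![crate_type = "lib"]

// CHECK-LABEL: @unpack_12le
#[no_mangle]
pub fn unpack_12le(line: &mut [u16], inb: &[u8]) {
  for (o, i) in line.chunks_exact_mut(2).zip(inb.chunks_exact(3)) {
    let (g1, g2, g3) = (i[0] as u16, i[1] as u16, i[2] as u16);
    o[0] = ((g2 & 0x0f) << 8) | g1;
    o[1] = (g3 << 4) | (g2 >> 4);
  }
  // The loop should vectorize (cf. the IR above) with no
  // bounds-check panics left:
  // CHECK: zext <{{[0-9]+}} x i8>
  // CHECK-NOT: panic_bounds_check
}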
