Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rust will memmov |self| when passed by value to inline function #42763

Open
bholley opened this issue Jun 20, 2017 · 20 comments
Open

Rust will memmov |self| when passed by value to inline function #42763

bholley opened this issue Jun 20, 2017 · 20 comments
Labels
A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. C-enhancement Category: An issue proposing an enhancement or a PR with one. C-optimization Category: An issue highlighting optimization opportunities or PRs implementing such I-slow Issue: Problems and improvements with respect to performance of generated code. P-medium Medium priority T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@bholley
Copy link
Contributor

bholley commented Jun 20, 2017

See the assembly for https://is.gd/0Owhgw . Note that you'll need to explicitly specify Release, because I can't link to Release code for some reason.

Note this bit here:

movl $400, %edx
movq %rbx, %rsi
callq memcpy@PLT

That looks like a very unnecessary and large copy.

I ran into this working on rust-selectors for stylo. In the parser, we have a very large stack-allocated SmallVec, under the assumption that it is never moved. I added some inline functions that take the buffer as |self|, and parsing got several percent slower. I rewrote the code to pass things as &mut, and the overhead went away.

Interestingly, doing:

for x in some_small_vec.into_iter() {
...
}

Doesn't have any memcpy in the disassembly, despite the fact that into_iter() takes self. Presumably some iterator optimization going on.

CC @SimonSapin @mbrubeck @Manishearth @gankro @jonathandturner

@bholley
Copy link
Contributor Author

bholley commented Jun 20, 2017

(Deleted my last comment, which was intended for somewhere else).

@bholley
Copy link
Contributor Author

bholley commented Jun 20, 2017

Actually, I was wrong. This totally affects SmallVec::into_iter() :-(

Paste the following into play: https://gist.github.com/bholley/69dcf3654dfde0ebd8bb8bd8a7c2386d

I guess using drain() is a reasonable workaround. Really bad footgun though.

@michaelwoerister
Copy link
Member

cc @rust-lang/compiler

@eddyb
Copy link
Member

eddyb commented Jun 21, 2017

cc @pcwalton Is this the same memcpy optimization problem?

@sophiajt
Copy link
Contributor

@eddyb - I'm going to show my ignorance here a bit, but why is this a memcpy optimization? If the function is inlined, shouldn't the variable be reused rather than memcpy'd? When @bholley described it, just sounded like perhaps a bug in the inliner. (though, again, it's been a while since I worked in that area)

@eddyb
Copy link
Member

eddyb commented Jun 21, 2017

@jonathandturner LLVM inlines after rustc generates the copies.

@Gankra
Copy link
Contributor

Gankra commented Jun 21, 2017

It isn't in general safe to remove these copies when an inline happens. For example modifications to the inner function's copy shouldn't be visible to the outer function's copy. Similarly pointers in the inner function may alias the outer function's copy of the value.

Rust has a lot of info that can prove eliminating these copies is safe, but llvm doesn't generally have that information. For instance you can easily argue the copy is unnecessary when T: !Copy, but I don't think llvm has any notion of move semantics or deinitializing memory.

So either we need to feed llvm stronger alias information and/or improve llvm's optimization passes, or we need to make Rust (MIR) do the inlining so it can apply its domain-specific knowledge to elide the copies.

Either way, fixing this is actually pretty non-trivial, afaict.

@sophiajt
Copy link
Contributor

For example modifications to the inner function's copy shouldn't be visible to the outer function's copy.

I'm trying to think of a situation this occurs, seems like Rust protects you:

  • move: you have ownership, so you're free to modify
  • shared borrow: you can't modify
  • mutable borrow: your changed version will be visible to the calling function

Though I guess maybe that's less the case than it seems at first blush?

@eddyb
Copy link
Member

eddyb commented Jun 21, 2017

Doesn't apply to Copy types. We did try to avoid some of the copies for non-Copy types although I'm not clear on how much that helps or if it was even kept in MIR trans.

bholley added a commit to bholley/servo that referenced this issue Jun 22, 2017
This is important because we're about to start storing a parallel list
of pushed hashes, and the current behavior here will cause mismatches
there.

We take the opportunity to bump the SmallVec size to 16, since each
entry is only a word and we really want to avoid heap-allocating. And
then we switch to drain(), because of
rust-lang/rust#42763
@Mark-Simulacrum Mark-Simulacrum added A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. I-slow Issue: Problems and improvements with respect to performance of generated code. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Jun 23, 2017
@brson
Copy link
Contributor

brson commented Jul 5, 2017

This issue and others like it apparently come up quite regularly in stylo and are something of a hindrance to achieving its promised performance. It would be good if the compiler team could think a bit about how to unblock some of the advanced optimizations necessary here. cc @nikomatsakis @rust-lang/compiler

@nikomatsakis
Copy link
Contributor

Hmm. We discussed this for some time in the @rust-lang/compiler team meeting but didn't decide on a precise course of action. It seems really important to make progress here, but we're not sure how to allocate the "person power" to make it happen, and we didn't want to call this P-high without somebody who is planning to follow-up.

One thing that I personally think would be helpful to is to drill a bit more into the details of when memmoves etc occur and if we can categorize the various cases better. Also to get a clearer idea on what sorts of annotations or other changes would help to do better optimization.

cc @arielb1, who might have thoughts on the matter

@Mark-Simulacrum Mark-Simulacrum added the C-enhancement Category: An issue proposing an enhancement or a PR with one. label Jul 26, 2017
@nikomatsakis
Copy link
Contributor

triage: P-medium

@robsmith11
Copy link

Are there any suggestions on ways to catch instances of this issue? I came across a very obvious case where the run-time of my long-running program increased by a factor of 15x because I did something like this in a loop:

let a = if flag { struct.large_array1 } else {struct.large_array2};
f(a[x-1], a[x], a[x+1])

I wasn't expecting a memcpy to occur here, but clearly it was copying the entire array every time I ran this code. Using &s.large_array1 fixed it.

I work with large arrays all over my code, so I'm sure more subtle instances are occurring. This seems like a pretty serious performance issue for anyone working with large Copy types.

@bholley
Copy link
Contributor Author

bholley commented May 5, 2018

You can dump the assembly and then grep for memmov with one line of before-context, which should show you the sizes of all the immediate memmovs in the binary. From there it's pretty straightforward to look for the biggest ones. I did this successfully for Firefox.

@jrmuizel
Copy link
Contributor

FWIW, I put a tool together that helps find these memcpy's: https://github.com/jrmuizel/memcpy-find

@jrmuizel
Copy link
Contributor

It looks like the memcpy in the original test case is removed when running with -Zmir-opt-level=2.

example::main:
        sub     rsp, 472
        mov     qword ptr [rsp + 24], 5
        mov     qword ptr [rsp + 32], 5
        mov     qword ptr [rsp + 40], 5
        mov     qword ptr [rsp + 48], 5
        mov     qword ptr [rsp + 56], 5
        mov     qword ptr [rsp + 64], 5
        mov     qword ptr [rsp + 72], 5
        mov     qword ptr [rsp + 80], 5
        mov     qword ptr [rsp + 88], 5
        mov     qword ptr [rsp + 96], 5
        lea     rax, [rsp + 104]
        mov     qword ptr [rsp + 104], 5
        mov     qword ptr [rsp + 112], 5
        mov     qword ptr [rsp + 120], 5
        mov     qword ptr [rsp + 128], 5
        mov     qword ptr [rsp + 136], 5
        mov     qword ptr [rsp + 144], 5
        mov     qword ptr [rsp + 152], 5
        mov     qword ptr [rsp + 160], 5
        mov     qword ptr [rsp + 168], 5
        mov     qword ptr [rsp + 176], 5
        mov     qword ptr [rsp + 184], 5
        mov     qword ptr [rsp + 192], 5
        mov     qword ptr [rsp + 200], 5
        mov     qword ptr [rsp + 208], 5
        mov     qword ptr [rsp + 216], 5
        mov     qword ptr [rsp + 224], 5
        mov     qword ptr [rsp + 232], 5
        mov     qword ptr [rsp + 240], 5
        mov     qword ptr [rsp + 248], 5
        mov     qword ptr [rsp + 256], 5
        mov     qword ptr [rsp + 264], 5
        mov     qword ptr [rsp + 272], 5
        mov     qword ptr [rsp + 280], 5
        mov     qword ptr [rsp + 288], 5
        mov     qword ptr [rsp + 296], 5
        mov     qword ptr [rsp + 304], 5
        mov     qword ptr [rsp + 312], 5
        mov     qword ptr [rsp + 320], 5
        mov     qword ptr [rsp + 328], 5
        mov     qword ptr [rsp + 336], 5
        mov     qword ptr [rsp + 344], 5
        mov     qword ptr [rsp + 352], 5
        mov     qword ptr [rsp + 360], 5
        mov     qword ptr [rsp + 368], 5
        mov     qword ptr [rsp + 376], 5
        mov     qword ptr [rsp + 384], 5
        mov     qword ptr [rsp + 392], 5
        mov     qword ptr [rsp + 400], 5
        mov     qword ptr [rsp + 408], 5
        mov     qword ptr [rsp + 416], 5
        lea     rcx, [rip + .L__unnamed_1]
        mov     qword ptr [rsp + 424], rcx
        mov     qword ptr [rsp + 432], 2
        mov     qword ptr [rsp + 8], rax
        mov     rax, qword ptr [rip + core::fmt::num::imp::<impl core::fmt::Display for usize>::fmt@GOTPCREL]
        mov     qword ptr [rsp + 16], rax
        mov     qword ptr [rsp + 440], 0
        lea     rax, [rsp + 8]
        mov     qword ptr [rsp + 456], rax
        mov     qword ptr [rsp + 464], 1
        lea     rdi, [rsp + 424]
        call    qword ptr [rip + std::io::stdio::_print@GOTPCREL]
        add     rsp, 472
        ret

@bholley
Copy link
Contributor Author

bholley commented Jan 15, 2021

Is that just replacing the memcpy with an inline series of moves? If so I hope we can do better. :-)

@jrmuizel
Copy link
Contributor

No. There are 50 movs to initialize with 5 and only 8 other movs to set up parameters for the call to print. It would be better if the initialization of the array could get removed too, but that's another issue.

Here's what it looks like by default without -Zmir-opt-level=2

example::main:
        push    r14
        push    rbx
        sub     rsp, 872
        mov     qword ptr [rsp + 24], 5
        mov     qword ptr [rsp + 32], 5
        mov     qword ptr [rsp + 40], 5
        mov     qword ptr [rsp + 48], 5
        mov     qword ptr [rsp + 56], 5
        mov     qword ptr [rsp + 64], 5
        mov     qword ptr [rsp + 72], 5
        mov     qword ptr [rsp + 80], 5
        mov     qword ptr [rsp + 88], 5
        mov     qword ptr [rsp + 96], 5
        mov     qword ptr [rsp + 104], 5
        mov     qword ptr [rsp + 112], 5
        mov     qword ptr [rsp + 120], 5
        mov     qword ptr [rsp + 128], 5
        mov     qword ptr [rsp + 136], 5
        mov     qword ptr [rsp + 144], 5
        mov     qword ptr [rsp + 152], 5
        mov     qword ptr [rsp + 160], 5
        mov     qword ptr [rsp + 168], 5
        mov     qword ptr [rsp + 176], 5
        mov     qword ptr [rsp + 184], 5
        mov     qword ptr [rsp + 192], 5
        mov     qword ptr [rsp + 200], 5
        mov     qword ptr [rsp + 208], 5
        mov     qword ptr [rsp + 216], 5
        mov     qword ptr [rsp + 224], 5
        mov     qword ptr [rsp + 232], 5
        mov     qword ptr [rsp + 240], 5
        mov     qword ptr [rsp + 248], 5
        mov     qword ptr [rsp + 256], 5
        mov     qword ptr [rsp + 264], 5
        mov     qword ptr [rsp + 272], 5
        mov     qword ptr [rsp + 280], 5
        mov     qword ptr [rsp + 288], 5
        mov     qword ptr [rsp + 296], 5
        mov     qword ptr [rsp + 304], 5
        mov     qword ptr [rsp + 312], 5
        mov     qword ptr [rsp + 320], 5
        mov     qword ptr [rsp + 328], 5
        mov     qword ptr [rsp + 336], 5
        mov     qword ptr [rsp + 344], 5
        mov     qword ptr [rsp + 352], 5
        mov     qword ptr [rsp + 360], 5
        mov     qword ptr [rsp + 368], 5
        mov     qword ptr [rsp + 376], 5
        mov     qword ptr [rsp + 384], 5
        mov     qword ptr [rsp + 392], 5
        mov     qword ptr [rsp + 400], 5
        mov     qword ptr [rsp + 408], 5
        mov     qword ptr [rsp + 416], 5
        lea     r14, [rsp + 472]
        lea     rsi, [rsp + 24]
        mov     rbx, qword ptr [rip + memcpy@GOTPCREL]
        mov     edx, 400
        mov     rdi, r14
        call    rbx
        lea     rdi, [rsp + 24]
        mov     edx, 400
        mov     rsi, r14
        call    rbx
        lea     rax, [rsp + 104]
        mov     qword ptr [rsp + 8], rax
        mov     rax, qword ptr [rip + core::fmt::num::imp::<impl core::fmt::Display for usize>::fmt@GOTPCREL]
        mov     qword ptr [rsp + 16], rax
        lea     rax, [rip + .L__unnamed_1]
        mov     qword ptr [rsp + 424], rax
        mov     qword ptr [rsp + 432], 2
        mov     qword ptr [rsp + 440], 0
        lea     rax, [rsp + 8]
        mov     qword ptr [rsp + 456], rax
        mov     qword ptr [rsp + 464], 1
        lea     rdi, [rsp + 424]
        call    qword ptr [rip + std::io::stdio::_print@GOTPCREL]
        add     rsp, 872
        pop     rbx
        pop     r14
        ret

The initialization is there along with 2 400 byte memcpys.

@erikdesjardins
Copy link
Contributor

Looks like nightly (probably #82806) removes the memcpy from the original example (https://godbolt.org/z/nGfxaj), but not from the smallvec example (https://godbolt.org/z/Y34K36)

cc @nikic

@workingjubilee workingjubilee added the C-optimization Category: An issue highlighting optimization opportunities or PRs implementing such label Oct 8, 2023
@erikdesjardins
Copy link
Contributor

erikdesjardins commented Feb 24, 2024

The memcpy was removed from the original example in 1.52: https://godbolt.org/z/5fnGqG1KG.
The array initialization is still there, but that's gonna be hard to remove since println! takes a reference to the array element. It disappears if you force a move: https://godbolt.org/z/hYWT6deWo.

The memcpy was removed from the smallvec example in 1.70: https://godbolt.org/z/6oGcfMaor.
There's still some unnecessary stack traffic that seems to be due to phase ordering: https://godbolt.org/z/9o8e6eqPc.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. C-enhancement Category: An issue proposing an enhancement or a PR with one. C-optimization Category: An issue highlighting optimization opportunities or PRs implementing such I-slow Issue: Problems and improvements with respect to performance of generated code. P-medium Medium priority T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.
Projects
None yet
Development

No branches or pull requests