Code generated by wasmtime doesn't cache-align loops #4883

Open
koute opened this issue Sep 8, 2022 · 2 comments · May be fixed by #5004
Labels
cranelift:E-easy Issues suitable for newcomers to investigate, including Rust newcomers! cranelift:goal:optimize-speed Focus area: the speed of the code produced by Cranelift. cranelift Issues related to the Cranelift code generator enhancement performance

Comments

@koute
Contributor

koute commented Sep 8, 2022

The problem

Currently wasmtime/cranelift (unlike e.g. LLVM, which AFAIK doesn't have this problem) doesn't cache-align the loops it generates, which can lead to huge performance regressions when a hot loop happens to span multiple cache lines.

Background

Recently we were updating from wasmtime 0.38 to 0.40 and saw a peculiar performance regression when doing so. One of our benchmarks took almost 2x the time to run, and a lot of them took around 45% more time. A huge regression. Ultimately it turned out to be unrelated to the 0.38 -> 0.40 upgrade. We tracked the problem down to memset within the WASM (we're currently not using the bulk memory ops extension) suddenly taking a lot more time to run for no apparent reason. Depending on the exact address at which wasmtime placed the generated code for memset (which is essentially random, although consistent for the same code with the same flags in the same environment), the benchmarks were either slow or fast, and it all boiled down to whether the hot loop of the memset spanned multiple cache lines or not.
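To make the "spans multiple cache lines" condition concrete, here is a minimal sketch that counts how many cache lines a piece of code touches given its start offset and length. The function name `cache_lines_touched` is hypothetical (it is not part of wasmtime or Cranelift), and the 64-byte line size used in the comments is an assumption that holds for most current x86-64 CPUs.

```rust
/// Number of cache lines touched by `len` bytes of code starting at
/// byte offset `start`. Hypothetical helper for illustration only.
/// `line_size` must be a power of two (64 is assumed for x86-64).
fn cache_lines_touched(start: u64, len: u64, line_size: u64) -> u64 {
    assert!(line_size.is_power_of_two());
    if len == 0 {
        return 0;
    }
    // Index of the line holding the first byte and the last byte.
    let first = start / line_size;
    let last = (start + len - 1) / line_size;
    last - first + 1
}
```

With 64-byte lines, a 24-byte loop body placed at offset 56 touches two lines, while the same body placed at offset 64 fits in one. The same code can therefore land in either case depending only on where it happens to be emitted.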

You can find a detailed analysis of the problem in this comment and this comment of mine.

@cfallin
Member

cfallin commented Sep 8, 2022

Thanks for tracking this down, @koute! Yes, I agree that aligning loop headers to cache-line boundaries makes sense. This should probably be a compile-time option that applies only when opts are enabled (debug code is going to be substantially more bloated for other reasons, so we don't want to inflate it further, and it is going to be slow anyway).

@cfallin
Member

cfallin commented Sep 8, 2022

This might be a good starter issue for someone to tackle. The main steps I see this taking are:

  • Convey a notion of "loop header block" to lowered blocks in the VCode. This information can be obtained from the loop analysis, or perhaps more simply and with less overhead, by detecting backedges (branch from higher-index block to lower-index block) when lowering code and marking the target as a header. (The latter is precise for reducible control flow and approximate but pretty good for irreducible control flow.) This would probably best be done somewhere around here and could insert the block index into a set held by the VCode.
  • When we reach a loop header block in VCode::emit, use the stricter loop-header alignment (64 bytes probably?) here rather than the default basic-block alignment, which is 1 byte on x86-64.
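The two steps above can be sketched roughly as follows. This is not Cranelift's actual code: `loop_headers` and `padding_for` are hypothetical names, and the successor list `succs` is a simplified stand-in for the lowered block order. Treating a self-loop (a block branching to itself) as a backedge is also an assumption of this sketch.

```rust
use std::collections::BTreeSet;

/// Marks as a loop header every block that is the target of a backedge,
/// i.e. a branch from a block to one at the same or a lower index.
/// `succs[i]` lists the successor block indices of block `i`
/// (hypothetical representation, not the real VCode structure).
fn loop_headers(succs: &[Vec<usize>]) -> BTreeSet<usize> {
    let mut headers = BTreeSet::new();
    for (from, targets) in succs.iter().enumerate() {
        for &to in targets {
            if to <= from {
                headers.insert(to);
            }
        }
    }
    headers
}

/// Number of padding bytes needed so that `offset` becomes a multiple
/// of `align`. `align` must be a power of two.
fn padding_for(offset: u32, align: u32) -> u32 {
    assert!(align.is_power_of_two());
    offset.wrapping_neg() & (align - 1)
}
```

During emission, a block found in the header set would get `padding_for(current_offset, 64)` bytes of padding in front of it, while every other block keeps the default alignment (1 byte on x86-64, so no padding).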

If no one else wants to take it, I can do this at some point but I thought I would put this out there first!

@cfallin cfallin added cranelift:E-easy Issues suitable for newcomers to investigate, including Rust newcomers! performance labels Sep 8, 2022
@akirilov-arm akirilov-arm added enhancement cranelift Issues related to the Cranelift code generator cranelift:goal:optimize-speed Focus area: the speed of the code produced by Cranelift. labels Sep 12, 2022
@pepyakin pepyakin linked a pull request Oct 4, 2022 that will close this issue