This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

Upgrade wasmtime to 0.40.0 #12096

Merged
merged 9 commits into from
Sep 8, 2022

Conversation

bkchr
Member

@bkchr bkchr commented Aug 23, 2022

Closes: #12073

@bkchr bkchr added A2-insubstantial Pull request requires no code review (e.g., a sub-repository hash update). B0-silent Changes should not be mentioned in any release notes C1-low PR touches the given topic and has a low impact on builders. D3-trivial 🧸 PR contains trivial changes in a runtime directory that do not require an audit labels Aug 23, 2022
@bkchr bkchr requested review from koute and pepyakin August 23, 2022 21:25
@koute
Contributor

koute commented Aug 25, 2022

Looking at the wasmtime release notes, this line caught my attention:

> Many more instructions are now implemented with ISLE instead of handwritten lowerings.

So I ran our basic benchmarks; those are mostly unchanged, all good. But then I ran a small subset of our pallet benches, and it seems like there are some huge performance regressions here. (These numbers are the newest master vs this PR.)

The full table of all of the benches I ran:

Benchmark results
time %
pallet_contracts/on_process_deletion_queue_batch 103.3
pallet_contracts/on_initialize_per_trie_key 102.2
pallet_contracts/on_initialize_per_queue_item 100.7
pallet_contracts/reinstrument 99.9
pallet_contracts/call_with_code_per_byte 113.2
pallet_contracts/instantiate_with_code 105.2
pallet_contracts/instantiate 113.1
pallet_contracts/call 140.1
pallet_contracts/upload_code 102.3
pallet_contracts/remove_code 105.1
pallet_contracts/set_code 103.0
pallet_contracts/seal_caller 127.9
pallet_contracts/seal_is_contract 112.9
pallet_contracts/seal_code_hash 114.5
pallet_contracts/seal_own_code_hash 124.6
pallet_contracts/seal_caller_is_origin 139.8
pallet_contracts/seal_address 128.9
pallet_contracts/seal_gas_left 129.4
pallet_contracts/seal_balance 120.3
pallet_contracts/seal_value_transferred 130.6
pallet_contracts/seal_minimum_balance 130.2
pallet_contracts/seal_block_number 129.8
pallet_contracts/seal_now 129.3
pallet_contracts/seal_weight_to_fee 118.2
pallet_contracts/seal_gas 124.0
pallet_contracts/seal_input 131.2
pallet_contracts/seal_input_per_kb 91.5
pallet_contracts/seal_return 161.4
pallet_contracts/seal_return_per_kb 151.0
pallet_contracts/seal_terminate 153.9
pallet_contracts/seal_random 117.0
pallet_contracts/seal_deposit_event 114.4
pallet_contracts/seal_deposit_event_per_topic_and_kb 110.4
pallet_contracts/seal_debug_message 120.4
pallet_contracts/seal_set_storage 112.9
pallet_contracts/seal_set_storage_per_new_kb 117.7
pallet_contracts/seal_set_storage_per_old_kb 115.3
pallet_contracts/seal_clear_storage 111.3
pallet_contracts/seal_clear_storage_per_kb 114.1
pallet_contracts/seal_get_storage 112.0
pallet_contracts/seal_get_storage_per_kb 113.1
pallet_contracts/seal_contains_storage 112.4
pallet_contracts/seal_contains_storage_per_kb 115.7
pallet_contracts/seal_take_storage 110.6
pallet_contracts/seal_take_storage_per_kb 112.3
pallet_contracts/seal_transfer 111.2
pallet_contracts/seal_call 169.5
pallet_contracts/seal_delegate_call 173.1
pallet_contracts/seal_call_per_transfer_clone_kb 129.0
pallet_contracts/seal_instantiate 151.5
pallet_contracts/seal_instantiate_per_transfer_salt_kb 113.4
pallet_contracts/seal_hash_sha2_256 124.5
pallet_contracts/seal_hash_sha2_256_per_kb 109.4
pallet_contracts/seal_hash_keccak_256 120.2
pallet_contracts/seal_hash_keccak_256_per_kb 105.4
pallet_contracts/seal_hash_blake2_256 121.7
pallet_contracts/seal_hash_blake2_256_per_kb 108.5
pallet_contracts/seal_hash_blake2_128 124.3
pallet_contracts/seal_hash_blake2_128_per_kb 109.3
pallet_contracts/seal_ecdsa_recover 107.9
pallet_contracts/seal_ecdsa_to_eth_address 107.4
pallet_contracts/seal_set_code_hash 111.6
pallet_contracts/instr_i64const 163.1
pallet_contracts/instr_i64load 145.3
pallet_contracts/instr_i64store 141.0
pallet_contracts/instr_select 139.8
pallet_contracts/instr_if 144.5
pallet_contracts/instr_br 155.2
pallet_contracts/instr_br_if 148.0
pallet_contracts/instr_br_table 147.9
pallet_contracts/instr_br_table_per_entry 189.8
pallet_contracts/instr_call 127.4
pallet_contracts/instr_call_indirect 135.7
pallet_contracts/instr_call_indirect_per_param 144.0
pallet_contracts/instr_local_get 165.8
pallet_contracts/instr_local_set 164.0
pallet_contracts/instr_local_tee 156.4
pallet_contracts/instr_global_get 152.8
pallet_contracts/instr_global_set 152.2
pallet_contracts/instr_memory_current 159.8
pallet_contracts/instr_memory_grow 156.0
pallet_contracts/instr_i64clz 151.3
pallet_contracts/instr_i64ctz 153.0
pallet_contracts/instr_i64popcnt 153.2
pallet_contracts/instr_i64eqz 152.6
pallet_contracts/instr_i64extendsi32 153.6
pallet_contracts/instr_i64extendui32 154.0
pallet_contracts/instr_i32wrapi64 153.2
pallet_contracts/instr_i64eq 146.0
pallet_contracts/instr_i64ne 146.5
pallet_contracts/instr_i64lts 146.1
pallet_contracts/instr_i64ltu 146.7
pallet_contracts/instr_i64gts 146.3
pallet_contracts/instr_i64gtu 145.3
pallet_contracts/instr_i64les 146.0
pallet_contracts/instr_i64leu 145.3
pallet_contracts/instr_i64ges 145.6
pallet_contracts/instr_i64geu 146.0
pallet_contracts/instr_i64add 146.0
pallet_contracts/instr_i64sub 145.2
pallet_contracts/instr_i64mul 146.1
pallet_contracts/instr_i64divs 146.2
pallet_contracts/instr_i64divu 145.2
pallet_contracts/instr_i64rems 146.0
pallet_contracts/instr_i64remu 146.8
pallet_contracts/instr_i64and 145.9
pallet_contracts/instr_i64or 144.7
pallet_contracts/instr_i64xor 142.4
pallet_contracts/instr_i64shl 143.5
pallet_contracts/instr_i64shrs 145.0
pallet_contracts/instr_i64shru 145.1
pallet_contracts/instr_i64rotl 145.2
pallet_contracts/instr_i64rotr 145.7
pallet_identity/add_registrar 102.1
pallet_identity/set_identity 108.4
pallet_identity/set_subs_new 107.2
pallet_identity/set_subs_old 107.3
pallet_identity/clear_identity 107.2
pallet_identity/request_judgement 106.9
pallet_identity/cancel_request 107.5
pallet_identity/set_fee 103.0
pallet_identity/set_account_id 103.6
pallet_identity/set_fields 103.8
pallet_identity/provide_judgement 106.4
pallet_identity/kill_identity 108.1
pallet_identity/add_sub 105.5
pallet_identity/rename_sub 103.0
pallet_identity/remove_sub 105.2
pallet_identity/quit_sub 104.3
pallet_multisig/as_multi_threshold_1 102.7
pallet_multisig/as_multi_create 105.6
pallet_multisig/as_multi_create_store 104.5
pallet_multisig/as_multi_approve 104.7
pallet_multisig/as_multi_approve_store 105.2
pallet_multisig/as_multi_complete 106.7
pallet_multisig/approve_as_multi_create 106.3
pallet_multisig/approve_as_multi_approve 104.7
pallet_multisig/approve_as_multi_complete 108.1
pallet_multisig/cancel_as_multi 106.9

Here are the top regressions:

Benchmark results
time %
pallet_contracts/instr_br_table_per_entry 189.8
pallet_contracts/seal_delegate_call 173.1
pallet_contracts/seal_call 169.5
pallet_contracts/instr_local_get 165.8
pallet_contracts/instr_local_set 164.0
pallet_contracts/instr_i64const 163.1
pallet_contracts/seal_return 161.4
pallet_contracts/instr_memory_current 159.8
pallet_contracts/instr_local_tee 156.4
pallet_contracts/instr_memory_grow 156.0
pallet_contracts/instr_br 155.2
pallet_contracts/instr_i64extendui32 154.0
pallet_contracts/seal_terminate 153.9
pallet_contracts/instr_i64extendsi32 153.6
pallet_contracts/instr_i64popcnt 153.2
pallet_contracts/instr_i32wrapi64 153.2
pallet_contracts/instr_i64ctz 153.0
pallet_contracts/instr_global_get 152.8
pallet_contracts/instr_i64eqz 152.6
pallet_contracts/instr_global_set 152.2
pallet_contracts/seal_instantiate 151.5
pallet_contracts/instr_i64clz 151.3
pallet_contracts/seal_return_per_kb 151.0
pallet_contracts/instr_br_if 148.0
pallet_contracts/instr_br_table 147.9
pallet_contracts/instr_i64remu 146.8
pallet_contracts/instr_i64ltu 146.7
pallet_contracts/instr_i64ne 146.5
pallet_contracts/instr_i64gts 146.3
pallet_contracts/instr_i64divs 146.2
pallet_contracts/instr_i64mul 146.1
pallet_contracts/instr_i64lts 146.1
pallet_contracts/instr_i64rems 146.0
pallet_contracts/instr_i64add 146.0
pallet_contracts/instr_i64geu 146.0
pallet_contracts/instr_i64les 146.0
pallet_contracts/instr_i64eq 146.0
pallet_contracts/instr_i64and 145.9
pallet_contracts/instr_i64rotr 145.7
pallet_contracts/instr_i64ges 145.6
pallet_contracts/instr_i64gtu 145.3
pallet_contracts/instr_i64load 145.3
pallet_contracts/instr_i64leu 145.3
pallet_contracts/instr_i64rotl 145.2
pallet_contracts/instr_i64divu 145.2
pallet_contracts/instr_i64sub 145.2
pallet_contracts/instr_i64shru 145.1
pallet_contracts/instr_i64shrs 145.0
pallet_contracts/instr_i64or 144.7
pallet_contracts/instr_if 144.5

So normal pallets are roughly ~5% slower, while the contracts pallet is around ~50% slower. This is very worrying. Assuming I haven't screwed something up while benchmarking (this was just a quick back-of-the-envelope run), merging this might tank the performance of chains like e.g. Moonbeam.
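For reference, the "time %" figures in the tables are simply the new raw time expressed as a percentage of the old one. A minimal sketch (the helper name is hypothetical, not part of the benchmarking tooling; the sample numbers are the seal_return_per_kb raw times measured later in this thread):

```rust
// Hypothetical helper (not the actual tooling): express a new raw extrinsic
// time as a percentage of the old one, as in the "time %" tables above.
fn time_percent(old_ns: f64, new_ns: f64) -> f64 {
    new_ns / old_ns * 100.0
}

fn main() {
    // seal_return_per_kb: 218_880 ns on wasmtime 0.38 vs 362_810 ns on 0.40.
    let pct = time_percent(218_880.0, 362_810.0);
    println!("{pct:.1}"); // prints 165.8, i.e. a ~66% slowdown
}
```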

@bkchr
Member Author

bkchr commented Aug 25, 2022

Okay, that sounds bad. Maybe we should try the bench bot and look at its results to see whether it comes to similar conclusions.

@bkchr
Member Author

bkchr commented Aug 25, 2022

/cmd queue -c bench-bot $ pallet dev pallet_contracts

@command-bot

command-bot bot commented Aug 25, 2022

@bkchr https://gitlab.parity.io/parity/mirrors/substrate/-/jobs/1775806 was started for your command "$PIPELINE_SCRIPTS_DIR/bench-bot.sh" pallet dev pallet_contracts. Check out https://gitlab.parity.io/parity/mirrors/substrate/-/pipelines?page=1&scope=all&username=group_605_bot to know what else is being executed currently.

Comment /cmd cancel 32-64a42c2b-6b5c-4bfc-abaa-db17c0648d4b to cancel this command or /cmd cancel to cancel all commands in this pull request.

@command-bot

command-bot bot commented Aug 25, 2022

@bkchr Command "$PIPELINE_SCRIPTS_DIR/bench-bot.sh" pallet dev pallet_contracts has finished. Result: https://gitlab.parity.io/parity/mirrors/substrate/-/jobs/1775806 has finished. If any artifacts were generated, you can download them from https://gitlab.parity.io/parity/mirrors/substrate/-/jobs/1775806/artifacts/download.

@ggwpez
Member

ggwpez commented Aug 25, 2022

Not great but also not horrible either: SWC,
(screenshot: SWC weight-comparison table)

> So normal pallets are roughly ~5% slower, while the contracts pallet is around ~50% slower.

Did you take the component ranges into account?
Often the asymptotic weights are quite different from the base values.

@bkchr
Member Author

bkchr commented Aug 25, 2022

> Not great but also not horrible either: SWC, (screenshot)
>
> > So normal pallets are roughly ~5% slower, while the contracts pallet is around ~50% slower.
>
> Did you take the component ranges into account? Often the asymptotic weights are quite different from the base values.

Nice tool!

So, what are we going to do? I assume the wasmtime guys will not accept reverting their commit that makes our code slower? 😬

@bkchr
Member Author

bkchr commented Aug 25, 2022

There are quite a lot of these migrations to this ISLE syntax. Does anyone want to dig into this? 🙈

@koute
Contributor

koute commented Aug 26, 2022

> Did you take the component ranges into account? Often the asymptotic weights are quite different from the base values.

My comparisons were based on the raw times from the .json dump you can get the benchmarks to output. (Which are in general more appropriate for apples-to-apples performance comparisons than just comparing weights.)

> Nice tool!

I agree! @ggwpez How exactly are the "Old [Time]" and "New [Time]" columns calculated there? Are those based on raw times?

> So, what are we going to do? I assume the wasmtime guys will not accept reverting their commit that makes our code slower?

I'm sure they won't, but they do care about performance. A regression like this is surely bound to affect other people too. So what I think we should do is to make a minimal reproduction that clearly demonstrates the problem (port the benchmark into a standalone WASM blob and make a minimal standalone host which calls into it), and post an issue on their issue tracker.

> There are quite a lot of these migrations to this ISLE syntax. Does anyone want to dig into this?

I could do this but... I just said the other day I don't want to get distracted anymore before clearing up my current TODO list, so... if there are any volunteers, feel free to take a stab, and if not then I'll do it... (:

@athei
Member

athei commented Aug 26, 2022

It looks like the regression is in wasmi, given that the instruction benchmarks suffered the most. So maybe just running coremark with wasmi would be a good test. Wasmi itself will be run by wasmtime in this case, of course.

@ggwpez
Member

ggwpez commented Aug 26, 2022

> My comparisons were based on the raw times from the .json dump you can get the benchmarks to output. (Which are in general more appropriate for apples-to-apples performance comparisons than just comparing weights.)

Yeah, for raw WASM performance it makes sense not to look at the post-processed weights.

> I agree! @ggwpez How exactly are the "Old [Time]" and "New [Time]" columns calculated there? Are those based on raw times?

It uses the weight formulas from the Rust weight.rs files by parsing the code.
So in your case it's probably better to use raw data, but the weights are what we want to keep low.
You can click on each extrinsic to see the formula.
The upper bounds for each component are also extracted from the source code to provide exact worst-case weights:
(screenshot: per-extrinsic weight formula with component upper bounds)

This shows that these weights were probably outdated, since the number of storage accesses changed.
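As a rough illustration of what "exact worst-case weights" means here: a benchmarked weight is, to a first approximation, a linear formula in its components, and the worst case evaluates that formula at each component's extracted upper bound. A sketch with made-up numbers (the function and its constants are hypothetical, not SWC's actual code):

```rust
// Hypothetical sketch: evaluate a linear weight formula
// w = base + sum(slope_i * component_i) at each component's upper bound.
fn worst_case_weight(base: u64, components: &[(u64, u64)]) -> u64 {
    components
        .iter()
        .map(|(slope, upper_bound)| slope * upper_bound) // worst case: component at its max
        .fold(base, |acc, term| acc + term)
}

fn main() {
    // e.g. w(n) = 100_000 + 2_000 * n weight units, with n (say, input KB) <= 16.
    println!("{}", worst_case_weight(100_000, &[(2_000, 16)])); // prints 132000
}
```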

@bkchr
Member Author

bkchr commented Aug 29, 2022

> This shows that these weights were probably outdated, since the number of storage accesses changed.

Sounds like we should regenerate these weights on master to have a fair comparison?

@koute
Contributor

koute commented Aug 31, 2022

Looks like there are no volunteers after all, so I'll do it. (:

@koute
Contributor

koute commented Aug 31, 2022

> It looks like the regression is in wasmi, given that the instruction benchmarks suffered the most. So maybe just running coremark with wasmi would be a good test. Wasmi itself will be run by wasmtime in this case, of course.

That doesn't look to be the case from what I can see. I've hacked up a benchmark across many wasmtime versions, and here are the results I got for coremark (coremark scores, so higher is better):

wasmtime 0.40: 174.35779
wasmtime 0.39: 174.94751
wasmtime 0.38: 175.75722
wasmtime 0.37: 140.76576
wasmtime 0.36: 178.81624
wasmtime 0.35: 174.08461
wasmtime 0.34: 182.12724
wasmtime 0.33: 178.41212
native:        587.1677

So there might be a slight performance regression between 0.38 and 0.40, but it's nothing major.

I'll try something heavier. Maybe the wasmi benches from here.

@koute
Contributor

koute commented Sep 1, 2022

I still don't know why this happens, but I've just discovered something funny. Apparently doing this in frame/contracts/Cargo.toml partially fixes the regression for at least one benchmark... (I haven't rerun all of them yet.)

- sp-io = { version = "6.0.0", default-features = false, path = "../../primitives/io" }
+ sp-io = { version = "6.0.0", default-features = false, path = "../../primitives/io", features = ["improved_panic_error_reporting"] }

extrinsic_time for pallet_contracts/seal_return_per_kb on wasmtime 0.40 before this change: 362810
extrinsic_time for pallet_contracts/seal_return_per_kb on wasmtime 0.40 after this change: 231050
extrinsic_time for pallet_contracts/seal_return_per_kb on wasmtime 0.38: 218880

I've run this multiple times to be sure; the results are consistent.

It just doesn't make much sense, as all it does is enable the feature from this PR of mine from a while back, so my guess is that it affects the optimizations somehow in a weird way? That would explain why I can't reproduce the issue in a minimal example: I've ported the contracts benchmark into a minimal example, but I see no regression there. So I'll either have to go the other way around and start with a full Substrate build, deleting code until I end up with a minimal reproduction, or (and this is probably a better idea) bisect wasmtime between 0.38 and 0.40, see which commit affects this, and go from there.

@athei
Member

athei commented Sep 1, 2022

Your conclusion seems plausible. Optimization behavior can be erratic. This is further supported by the fact that you can't reproduce the regression when benchmarking wasmi directly.

@pepyakin
Contributor

pepyakin commented Sep 1, 2022

I am not aware of how that could possibly affect optimizations.

However, 0.40 implemented a custom stack-walking mechanism which is supposed to make those things faster (although I haven't dug into the current problem enough to be confident).

@koute
Contributor

koute commented Sep 2, 2022

Okay, this is one of the weirdest heisenbugs I've ever seen.

I did bisect wasmtime, and it pointed to this commit. When reverted, the extrinsic times returned to normal. So far so good. While trying to investigate in more detail I recompiled with a WASM_BUILD_RUSTFLAGS="-g" flag and... the behavior flipped. That is, I now only got normal extrinsic times with that commit applied, and when it was reverted (which previously "fixed" the issue) the problem came back! While trying to debug this I made various other changes, and I hacked the executor so that I can forcibly replace the WASM blob at runtime through an environment variable. I then reran all of the previous WASM blob variants I had already tested, but without recompiling the host. What I found was that some of the WASM blobs which were "good" now ran like the "bad" ones, and some of the "bad" ones ran just like the "good" ones.

So what I can conclude right now:

  1. I might be starting to go crazy debugging this. (:
  2. We have a really weird bug somewhere.
  3. The issue is somewhere on the boundary of the host and the WASM blob; changing either can trigger or hide the issue.
  4. Performance-wise it might not affect real production code, but even if that's the case I'd really like to know why this happens; I don't want our weights semi-randomly changing their values by ~20-80% based on some voodoo.

I still don't know exactly why this happens, but I can already see it's (as usual) probably going to be something very stupid.

@bkchr
Member Author

bkchr commented Sep 2, 2022

@koute maybe there is something like the ahash compile-time randomness? I.e. something that gets sourced at compile time and changes the value based on how you compile it? Or that may change random code that doesn't really seem to be related?

@koute
Contributor

koute commented Sep 5, 2022

> @koute maybe there is something like the ahash compile-time randomness? I.e. something that gets sourced at compile time and changes the value based on how you compile it? Or that may change random code that doesn't really seem to be related?

Yeah, that's one of the things I'm checking. Although it seems like the behavior is consistent between compilations (if I compile the same code with exactly the same flags I get the same behavior), so if it's due to the randomness it's not using a truly random seed.

Contributor

@koute koute left a comment


Should be fine to merge after making sure the rustix version we use is aligned with what wasmtime uses. (IIRC they should be the same now so we shouldn't need to have two of them anymore)

@pepyakin
Contributor

pepyakin commented Sep 6, 2022

FWIW, we have #10707 for tracking the non-MVP features on the executor

@@ -18,22 +18,22 @@
//! Autogenerated weights for pallet_contracts
//!
//! THIS FILE WAS AUTO-GENERATED USING THE SUBSTRATE BENCHMARK CLI VERSION 4.0.0-dev
-//! DATE: 2022-06-22, STEPS: `50`, REPEAT: 20, LOW RANGE: `[]`, HIGH RANGE: `[]`
+//! DATE: 2022-08-25, STEPS: `50`, REPEAT: 20, LOW RANGE: `[]`, HIGH RANGE: `[]`
Contributor


Do we need that file updated? Why this, and why not other benchmarks?

Member Author


Because we only had the bot run for this.

@@ -18,7 +18,7 @@ codec = { package = "parity-scale-codec", version = "3.0.0", default-features =
impl-trait-for-tuples = "0.2.2"
log = { version = "0.4.17", optional = true }
wasmi = { version = "0.9.1", optional = true }
-wasmtime = { version = "0.38.0", default-features = false, optional = true }
+wasmtime = { version = "0.40.0", default-features = false, optional = true }
Contributor


Nit: make it 0.40.1 as in executor's cargo.toml?

Member

@athei athei left a comment


Thanks for tracking this down :)

@bkchr
Member Author

bkchr commented Sep 7, 2022

bot merge

@paritytech-processbot

Waiting for commit status.

@paritytech-processbot

Merge cancelled due to error. Error: Statuses failed for 4a86689

@bkchr
Member Author

bkchr commented Sep 7, 2022

@athei so should I regenerate the weights again here?

@koute
Contributor

koute commented Sep 7, 2022

After further investigation I have some final notes regarding the slowdown we saw here. Let me quickly walk you through it.

Details

The memset function which was the culprit here is defined in Rust's compiler-builtins and looks like this: (It's called set_bytes, but trust me, it's memset.)

#[inline(always)]
pub unsafe fn set_bytes(mut s: *mut u8, c: u8, mut n: usize) {
    #[inline(always)]
    pub unsafe fn set_bytes_bytes(mut s: *mut u8, c: u8, n: usize) {
        let end = s.add(n);
        while s < end {
            *s = c;
            s = s.add(1);
        }
    }

    #[inline(always)]
    pub unsafe fn set_bytes_words(s: *mut u8, c: u8, n: usize) {
        let mut broadcast = c as usize;
        let mut bits = 8;
        while bits < WORD_SIZE * 8 {
            broadcast |= broadcast << bits;
            bits *= 2;
        }

        let mut s_usize = s as *mut usize;
        let end = s.add(n) as *mut usize;

        while s_usize < end {
            *s_usize = broadcast;
            s_usize = s_usize.add(1);
        }
    }

    if likely(n >= WORD_COPY_THRESHOLD) {
        // Align s
        // Because of n >= 2 * WORD_SIZE, dst_misalignment < n
        let misalignment = (s as usize).wrapping_neg() & WORD_MASK;
        set_bytes_bytes(s, c, misalignment);
        s = s.add(misalignment);
        n -= misalignment;

        let n_words = n & !WORD_MASK;
        set_bytes_words(s, c, n_words);
        s = s.add(n_words);
        n -= n_words;
    }
    set_bytes_bytes(s, c, n);
}

Now, you can just ignore most of this code, as it's irrelevant to the performance and only handles things like misaligned pointers or very small buffers. What we're most interested in is the little inner loop inside set_bytes_words, in which every big memset will spend 99% of its time:

while s_usize < end {
    *s_usize = broadcast;
    s_usize = s_usize.add(1);
}

It just fills the memory in a loop, one usize-sized chunk at a time. Super simple.
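The byte-broadcast step earlier in set_bytes_words is also worth seeing in isolation; here is a standalone sketch of that doubling loop (the same logic as above, extracted for clarity):

```rust
// The broadcast step from set_bytes_words: replicate the fill byte `c`
// into every byte of a usize by repeatedly OR-ing in a shifted copy.
const WORD_SIZE: usize = core::mem::size_of::<usize>();

fn broadcast(c: u8) -> usize {
    let mut broadcast = c as usize;
    let mut bits = 8;
    while bits < WORD_SIZE * 8 {
        broadcast |= broadcast << bits; // 0xAB -> 0xABAB -> 0xABABABAB -> ...
        bits *= 2;
    }
    broadcast
}

fn main() {
    // On a 64-bit target this prints 0xabababababababab.
    println!("{:#x}", broadcast(0xAB));
}
```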

Now, here's how our friend wasmtime/cranelift compiles down this little loop into assembly. This is the exact code we were running; comments are mine for those who are not accustomed to reading assembly but still want to understand what it does.

loop:
41 8b c2            mov    %r10d,%eax ; copy the pointer (an offset into the linear memory, to be exact) into another register
44 89 5c 06 00      mov    %r11d,0x0(%rsi,%rax,1) ; write 4 bytes into the buffer (rsi is the linear memory base pointer; this instruction adds it to the offset in rax/eax before dereferencing)
41 83 c2 04         add    $0x4,%r10d ; move the pointer forward by 4 bytes
45 39 ca            cmp    %r9d,%r10d ; have we reached the end pointer?
0f 82 eb ff ff ff   jb loop ; if not, repeat the loop

So now let me show this very same snippet a few more times, but at different offsets in memory.

(1) Offset of 0 (fast execution)

11ceaeb:  2 (+43):  41 8b c2                 mov    %r10d,%eax
11ceaee:  2 (+46):  44 89 5c 06 00           mov    %r11d,0x0(%rsi,%rax,1)
11ceaf3:  2 (+51):  41 83 c2 04              add    $0x4,%r10d
11ceaf7:  2 (+55):  45 39 ca                 cmp    %r9d,%r10d
11ceafa:  2 (+58):  0f 82 eb ff ff ff        jb     loop

(2) Offset of 1 (slow execution)

11ceaec:  2 (+44):  41 8b c2                 mov    %r10d,%eax
11ceaef:  2 (+47):  44 89 5c 06 00           mov    %r11d,0x0(%rsi,%rax,1)
11ceaf4:  2 (+52):  41 83 c2 04              add    $0x4,%r10d
11ceaf8:  2 (+56):  45 39 ca                 cmp    %r9d,%r10d
11ceafb:  2 (+59):  0f 82 eb ff ff           jb     loop
11ceb00:  3 (+00):  ff                       (last instruction continued)

...all of the offsets in-between here are also slow...

(3) Offset of 13 (still slow execution)

11ceaf8:  1 (+56):  41 8b c2                 mov    %r10d,%eax
11ceafb:  1 (+59):  44 89 5c 06 00           mov    %r11d,0x0(%rsi,%rax,1)
11ceb00:  2 (+00):  41 83 c2 04              add    $0x4,%r10d
11ceb04:  2 (+04):  45 39 ca                 cmp    %r9d,%r10d
11ceb07:  2 (+07):  0f 82 eb ff ff ff        jb     loop

... all of the offsets in-between here are also slow...

(4) Offset of 20 (still slow execution)

11ceaff:  1 (+63):  41                       mov    %r10d,%eax
11ceb00:  2 (+00):  8b c2                    (last instruction continued)
11ceb02:  2 (+02):  44 89 5c 06 00           mov    %r11d,0x0(%rsi,%rax,1)
11ceb07:  2 (+07):  41 83 c2 04              add    $0x4,%r10d
11ceb0b:  2 (+11):  45 39 ca                 cmp    %r9d,%r10d
11ceb0e:  2 (+14):  0f 82 eb ff ff ff        jb     loop

(5) Offset of 21 (fast execution again!)

11ceb00:  2 (+00):  41 8b c2                 mov    %r10d,%eax
11ceb03:  2 (+03):  44 89 5c 06 00           mov    %r11d,0x0(%rsi,%rax,1)
11ceb08:  2 (+08):  41 83 c2 04              add    $0x4,%r10d
11ceb0c:  2 (+12):  45 39 ca                 cmp    %r9d,%r10d
11ceb0f:  2 (+15):  0f 82 eb ff ff ff        jb     loop

The first number is the address of the instruction, the second number is the cache line, and the third number (the one with a '+') is the offset within that cache line.

Noticed anything?

Yes! In every slow case the execution of this loop straddles two cache lines! As long as all of the code fits within a single cache line (64 bytes) then everything is fast, but if it doesn't then it slows down dramatically.
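The straddling check itself is simple arithmetic on the instruction addresses; a small sketch (the loop body is 21 bytes long, per the dumps above; the helper is illustrative, not tooling from this PR):

```rust
// Does a span of `len` bytes starting at address `start` cross a 64-byte
// cache line boundary? It does if its first and last bytes land on
// different cache lines.
const CACHE_LINE: u64 = 64;

fn straddles(start: u64, len: u64) -> bool {
    start / CACHE_LINE != (start + len - 1) / CACHE_LINE
}

fn main() {
    // Case (1): the loop starts at 0x11ceaeb (offset +43) and its 21 bytes
    // end at offset +63, still within one cache line -> fast.
    println!("{}", straddles(0x11ceaeb, 21)); // prints false
    // Case (2): one byte later (offset +44) the last byte spills into the
    // next cache line -> slow.
    println!("{}", straddles(0x11ceaec, 21)); // prints true
}
```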

Conclusion

Depending on the phase of the moon wasmtime/cranelift can emit the inner loop of memset in such a way that it straddles two cache lines, tanking its performance.

Personally I'd classify this as a bug in wasmtime/cranelift; a native compiler (like LLVM) will insert an appropriate number of nops before loops to align them, and I don't really see why wasmtime couldn't do the same. I'll report this to wasmtime as an issue.

@athei
Member

athei commented Sep 7, 2022

> @athei so should I regenerate the weights again here?

Yes. I assume you resolved the conflict by choosing what is currently on master.

@bkchr
Member Author

bkchr commented Sep 7, 2022

/cmd queue -c bench-bot $ pallet dev pallet_contracts

@command-bot

command-bot bot commented Sep 7, 2022

@bkchr https://gitlab.parity.io/parity/mirrors/substrate/-/jobs/1818801 was started for your command "$PIPELINE_SCRIPTS_DIR/bench-bot.sh" pallet dev pallet_contracts. Check out https://gitlab.parity.io/parity/mirrors/substrate/-/pipelines?page=1&scope=all&username=group_605_bot to know what else is being executed currently.

Comment /cmd cancel 55-80e0ae84-fd12-4c65-868d-2b1344a05209 to cancel this command or /cmd cancel to cancel all commands in this pull request.

@command-bot

command-bot bot commented Sep 7, 2022

@bkchr Command "$PIPELINE_SCRIPTS_DIR/bench-bot.sh" pallet dev pallet_contracts has finished. Result: https://gitlab.parity.io/parity/mirrors/substrate/-/jobs/1818801 has finished. If any artifacts were generated, you can download them from https://gitlab.parity.io/parity/mirrors/substrate/-/jobs/1818801/artifacts/download.

@athei
Member

athei commented Sep 8, 2022

bot merge

@paritytech-processbot

Waiting for commit status.

@paritytech-processbot

Merge cancelled due to error. Error: Statuses failed for f8770aa

@bkchr bkchr merged commit e202092 into master Sep 8, 2022
@bkchr bkchr deleted the bkchr-upgrade-wasmtime-0.40 branch September 8, 2022 08:57
ark0f pushed a commit to gear-tech/substrate that referenced this pull request Feb 27, 2023
* Upgrade wasmtime to 0.40.0

* ".git/.scripts/bench-bot.sh" pallet dev pallet_contracts

* Update in the other Cargo.toml

* ".git/.scripts/bench-bot.sh" pallet dev pallet_contracts

Co-authored-by: command-bot <>
Co-authored-by: Alexander Theißen <alex.theissen@me.com>
Successfully merging this pull request may close these issues.

Custom substrate node does not start in a VM without --wasm-execution interpreted-i-know-what-i-do
5 participants