This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

Upgrade wasmtime to 0.40.0 #12096

Merged
merged 9 commits into from
Sep 8, 2022

Conversation

bkchr
Member

@bkchr bkchr commented Aug 23, 2022

Closes: #12073

@bkchr bkchr added A2-insubstantial Pull request requires no code review (e.g., a sub-repository hash update). B0-silent Changes should not be mentioned in any release notes C1-low PR touches the given topic and has a low impact on builders. D3-trivial 🧸 PR contains trivial changes in a runtime directory that do not require an audit labels Aug 23, 2022
@bkchr bkchr requested review from koute and pepyakin August 23, 2022 21:25
@koute
Contributor

koute commented Aug 25, 2022

Looking at the wasmtime release notes, this line caught my attention:

> Many more instructions are now implemented with ISLE instead of handwritten lowerings.

So I ran our basic benchmarks; those are mostly unchanged, all good. But then I ran a small subset of our pallet benches, and it seems like there are some huge performance regressions here. (These numbers are the newest master vs this PR.)

The full table of all of the benches I ran:

Benchmark results
time %
pallet_contracts/on_process_deletion_queue_batch 103.3
pallet_contracts/on_initialize_per_trie_key 102.2
pallet_contracts/on_initialize_per_queue_item 100.7
pallet_contracts/reinstrument 99.9
pallet_contracts/call_with_code_per_byte 113.2
pallet_contracts/instantiate_with_code 105.2
pallet_contracts/instantiate 113.1
pallet_contracts/call 140.1
pallet_contracts/upload_code 102.3
pallet_contracts/remove_code 105.1
pallet_contracts/set_code 103.0
pallet_contracts/seal_caller 127.9
pallet_contracts/seal_is_contract 112.9
pallet_contracts/seal_code_hash 114.5
pallet_contracts/seal_own_code_hash 124.6
pallet_contracts/seal_caller_is_origin 139.8
pallet_contracts/seal_address 128.9
pallet_contracts/seal_gas_left 129.4
pallet_contracts/seal_balance 120.3
pallet_contracts/seal_value_transferred 130.6
pallet_contracts/seal_minimum_balance 130.2
pallet_contracts/seal_block_number 129.8
pallet_contracts/seal_now 129.3
pallet_contracts/seal_weight_to_fee 118.2
pallet_contracts/seal_gas 124.0
pallet_contracts/seal_input 131.2
pallet_contracts/seal_input_per_kb 91.5
pallet_contracts/seal_return 161.4
pallet_contracts/seal_return_per_kb 151.0
pallet_contracts/seal_terminate 153.9
pallet_contracts/seal_random 117.0
pallet_contracts/seal_deposit_event 114.4
pallet_contracts/seal_deposit_event_per_topic_and_kb 110.4
pallet_contracts/seal_debug_message 120.4
pallet_contracts/seal_set_storage 112.9
pallet_contracts/seal_set_storage_per_new_kb 117.7
pallet_contracts/seal_set_storage_per_old_kb 115.3
pallet_contracts/seal_clear_storage 111.3
pallet_contracts/seal_clear_storage_per_kb 114.1
pallet_contracts/seal_get_storage 112.0
pallet_contracts/seal_get_storage_per_kb 113.1
pallet_contracts/seal_contains_storage 112.4
pallet_contracts/seal_contains_storage_per_kb 115.7
pallet_contracts/seal_take_storage 110.6
pallet_contracts/seal_take_storage_per_kb 112.3
pallet_contracts/seal_transfer 111.2
pallet_contracts/seal_call 169.5
pallet_contracts/seal_delegate_call 173.1
pallet_contracts/seal_call_per_transfer_clone_kb 129.0
pallet_contracts/seal_instantiate 151.5
pallet_contracts/seal_instantiate_per_transfer_salt_kb 113.4
pallet_contracts/seal_hash_sha2_256 124.5
pallet_contracts/seal_hash_sha2_256_per_kb 109.4
pallet_contracts/seal_hash_keccak_256 120.2
pallet_contracts/seal_hash_keccak_256_per_kb 105.4
pallet_contracts/seal_hash_blake2_256 121.7
pallet_contracts/seal_hash_blake2_256_per_kb 108.5
pallet_contracts/seal_hash_blake2_128 124.3
pallet_contracts/seal_hash_blake2_128_per_kb 109.3
pallet_contracts/seal_ecdsa_recover 107.9
pallet_contracts/seal_ecdsa_to_eth_address 107.4
pallet_contracts/seal_set_code_hash 111.6
pallet_contracts/instr_i64const 163.1
pallet_contracts/instr_i64load 145.3
pallet_contracts/instr_i64store 141.0
pallet_contracts/instr_select 139.8
pallet_contracts/instr_if 144.5
pallet_contracts/instr_br 155.2
pallet_contracts/instr_br_if 148.0
pallet_contracts/instr_br_table 147.9
pallet_contracts/instr_br_table_per_entry 189.8
pallet_contracts/instr_call 127.4
pallet_contracts/instr_call_indirect 135.7
pallet_contracts/instr_call_indirect_per_param 144.0
pallet_contracts/instr_local_get 165.8
pallet_contracts/instr_local_set 164.0
pallet_contracts/instr_local_tee 156.4
pallet_contracts/instr_global_get 152.8
pallet_contracts/instr_global_set 152.2
pallet_contracts/instr_memory_current 159.8
pallet_contracts/instr_memory_grow 156.0
pallet_contracts/instr_i64clz 151.3
pallet_contracts/instr_i64ctz 153.0
pallet_contracts/instr_i64popcnt 153.2
pallet_contracts/instr_i64eqz 152.6
pallet_contracts/instr_i64extendsi32 153.6
pallet_contracts/instr_i64extendui32 154.0
pallet_contracts/instr_i32wrapi64 153.2
pallet_contracts/instr_i64eq 146.0
pallet_contracts/instr_i64ne 146.5
pallet_contracts/instr_i64lts 146.1
pallet_contracts/instr_i64ltu 146.7
pallet_contracts/instr_i64gts 146.3
pallet_contracts/instr_i64gtu 145.3
pallet_contracts/instr_i64les 146.0
pallet_contracts/instr_i64leu 145.3
pallet_contracts/instr_i64ges 145.6
pallet_contracts/instr_i64geu 146.0
pallet_contracts/instr_i64add 146.0
pallet_contracts/instr_i64sub 145.2
pallet_contracts/instr_i64mul 146.1
pallet_contracts/instr_i64divs 146.2
pallet_contracts/instr_i64divu 145.2
pallet_contracts/instr_i64rems 146.0
pallet_contracts/instr_i64remu 146.8
pallet_contracts/instr_i64and 145.9
pallet_contracts/instr_i64or 144.7
pallet_contracts/instr_i64xor 142.4
pallet_contracts/instr_i64shl 143.5
pallet_contracts/instr_i64shrs 145.0
pallet_contracts/instr_i64shru 145.1
pallet_contracts/instr_i64rotl 145.2
pallet_contracts/instr_i64rotr 145.7
pallet_identity/add_registrar 102.1
pallet_identity/set_identity 108.4
pallet_identity/set_subs_new 107.2
pallet_identity/set_subs_old 107.3
pallet_identity/clear_identity 107.2
pallet_identity/request_judgement 106.9
pallet_identity/cancel_request 107.5
pallet_identity/set_fee 103.0
pallet_identity/set_account_id 103.6
pallet_identity/set_fields 103.8
pallet_identity/provide_judgement 106.4
pallet_identity/kill_identity 108.1
pallet_identity/add_sub 105.5
pallet_identity/rename_sub 103.0
pallet_identity/remove_sub 105.2
pallet_identity/quit_sub 104.3
pallet_multisig/as_multi_threshold_1 102.7
pallet_multisig/as_multi_create 105.6
pallet_multisig/as_multi_create_store 104.5
pallet_multisig/as_multi_approve 104.7
pallet_multisig/as_multi_approve_store 105.2
pallet_multisig/as_multi_complete 106.7
pallet_multisig/approve_as_multi_create 106.3
pallet_multisig/approve_as_multi_approve 104.7
pallet_multisig/approve_as_multi_complete 108.1
pallet_multisig/cancel_as_multi 106.9

Here are the top regressions:

Benchmark results
time %
pallet_contracts/instr_br_table_per_entry 189.8
pallet_contracts/seal_delegate_call 173.1
pallet_contracts/seal_call 169.5
pallet_contracts/instr_local_get 165.8
pallet_contracts/instr_local_set 164.0
pallet_contracts/instr_i64const 163.1
pallet_contracts/seal_return 161.4
pallet_contracts/instr_memory_current 159.8
pallet_contracts/instr_local_tee 156.4
pallet_contracts/instr_memory_grow 156.0
pallet_contracts/instr_br 155.2
pallet_contracts/instr_i64extendui32 154.0
pallet_contracts/seal_terminate 153.9
pallet_contracts/instr_i64extendsi32 153.6
pallet_contracts/instr_i64popcnt 153.2
pallet_contracts/instr_i32wrapi64 153.2
pallet_contracts/instr_i64ctz 153.0
pallet_contracts/instr_global_get 152.8
pallet_contracts/instr_i64eqz 152.6
pallet_contracts/instr_global_set 152.2
pallet_contracts/seal_instantiate 151.5
pallet_contracts/instr_i64clz 151.3
pallet_contracts/seal_return_per_kb 151.0
pallet_contracts/instr_br_if 148.0
pallet_contracts/instr_br_table 147.9
pallet_contracts/instr_i64remu 146.8
pallet_contracts/instr_i64ltu 146.7
pallet_contracts/instr_i64ne 146.5
pallet_contracts/instr_i64gts 146.3
pallet_contracts/instr_i64divs 146.2
pallet_contracts/instr_i64mul 146.1
pallet_contracts/instr_i64lts 146.1
pallet_contracts/instr_i64rems 146.0
pallet_contracts/instr_i64add 146.0
pallet_contracts/instr_i64geu 146.0
pallet_contracts/instr_i64les 146.0
pallet_contracts/instr_i64eq 146.0
pallet_contracts/instr_i64and 145.9
pallet_contracts/instr_i64rotr 145.7
pallet_contracts/instr_i64ges 145.6
pallet_contracts/instr_i64gtu 145.3
pallet_contracts/instr_i64load 145.3
pallet_contracts/instr_i64leu 145.3
pallet_contracts/instr_i64rotl 145.2
pallet_contracts/instr_i64divu 145.2
pallet_contracts/instr_i64sub 145.2
pallet_contracts/instr_i64shru 145.1
pallet_contracts/instr_i64shrs 145.0
pallet_contracts/instr_i64or 144.7
pallet_contracts/instr_if 144.5

So normal pallets are roughly ~5% slower, while the contracts pallet is around ~50% slower. This is very worrying. Assuming I haven't screwed something up while benchmarking (this was just a quick back-of-the-envelope run), merging this might tank the performance of chains like e.g. Moonbeam.
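For reference, the "time %" figures in the tables are simply the new raw time expressed as a percentage of the old one. A minimal sketch (the helper name is hypothetical, not part of the benchmarking tooling; the sample numbers are the seal_return_per_kb raw times measured later in this thread):

```rust
// Hypothetical helper (not the actual tooling): express a new raw extrinsic
// time as a percentage of the old one, as in the "time %" tables above.
fn time_percent(old_ns: f64, new_ns: f64) -> f64 {
    new_ns / old_ns * 100.0
}

fn main() {
    // seal_return_per_kb: 218_880 ns on wasmtime 0.38 vs 362_810 ns on 0.40.
    let pct = time_percent(218_880.0, 362_810.0);
    println!("{pct:.1}"); // prints 165.8, i.e. a ~66% slowdown
}
```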

@bkchr
Member Author

bkchr commented Aug 25, 2022

Okay, that sounds bad. Maybe we should try the bench bot and look at its results to see whether it comes to similar conclusions.

@bkchr
Member Author

bkchr commented Aug 25, 2022

/cmd queue -c bench-bot $ pallet dev pallet_contracts

@command-bot

command-bot bot commented Aug 25, 2022

@bkchr https://gitlab.parity.io/parity/mirrors/substrate/-/jobs/1775806 was started for your command "$PIPELINE_SCRIPTS_DIR/bench-bot.sh" pallet dev pallet_contracts. Check out https://gitlab.parity.io/parity/mirrors/substrate/-/pipelines?page=1&scope=all&username=group_605_bot to know what else is being executed currently.

Comment /cmd cancel 32-64a42c2b-6b5c-4bfc-abaa-db17c0648d4b to cancel this command or /cmd cancel to cancel all commands in this pull request.

@command-bot

command-bot bot commented Aug 25, 2022

@bkchr Command "$PIPELINE_SCRIPTS_DIR/bench-bot.sh" pallet dev pallet_contracts has finished. Result: https://gitlab.parity.io/parity/mirrors/substrate/-/jobs/1775806 has finished. If any artifacts were generated, you can download them from https://gitlab.parity.io/parity/mirrors/substrate/-/jobs/1775806/artifacts/download.

@ggwpez
Member

ggwpez commented Aug 25, 2022

Not great but also not horrible either: SWC,
(screenshot: SWC weight-comparison table)

> So normal pallets are roughly ~5% slower, while the contracts pallet is around ~50% slower.

Did you take the component ranges into account?
Often the asymptotic weights are quite different from the base values.

@bkchr
Member Author

bkchr commented Aug 25, 2022

> Not great but also not horrible either: SWC, (screenshot)
>
> > So normal pallets are roughly ~5% slower, while the contracts pallet is around ~50% slower.
>
> Did you take the component ranges into account? Often the asymptotic weights are quite different from the base values.

Nice tool!

So, what are we going to do? I assume the wasmtime guys will not accept reverting their commit that makes our code slower? 😬

@bkchr
Member Author

bkchr commented Aug 25, 2022

There are quite a lot of these migrations to this ISLE syntax. Does anyone want to dig into this? 🙈

@koute
Contributor

koute commented Aug 26, 2022

> Did you take the component ranges into account? Often the asymptotic weights are quite different from the base values.

My comparisons were based on the raw times from the .json dump you can get the benchmarks to output. (Which are in general more appropriate for apples-to-apples performance comparisons than just comparing weights.)

> Nice tool!

I agree! @ggwpez How exactly are the "Old [Time]" and "New [Time]" columns calculated there? Are those based on raw times?

> So, what are we going to do? I assume the wasmtime guys will not accept reverting their commit that makes our code slower?

I'm sure they won't, but they do care about performance. A regression like this is surely bound to affect other people too. So what I think we should do is to make a minimal reproduction that clearly demonstrates the problem (port the benchmark into a standalone WASM blob and make a minimal standalone host which calls into it), and post an issue on their issue tracker.

> There are quite a lot of these migrations to this ISLE syntax. Does anyone want to dig into this?

I could do this but... I just said the other day I don't want to get distracted anymore before clearing up my current TODO list, so... if there are any volunteers, feel free to take a stab, and if not then I'll do it... (:

@athei
Member

athei commented Aug 26, 2022

It looks like the regression is in wasmi, given that the instruction benchmarks suffered the most. So maybe just running coremark with wasmi would be a good test. Wasmi itself will be run by wasmtime in this case, of course.

@ggwpez
Member

ggwpez commented Aug 26, 2022

> My comparisons were based on the raw times from the .json dump you can get the benchmarks to output. (Which are in general more appropriate for apples-to-apples performance comparisons than just comparing weights.)

Yeah, for raw WASM performance it makes sense not to look at the post-processed weights.

> I agree! @ggwpez How exactly are the "Old [Time]" and "New [Time]" columns calculated there? Are those based on raw times?

It uses the weight formulas from the Rust weight.rs files by parsing the code.
So in your case it's probably better to use raw data, but the weights are what we want to keep low.
You can click on each extrinsic to see the formula.
The upper bounds for each component are also extracted from the source code to provide exact worst-case weights:
(screenshot: per-extrinsic weight formula with component upper bounds)

This shows that these weights were probably outdated, since the number of storage accesses changed.
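As a rough illustration of what "exact worst-case weights" means here: a benchmarked weight is, to a first approximation, a linear formula in its components, and the worst case evaluates that formula at each component's extracted upper bound. A sketch with made-up numbers (the function and its constants are hypothetical, not SWC's actual code):

```rust
// Hypothetical sketch: evaluate a linear weight formula
// w = base + sum(slope_i * component_i) at each component's upper bound.
fn worst_case_weight(base: u64, components: &[(u64, u64)]) -> u64 {
    components
        .iter()
        .map(|(slope, upper_bound)| slope * upper_bound) // worst case: component at its max
        .fold(base, |acc, term| acc + term)
}

fn main() {
    // e.g. w(n) = 100_000 + 2_000 * n weight units, with n (say, input KB) <= 16.
    println!("{}", worst_case_weight(100_000, &[(2_000, 16)])); // prints 132000
}
```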

@bkchr
Member Author

bkchr commented Aug 29, 2022

> This shows that these weights were probably outdated, since the number of storage accesses changed.

Sounds like we should regenerate these weights on master to have a fair comparison?

@koute
Contributor

koute commented Aug 31, 2022

Looks like there are no volunteers after all, so I'll do it. (:

@koute
Contributor

koute commented Aug 31, 2022

> It looks like the regression is in wasmi, given that the instruction benchmarks suffered the most. So maybe just running coremark with wasmi would be a good test. Wasmi itself will be run by wasmtime in this case, of course.

That doesn't look to be the case from what I can see. I've hacked up a benchmark across many wasmtime versions, and here are the results I got for coremark (coremark scores, so higher is better):

wasmtime 0.40: 174.35779
wasmtime 0.39: 174.94751
wasmtime 0.38: 175.75722
wasmtime 0.37: 140.76576
wasmtime 0.36: 178.81624
wasmtime 0.35: 174.08461
wasmtime 0.34: 182.12724
wasmtime 0.33: 178.41212
native:        587.1677

So there might be a slight performance regression between 0.38 and 0.40, but it's nothing major.

I'll try something heavier. Maybe the wasmi benches from here.

@koute
Contributor

koute commented Sep 1, 2022

I still don't know why this happens, but I've just discovered something funny. Apparently doing this in frame/contracts/Cargo.toml partially fixes the regression for at least one benchmark... (I haven't rerun all of them yet.)

- sp-io = { version = "6.0.0", default-features = false, path = "../../primitives/io" }
+ sp-io = { version = "6.0.0", default-features = false, path = "../../primitives/io", features = ["improved_panic_error_reporting"] }

extrinsic_time for pallet_contracts/seal_return_per_kb on wasmtime 0.40 before this change: 362810
extrinsic_time for pallet_contracts/seal_return_per_kb on wasmtime 0.40 after this change: 231050
extrinsic_time for pallet_contracts/seal_return_per_kb on wasmtime 0.38: 218880

I've run this multiple times to be sure; the results are consistent.

It just doesn't make much sense, as all it does is enable the feature from this PR of mine from a while back, so my guess is that it affects the optimizations somehow in a weird way? That would explain why I can't reproduce the issue in a minimal example: I've ported the contracts benchmark into a minimal example, but I see no regression there. So I'll either have to go the other way around and start with a full Substrate build, deleting code until I end up with a minimal reproduction, or (and this is probably a better idea) bisect wasmtime between 0.38 and 0.40, see which commit affects this, and go from there.

@athei
Member

athei commented Sep 1, 2022

Your conclusion seems plausible. Optimization behavior can be erratic. This is further supported by the fact that you can't reproduce the regression when benchmarking wasmi directly.

@pepyakin
Contributor

pepyakin commented Sep 1, 2022

I am not aware of how that could possibly affect optimizations.

However, 0.40 implemented a custom stack-walking mechanism which is supposed to make those things faster (although I haven't dug into the current problem enough to be confident).

@koute
Contributor

koute commented Sep 2, 2022

Okay, this is one of the weirdest heisenbugs I've ever seen.

I did bisect wasmtime, and it pointed to this commit. When reverted, the extrinsic times returned to normal. So far so good. While trying to investigate in more detail I recompiled with a WASM_BUILD_RUSTFLAGS="-g" flag and... the behavior flipped. That is, I now only got normal extrinsic times with that commit applied, and when it was reverted (which previously "fixed" the issue) the problem came back! While trying to debug this I made various other changes, and I hacked the executor so that I can forcibly replace the WASM blob at runtime through an environment variable. I then reran all of the previous WASM blob variants I had already tested, but without recompiling the host. What I found was that some of the WASM blobs which were "good" now ran like the "bad" ones, and some of the "bad" ones ran just like the "good" ones.

So what I can conclude right now:

  1. I might be starting to go crazy debugging this. (:
  2. We have a really weird bug somewhere.
  3. The issue is somewhere on the boundary of the host and the WASM blob; changing either can trigger or hide the issue.
  4. Performance-wise it might not affect real production code, but even if that's the case I'd really like to know why this happens; I don't want our weights semi-randomly changing their values by ~20-80% based on some voodoo.

I still don't know exactly why this happens, but I can already see it's (as usual) probably going to be something very stupid.

@bkchr
Member Author

bkchr commented Sep 2, 2022

@koute maybe there is something like the ahash compile-time randomness? I.e. something that gets sourced at compile time and changes the value based on how you compile it? Or that may change random code that doesn't really seem to be related?

@koute
Contributor

koute commented Sep 5, 2022

> @koute maybe there is something like the ahash compile-time randomness? I.e. something that gets sourced at compile time and changes the value based on how you compile it? Or that may change random code that doesn't really seem to be related?

Yeah, that's one of the things I'm checking. Although it seems like the behavior is consistent between compilations (if I compile the same code with exactly the same flags I get the same behavior), so if it's due to the randomness it's not using a truly random seed.

Contributor

@koute koute left a comment


Should be fine to merge after making sure the rustix version we use is aligned with what wasmtime uses. (IIRC they should be the same now so we shouldn't need to have two of them anymore)

@pepyakin
Contributor

pepyakin commented Sep 6, 2022

FWIW, we have #10707 for tracking the non-MVP features on the executor

@@ -18,22 +18,22 @@
//! Autogenerated weights for pallet_contracts
//!
//! THIS FILE WAS AUTO-GENERATED USING THE SUBSTRATE BENCHMARK CLI VERSION 4.0.0-dev
-//! DATE: 2022-06-22, STEPS: `50`, REPEAT: 20, LOW RANGE: `[]`, HIGH RANGE: `[]`
+//! DATE: 2022-08-25, STEPS: `50`, REPEAT: 20, LOW RANGE: `[]`, HIGH RANGE: `[]`
Contributor


Do we need that file updated? Why this, and why not other benchmarks?

Member Author


Because we only had the bot run for this.

@@ -18,7 +18,7 @@ codec = { package = "parity-scale-codec", version = "3.0.0", default-features =
impl-trait-for-tuples = "0.2.2"
log = { version = "0.4.17", optional = true }
wasmi = { version = "0.9.1", optional = true }
-wasmtime = { version = "0.38.0", default-features = false, optional = true }
+wasmtime = { version = "0.40.0", default-features = false, optional = true }
Contributor


Nit: make it 0.40.1 as in executor's cargo.toml?

Member

@athei athei left a comment


Thanks for tracking this down :)

@bkchr
Member Author

bkchr commented Sep 7, 2022

bot merge

@paritytech-processbot

Waiting for commit status.

@paritytech-processbot

Merge cancelled due to error. Error: Statuses failed for 4a86689

@bkchr
Member Author

bkchr commented Sep 7, 2022

@athei so should I regenerate the weights again here?

@koute
Contributor

koute commented Sep 7, 2022

After further investigation I have some final notes regarding the slowdown we saw here. Let me quickly walk you through it.

Details

The memset function which was the culprit here is defined in Rust's compiler-builtins and looks like this: (It's called set_bytes, but trust me, it's memset.)

#[inline(always)]
pub unsafe fn set_bytes(mut s: *mut u8, c: u8, mut n: usize) {
    #[inline(always)]
    pub unsafe fn set_bytes_bytes(mut s: *mut u8, c: u8, n: usize) {
        let end = s.add(n);
        while s < end {
            *s = c;
            s = s.add(1);
        }
    }

    #[inline(always)]
    pub unsafe fn set_bytes_words(s: *mut u8, c: u8, n: usize) {
        let mut broadcast = c as usize;
        let mut bits = 8;
        while bits < WORD_SIZE * 8 {
            broadcast |= broadcast << bits;
            bits *= 2;
        }

        let mut s_usize = s as *mut usize;
        let end = s.add(n) as *mut usize;

        while s_usize < end {
            *s_usize = broadcast;
            s_usize = s_usize.add(1);
        }
    }

    if likely(n >= WORD_COPY_THRESHOLD) {
        // Align s
        // Because of n >= 2 * WORD_SIZE, dst_misalignment < n
        let misalignment = (s as usize).wrapping_neg() & WORD_MASK;
        set_bytes_bytes(s, c, misalignment);
        s = s.add(misalignment);
        n -= misalignment;

        let n_words = n & !WORD_MASK;
        set_bytes_words(s, c, n_words);
        s = s.add(n_words);
        n -= n_words;
    }
    set_bytes_bytes(s, c, n);
}

Now, you can just ignore most of this code, as it's irrelevant to the performance and only handles things like misaligned pointers or very small buffers. What we're most interested in is the little inner loop inside set_bytes_words, in which every big memset will spend 99% of its time:

while s_usize < end {
    *s_usize = broadcast;
    s_usize = s_usize.add(1);
}

It just fills the memory in a loop, one usize-sized chunk at a time. Super simple.
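The byte-broadcast step earlier in set_bytes_words is also worth seeing in isolation; here is a standalone sketch of that doubling loop (the same logic as above, extracted for clarity):

```rust
// The broadcast step from set_bytes_words: replicate the fill byte `c`
// into every byte of a usize by repeatedly OR-ing in a shifted copy.
const WORD_SIZE: usize = core::mem::size_of::<usize>();

fn broadcast(c: u8) -> usize {
    let mut broadcast = c as usize;
    let mut bits = 8;
    while bits < WORD_SIZE * 8 {
        broadcast |= broadcast << bits; // 0xAB -> 0xABAB -> 0xABABABAB -> ...
        bits *= 2;
    }
    broadcast
}

fn main() {
    // On a 64-bit target this prints 0xabababababababab.
    println!("{:#x}", broadcast(0xAB));
}
```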

Now, here's how our friend wasmtime/cranelift compiles down this little loop into assembly. This is the exact code we were running; comments are mine for those who are not accustomed to reading assembly but still want to understand what it does.

loop:
41 8b c2            mov    %r10d,%eax ; copy the pointer (an offset into the linear memory, to be exact) into another register
44 89 5c 06 00      mov    %r11d,0x0(%rsi,%rax,1) ; write 4 bytes into the buffer (rsi is the linear memory base pointer; this instruction adds it to the offset in rax/eax before dereferencing)
41 83 c2 04         add    $0x4,%r10d ; move the pointer forward by 4 bytes
45 39 ca            cmp    %r9d,%r10d ; have we reached the end pointer?
0f 82 eb ff ff ff   jb loop ; if not, repeat the loop

So now let me show this very same snippet a few more times, but at different offsets in memory.

(1) Offset of 0 (fast execution)

11ceaeb:  2 (+43):  41 8b c2                 mov    %r10d,%eax
11ceaee:  2 (+46):  44 89 5c 06 00           mov    %r11d,0x0(%rsi,%rax,1)
11ceaf3:  2 (+51):  41 83 c2 04              add    $0x4,%r10d
11ceaf7:  2 (+55):  45 39 ca                 cmp    %r9d,%r10d
11ceafa:  2 (+58):  0f 82 eb ff ff ff        jb     loop

(2) Offset of 1 (slow execution)

11ceaec:  2 (+44):  41 8b c2                 mov    %r10d,%eax
11ceaef:  2 (+47):  44 89 5c 06 00           mov    %r11d,0x0(%rsi,%rax,1)
11ceaf4:  2 (+52):  41 83 c2 04              add    $0x4,%r10d
11ceaf8:  2 (+56):  45 39 ca                 cmp    %r9d,%r10d
11ceafb:  2 (+59):  0f 82 eb ff ff           jb     loop
11ceb00:  3 (+00):  ff                       (last instruction continued)

...all of the offsets in-between here are also slow...

(3) Offset of 13 (still slow execution)

11ceaf8:  1 (+56):  41 8b c2                 mov    %r10d,%eax
11ceafb:  1 (+59):  44 89 5c 06 00           mov    %r11d,0x0(%rsi,%rax,1)
11ceb00:  2 (+00):  41 83 c2 04              add    $0x4,%r10d
11ceb04:  2 (+04):  45 39 ca                 cmp    %r9d,%r10d
11ceb07:  2 (+07):  0f 82 eb ff ff ff        jb     loop

... all of the offsets in-between here are also slow...

(4) Offset of 20 (still slow execution)

11ceaff:  1 (+63):  41                       mov    %r10d,%eax
11ceb00:  2 (+00):  8b c2                    (last instruction continued)
11ceb02:  2 (+02):  44 89 5c 06 00           mov    %r11d,0x0(%rsi,%rax,1)
11ceb07:  2 (+07):  41 83 c2 04              add    $0x4,%r10d
11ceb0b:  2 (+11):  45 39 ca                 cmp    %r9d,%r10d
11ceb0e:  2 (+14):  0f 82 eb ff ff ff        jb     loop

(5) Offset of 21 (fast execution again!)

11ceb00:  2 (+00):  41 8b c2                 mov    %r10d,%eax
11ceb03:  2 (+03):  44 89 5c 06 00           mov    %r11d,0x0(%rsi,%rax,1)
11ceb08:  2 (+08):  41 83 c2 04              add    $0x4,%r10d
11ceb0c:  2 (+12):  45 39 ca                 cmp    %r9d,%r10d
11ceb0f:  2 (+15):  0f 82 eb ff ff ff        jb     loop

The first number is the address of the instruction, the second number is the cache line, and the third number (the one with a '+') is the offset within that cache line.

Noticed anything?

Yes! In every slow case the execution of this loop straddles two cache lines! As long as all of the code fits within a single cache line (64 bytes) then everything is fast, but if it doesn't then it slows down dramatically.
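The straddling check itself is simple arithmetic on the instruction addresses; a small sketch (the loop body is 21 bytes long, per the dumps above; the helper is illustrative, not tooling from this PR):

```rust
// Does a span of `len` bytes starting at address `start` cross a 64-byte
// cache line boundary? It does if its first and last bytes land on
// different cache lines.
const CACHE_LINE: u64 = 64;

fn straddles(start: u64, len: u64) -> bool {
    start / CACHE_LINE != (start + len - 1) / CACHE_LINE
}

fn main() {
    // Case (1): the loop starts at 0x11ceaeb (offset +43) and its 21 bytes
    // end at offset +63, still within one cache line -> fast.
    println!("{}", straddles(0x11ceaeb, 21)); // prints false
    // Case (2): one byte later (offset +44) the last byte spills into the
    // next cache line -> slow.
    println!("{}", straddles(0x11ceaec, 21)); // prints true
}
```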

Conclusion

Depending on the phase of the moon wasmtime/cranelift can emit the inner loop of memset in such a way that it straddles two cache lines, tanking its performance.

Personally I'd classify this as a bug in wasmtime/cranelift; a native compiler (like LLVM) will insert an appropriate number of nops before loops to align them, and I don't really see why wasmtime couldn't do the same. I'll report this to wasmtime as an issue.

@athei
Member

athei commented Sep 7, 2022

> @athei so should I regenerate the weights again here?

Yes. I assume you resolved the conflict by choosing what is currently on master.

@bkchr
Member Author

bkchr commented Sep 7, 2022

/cmd queue -c bench-bot $ pallet dev pallet_contracts

@command-bot

command-bot bot commented Sep 7, 2022

@bkchr https://gitlab.parity.io/parity/mirrors/substrate/-/jobs/1818801 was started for your command "$PIPELINE_SCRIPTS_DIR/bench-bot.sh" pallet dev pallet_contracts. Check out https://gitlab.parity.io/parity/mirrors/substrate/-/pipelines?page=1&scope=all&username=group_605_bot to know what else is being executed currently.

Comment /cmd cancel 55-80e0ae84-fd12-4c65-868d-2b1344a05209 to cancel this command or /cmd cancel to cancel all commands in this pull request.

@command-bot

command-bot bot commented Sep 7, 2022

@bkchr Command "$PIPELINE_SCRIPTS_DIR/bench-bot.sh" pallet dev pallet_contracts has finished. Result: https://gitlab.parity.io/parity/mirrors/substrate/-/jobs/1818801 has finished. If any artifacts were generated, you can download them from https://gitlab.parity.io/parity/mirrors/substrate/-/jobs/1818801/artifacts/download.

@athei
Member

athei commented Sep 8, 2022

bot merge

@paritytech-processbot

Waiting for commit status.

@paritytech-processbot

Merge cancelled due to error. Error: Statuses failed for f8770aa

@bkchr bkchr merged commit e202092 into master Sep 8, 2022
@bkchr bkchr deleted the bkchr-upgrade-wasmtime-0.40 branch September 8, 2022 08:57
ark0f pushed a commit to gear-tech/substrate that referenced this pull request Feb 27, 2023
* Upgrade wasmtime to 0.40.0

* ".git/.scripts/bench-bot.sh" pallet dev pallet_contracts

* Update in the other Cargo.toml

* ".git/.scripts/bench-bot.sh" pallet dev pallet_contracts

Co-authored-by: command-bot <>
Co-authored-by: Alexander Theißen <alex.theissen@me.com>
Successfully merging this pull request may close these issues.

Custom substrate node does not start in a VM without --wasm-execution interpreted-i-know-what-i-do
5 participants