
Atomic compare_exchange(_weak) functions produce overly complicated asm code on thumbv7(e)m-none-eabi(hf) targets #79418

Open
qwerty19106 opened this issue Nov 25, 2020 · 4 comments
Labels
C-enhancement   Category: An issue proposing an enhancement or a PR with one.
C-optimization  Category: An issue highlighting optimization opportunities or PRs implementing such.
I-heavy         Issue: Problems and improvements with respect to binary size of generated code.
O-Arm           Target: 32-bit Arm processors (armv6, armv7, thumb...), including 64-bit Arm in AArch32 state.
T-compiler      Relevant to the compiler team, which will review and decide on the PR/issue.
WG-embedded     Working group: Embedded systems.

Comments

@qwerty19106

This code performs an atomic increment:

#![no_std]

use core::sync::atomic::*;

pub extern "C" fn atomic_usize_inc(ptr: &AtomicUsize) -> usize {
    let mut old = ptr.load(Ordering::Relaxed);

    loop {
        let new = old + 1;

        match ptr.compare_exchange_weak(old, new, Ordering::Relaxed, Ordering::Relaxed) {
            Ok(_old) => break _old,
            Err(_old) => old = _old,
        }
    }
}

Expected asm:

example::atomic_usize_inc:
1:
        ldr     r1, [r0]
        add     r3, r1, #1

        ldrex   r2, [r0]
        cmp     r2, r1
        itt     ne
        clrexne
        bne     1b
        strex   r1, r3, [r0]
        cbz     r1, 2f
        b       1b
2:
        movs    r0, r2

Produced asm (godbolt):

example::atomic_usize_inc:
        ldr     r2, [r0]
.LBB0_1:
        ldrex   r1, [r0]
        cmp     r1, r2
        bne     .LBB0_4
        adds    r2, #1
        strex   r3, r2, [r0]
        cbnz    r3, .LBB0_5
        movs    r2, #1
        b       .LBB0_6
.LBB0_4:
        clrex
.LBB0_5:
        movs    r2, #0
.LBB0_6:
        cbnz    r2, .LBB0_25
        ldrex   r2, [r0]
        cmp     r2, r1
        bne     .LBB0_10
        adds    r1, #1
        strex   r3, r1, [r0]
        cbnz    r3, .LBB0_11
        movs    r1, #1
        b       .LBB0_12
.LBB0_10:
        clrex
.LBB0_11:
        movs    r1, #0
.LBB0_12:
        cbnz    r1, .LBB0_24
        ldrex   r1, [r0]
        cmp     r1, r2
        bne     .LBB0_16
        adds    r2, #1
        strex   r3, r2, [r0]
        cbnz    r3, .LBB0_17
        movs    r2, #1
        b       .LBB0_18
.LBB0_16:
        clrex
.LBB0_17:
        movs    r2, #0
.LBB0_18:
        cbnz    r2, .LBB0_25
        ldrex   r2, [r0]
        cmp     r2, r1
        bne     .LBB0_22
        adds    r1, #1
        strex   r3, r1, [r0]
        cbnz    r3, .LBB0_23
        movs    r1, #1
        cmp     r1, #0
        beq     .LBB0_1
        b       .LBB0_24
.LBB0_22:
        clrex
.LBB0_23:
        movs    r1, #0
        cmp     r1, #0
        beq     .LBB0_1
.LBB0_24:
        mov     r1, r2
.LBB0_25:
        mov     r0, r1
        bx      lr

Code size is very important on Cortex-M targets (thumbv*) because some controllers have only 20KB flash!

Besides, the increment instruction (adds r2, #1) was moved into the ldrex/strex section. This loses the advantage of compare_exchange: evaluating the new value before ldrex so that the exclusive monitor is held for as few cycles as possible.

This particular code is redundant because we could simply call fetch_add, but other tasks do require compare_exchange_weak, for example an atomic increment with an upper bound (pseudocode):

pub extern "C" fn atomic_usize_inc_with_max(ptr: &AtomicUsize) -> (bool, usize) {
    atomic {
        let old = ptr.load(Ordering::Relaxed);
        if old == max {
            return (false, old);
        }

        ptr.store(old + 1, Ordering::Relaxed);
        return (true, old);
    }
}
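
For reference, that pseudocode can be written as an ordinary compare_exchange_weak loop. This is only a sketch: max is taken as a parameter here, and the extern "C"/tuple return is dropped since a Rust tuple is not FFI-safe.

use core::sync::atomic::{AtomicUsize, Ordering};

pub fn atomic_usize_inc_with_max(ptr: &AtomicUsize, max: usize) -> (bool, usize) {
    let mut old = ptr.load(Ordering::Relaxed);
    loop {
        if old == max {
            // Upper bound reached: do not attempt a store.
            break (false, old);
        }
        match ptr.compare_exchange_weak(old, old + 1, Ordering::Relaxed, Ordering::Relaxed) {
            Ok(prev) => break (true, prev),
            // Spurious failure or a concurrent update: retry with the freshly observed value.
            Err(prev) => old = prev,
        }
    }
}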

Meta

rustc --version --verbose:

rustc 1.50.0-nightly (1c389ffef 2020-11-24)
@qwerty19106 qwerty19106 added the C-bug Category: This is a bug. label Nov 25, 2020
@jonas-schievink jonas-schievink added C-enhancement Category: An issue proposing an enhancement or a PR with one. I-heavy Issue: Problems and improvements with respect to binary size of generated code. O-Arm Target: 32-bit Arm processors (armv6, armv7, thumb...), including 64-bit Arm in AArch32 state T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. WG-embedded Working group: Embedded systems and removed C-bug Category: This is a bug. labels Nov 25, 2020
@the8472
Member

the8472 commented Nov 25, 2020

Code size is very important on Cortex-M targets (thumbv*) because some controllers have only 20KB flash!

It looks a lot better with -C opt-level=s

@qwerty19106
Author

-C opt-level=2 is often used on microcontrollers when high performance is required. It usually adds ~20% to performance and ~30% to flash usage (compared with -C opt-level=s).

I see two problems here:

  1. -C opt-level=2 produces too many asm instructions. The code above contains four identical ldrex/strex blocks that differ only in the registers they use; the rest of the code is glue between those blocks.

compare_exchange_weak was inlined with both -C opt-level=2 and -C opt-level=s, so I expected roughly the same code in both cases.

Also, if one compiles this code with -C opt-level=2 --target armv7-unknown-linux-gnueabihf and with -C opt-level=s --target armv7-unknown-linux-gnueabihf, the produced code contains only one ldrex/strex block in both cases (godbolt)!
This proves that the I-heavy problem exists on thumbv7* targets.

  2. Lock-free logic is based on CAS primitives (compare_and_swap, compare_exchange and compare_exchange_weak).
    The number of cycles during which the exclusive monitor is set should be as small as possible.
    Thus any atomic primitive should work like this (pseudocode):
pub extern "C" fn atomic_usize_func(ptr: &AtomicUsize) -> (bool, usize) {
    let mut old = ptr.load();

    loop {
        let new = calc_new_from_old();
        if !check_something(old, new) {
            break (false, old);
        }

        atomic {
            let _old = ptr.load();
            if old == _old {
                ptr.store(new);
                break (true, old);
            } else {
                old = _old;
            }
        }
    }
}

Note that the atomic { ... } section should contain the same code regardless of the calc_new_from_old and check_something functions.

Above we see that the increment instruction (adds r2, #1) was moved into the ldrex/strex section.
The compiler_fence function does not influence this because adds r2, #1 is not a memory operation.
We cannot currently prevent the compiler from moving instructions into the ldrex/strex block, but this must be guaranteed!

@the8472
Member

the8472 commented Nov 26, 2020

-C opt-level=2 produces too many asm instructions. The code above contains four identical ldrex/strex blocks that differ only in the registers they use.

That's probably LLVM's loop unrolling heuristics deciding that it's worth it for that loop under O2 but deciding differently under Os.

You could split your project into different crates, with the hot code that needs optimizations compiled with opt-level=2 and the cold/size-sensitive code in a different crate compiled with opt-level=s. RFC 2282 enables per-crate build options. Alternatively, you can extract the loop body into an #[inline(never)] function, which should prevent loop unrolling.

Also, if one compiles this code with -C opt-level=2 --target armv7-unknown-linux-gnueabihf and with -C opt-level=s --target armv7-unknown-linux-gnueabihf, the produced code contains only one ldrex/strex block in both cases (godbolt)!
This proves that the I-heavy problem exists on thumbv7* targets.

Heuristics differ between target architectures, so it doesn't really prove anything; it just shows that several different factors are playing into each other here.

We cannot currently prevent the compiler from moving instructions into the ldrex/strex block, but this must be guaranteed!

This could also be accomplished by splitting the atomic section into an #[inline(never)] function; then calc_new_from_old or check_something won't have their instructions moved into the critical section.
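
A minimal sketch of that suggestion, applied to the increment example from this issue (cas_weak is a hypothetical helper name; whether the critical section really stays this small would still have to be checked in the generated asm):

use core::sync::atomic::{AtomicUsize, Ordering};

// Only the CAS attempt itself is hidden behind #[inline(never)], so the
// computation of the new value stays in the caller and cannot be scheduled
// between ldrex and strex.
#[inline(never)]
fn cas_weak(ptr: &AtomicUsize, old: usize, new: usize) -> Result<usize, usize> {
    ptr.compare_exchange_weak(old, new, Ordering::Relaxed, Ordering::Relaxed)
}

pub fn atomic_usize_inc(ptr: &AtomicUsize) -> usize {
    let mut old = ptr.load(Ordering::Relaxed);
    loop {
        let new = old + 1; // calc_new_from_old, outside the critical section
        match cas_weak(ptr, old, new) {
            Ok(prev) => break prev,
            Err(prev) => old = prev,
        }
    }
}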

@nagisa
Member

nagisa commented Nov 27, 2020

You can also use the attribute that sets the optimization level on a per-function basis. It is unstable but available.
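
That presumably refers to the unstable #[optimize(..)] attribute. A sketch of how it could be applied to the function from this issue, assuming a nightly toolchain and that the feature gate is named optimize_attribute:

#![feature(optimize_attribute)] // nightly only (assumed feature-gate name)
#![no_std]

use core::sync::atomic::{AtomicUsize, Ordering};

// Optimize just this function for size, even when the rest of the crate is
// built with -C opt-level=2.
#[optimize(size)]
pub fn atomic_usize_inc(ptr: &AtomicUsize) -> usize {
    let mut old = ptr.load(Ordering::Relaxed);
    loop {
        match ptr.compare_exchange_weak(old, old + 1, Ordering::Relaxed, Ordering::Relaxed) {
            Ok(prev) => break prev,
            Err(prev) => old = prev,
        }
    }
}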

@workingjubilee workingjubilee added the C-optimization Category: An issue highlighting optimization opportunities or PRs implementing such label Oct 8, 2023