
Atomic compare_exchange(_weak) functions produce overly complicated asm code on thumbv7(e)m-none-eabi(hf) targets #79418

Open
qwerty19106 opened this issue Nov 25, 2020 · 4 comments
Labels
C-enhancement   Category: An issue proposing an enhancement or a PR with one.
C-optimization  Category: An issue highlighting optimization opportunities or PRs implementing such.
I-heavy         Issue: Problems and improvements with respect to binary size of generated code.
O-Arm           Target: 32-bit Arm processors (armv6, armv7, thumb...), including 64-bit Arm in AArch32 state.
T-compiler      Relevant to the compiler team, which will review and decide on the PR/issue.
WG-embedded     Working group: Embedded systems.

Comments

@qwerty19106

This code performs an atomic increment:

#![no_std]

use core::sync::atomic::*;

pub extern "C" fn atomic_usize_inc(ptr: &AtomicUsize) -> usize {
    let mut old = ptr.load(Ordering::Relaxed);

    loop {
        let new = old + 1;

        match ptr.compare_exchange_weak(old, new, Ordering::Relaxed, Ordering::Relaxed) {
            Ok(_old) => break _old,
            Err(_old) => old = _old,
        }
    }
}

Expected asm:

example::atomic_usize_inc:
1:
        ldr     r1, [r0]
        add     r3, r1, #1

        ldrex   r2, [r0]
        cmp     r2, r1
        itt     ne
        clrexne
        bne     1b
        strex   r1, r3, [r0]
        cbz     r1, 2f
        b       1b
2:
        movs    r0, r2

Produced asm (godbolt):

example::atomic_usize_inc:
        ldr     r2, [r0]
.LBB0_1:
        ldrex   r1, [r0]
        cmp     r1, r2
        bne     .LBB0_4
        adds    r2, #1
        strex   r3, r2, [r0]
        cbnz    r3, .LBB0_5
        movs    r2, #1
        b       .LBB0_6
.LBB0_4:
        clrex
.LBB0_5:
        movs    r2, #0
.LBB0_6:
        cbnz    r2, .LBB0_25
        ldrex   r2, [r0]
        cmp     r2, r1
        bne     .LBB0_10
        adds    r1, #1
        strex   r3, r1, [r0]
        cbnz    r3, .LBB0_11
        movs    r1, #1
        b       .LBB0_12
.LBB0_10:
        clrex
.LBB0_11:
        movs    r1, #0
.LBB0_12:
        cbnz    r1, .LBB0_24
        ldrex   r1, [r0]
        cmp     r1, r2
        bne     .LBB0_16
        adds    r2, #1
        strex   r3, r2, [r0]
        cbnz    r3, .LBB0_17
        movs    r2, #1
        b       .LBB0_18
.LBB0_16:
        clrex
.LBB0_17:
        movs    r2, #0
.LBB0_18:
        cbnz    r2, .LBB0_25
        ldrex   r2, [r0]
        cmp     r2, r1
        bne     .LBB0_22
        adds    r1, #1
        strex   r3, r1, [r0]
        cbnz    r3, .LBB0_23
        movs    r1, #1
        cmp     r1, #0
        beq     .LBB0_1
        b       .LBB0_24
.LBB0_22:
        clrex
.LBB0_23:
        movs    r1, #0
        cmp     r1, #0
        beq     .LBB0_1
.LBB0_24:
        mov     r1, r2
.LBB0_25:
        mov     r0, r1
        bx      lr

Code size is very important on Cortex-M targets (thumbv*) because some controllers have only 20KB flash!

Besides, the increment instruction (adds r2, #1) was moved into the ldrex/strex section. This loses the advantage of compare_exchange: evaluating the new value before ldrex so that the exclusive monitor is held for as few cycles as possible.

This particular code is redundant because we could simply call fetch_add, but other tasks do require compare_exchange_weak, for example an atomic increment with an upper bound (pseudocode):

pub extern "C" fn atomic_usize_inc_with_max(ptr: &AtomicUsize) -> (bool, usize) {
    atomic {
        let old = ptr.load(Ordering::Relaxed);
        if old == max {
            return (false, old);
        }

        ptr.store(old + 1, Ordering::Relaxed);
        return (true, old);
    }
}
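
For reference, that pseudocode can be written as an ordinary compare_exchange_weak loop. This is only a sketch: max is taken as a parameter here, and the extern "C"/tuple return is dropped since a Rust tuple is not FFI-safe.

use core::sync::atomic::{AtomicUsize, Ordering};

pub fn atomic_usize_inc_with_max(ptr: &AtomicUsize, max: usize) -> (bool, usize) {
    let mut old = ptr.load(Ordering::Relaxed);
    loop {
        if old == max {
            // Upper bound reached: do not attempt a store.
            break (false, old);
        }
        match ptr.compare_exchange_weak(old, old + 1, Ordering::Relaxed, Ordering::Relaxed) {
            Ok(prev) => break (true, prev),
            // Spurious failure or a concurrent update: retry with the freshly observed value.
            Err(prev) => old = prev,
        }
    }
}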

Meta

rustc --version --verbose:

rustc 1.50.0-nightly (1c389ffef 2020-11-24)
@qwerty19106 qwerty19106 added the C-bug Category: This is a bug. label Nov 25, 2020
@jonas-schievink jonas-schievink added C-enhancement Category: An issue proposing an enhancement or a PR with one. I-heavy Issue: Problems and improvements with respect to binary size of generated code. O-Arm Target: 32-bit Arm processors (armv6, armv7, thumb...), including 64-bit Arm in AArch32 state T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. WG-embedded Working group: Embedded systems and removed C-bug Category: This is a bug. labels Nov 25, 2020
@the8472
Member

the8472 commented Nov 25, 2020

Code size is very important on Cortex-M targets (thumbv*) because some controllers have only 20KB flash!

It looks a lot better with -C opt-level=s

@qwerty19106
Author

-C opt-level=2 is often used on microcontrollers when high performance is required. It usually adds ~20% to performance and ~30% to flash usage (compared with -C opt-level=s).

I see two problems here:

  1. -C opt-level=2 produces too many asm instructions. The code above contains four identical ldrex/strex blocks that differ only in the registers they use; the rest of the code is glue between those blocks.

compare_exchange_weak was inlined with both -C opt-level=2 and -C opt-level=s, so I expected roughly the same code in both cases.

Also, if one compiles this code with -C opt-level=2 --target armv7-unknown-linux-gnueabihf and with -C opt-level=s --target armv7-unknown-linux-gnueabihf, the produced code contains only one ldrex/strex block in both cases (godbolt)!
This proves that the I-heavy problem exists on thumbv7* targets.

  2. Lock-free logic is based on CAS primitives (compare_and_swap, compare_exchange and compare_exchange_weak).
    The number of cycles during which the exclusive monitor is set should be as small as possible.
    Thus any atomic primitive should work like this (pseudocode):
pub extern "C" fn atomic_usize_func(ptr: &AtomicUsize) -> (bool, usize) {
    let mut old = ptr.load();

    loop {
        let new = calc_new_from_old();
        if !check_something(old, new) {
            break (false, old);
        }

        atomic {
            let _old = ptr.load();
            if old == _old {
                ptr.store(new);
                break (true, old);
            } else {
                old = _old;
            }
        }
    }
}

Note that the atomic { ... } section should contain the same code regardless of the calc_new_from_old and check_something functions.

Above we see that the increment instruction (adds r2, #1) was moved into the ldrex/strex section.
The compiler_fence function does not influence this because adds r2, #1 is not a memory operation.
We cannot currently prevent the compiler from moving instructions into the ldrex/strex block, but this must be guaranteed!

@the8472
Member

the8472 commented Nov 26, 2020

-C opt-level=2 produces too many asm instructions. The code above contains four identical ldrex/strex blocks that differ only in the registers they use.

That's probably LLVM's loop unrolling heuristics deciding that it's worth it for that loop under O2 but deciding differently under Os.

You could split your project into different crates, with the hot code that needs optimizations compiled with opt-level=2 and the cold/size-sensitive code in a different crate compiled with opt-level=s. RFC 2282 enables per-crate build options. Alternatively, you can extract the loop body into an #[inline(never)] function, which should prevent loop unrolling.

Also, if one compiles this code with -C opt-level=2 --target armv7-unknown-linux-gnueabihf and with -C opt-level=s --target armv7-unknown-linux-gnueabihf, the produced code contains only one ldrex/strex block in both cases (godbolt)!
This proves that the I-heavy problem exists on thumbv7* targets.

Heuristics differ between target architectures, so it doesn't really prove anything; it just shows that several different factors are playing into each other here.

We cannot currently prevent the compiler from moving instructions into the ldrex/strex block, but this must be guaranteed!

This could also be accomplished by splitting the atomic section into an #[inline(never)] function; then calc_new_from_old or check_something won't have their instructions moved into the critical section.
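
A minimal sketch of that suggestion, applied to the increment example from this issue (cas_weak is a hypothetical helper name; whether the critical section really stays this small would still have to be checked in the generated asm):

use core::sync::atomic::{AtomicUsize, Ordering};

// Only the CAS attempt itself is hidden behind #[inline(never)], so the
// computation of the new value stays in the caller and cannot be scheduled
// between ldrex and strex.
#[inline(never)]
fn cas_weak(ptr: &AtomicUsize, old: usize, new: usize) -> Result<usize, usize> {
    ptr.compare_exchange_weak(old, new, Ordering::Relaxed, Ordering::Relaxed)
}

pub fn atomic_usize_inc(ptr: &AtomicUsize) -> usize {
    let mut old = ptr.load(Ordering::Relaxed);
    loop {
        let new = old + 1; // calc_new_from_old, outside the critical section
        match cas_weak(ptr, old, new) {
            Ok(prev) => break prev,
            Err(prev) => old = prev,
        }
    }
}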

@nagisa
Member

nagisa commented Nov 27, 2020

You can also use the attribute that sets the optimization level on a per-function basis. It is unstable but available.
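
That presumably refers to the unstable #[optimize(..)] attribute. A sketch of how it could be applied to the function from this issue, assuming a nightly toolchain and that the feature gate is named optimize_attribute:

#![feature(optimize_attribute)] // nightly only (assumed feature-gate name)
#![no_std]

use core::sync::atomic::{AtomicUsize, Ordering};

// Optimize just this function for size, even when the rest of the crate is
// built with -C opt-level=2.
#[optimize(size)]
pub fn atomic_usize_inc(ptr: &AtomicUsize) -> usize {
    let mut old = ptr.load(Ordering::Relaxed);
    loop {
        match ptr.compare_exchange_weak(old, old + 1, Ordering::Relaxed, Ordering::Relaxed) {
            Ok(prev) => break prev,
            Err(prev) => old = prev,
        }
    }
}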

@workingjubilee workingjubilee added the C-optimization Category: An issue highlighting optimization opportunities or PRs implementing such label Oct 8, 2023