Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

String::push is slow #116235

Open
lincot opened this issue Sep 28, 2023 · 8 comments
Open

String::push is slow #116235

lincot opened this issue Sep 28, 2023 · 8 comments
Labels
A-str Area: str and String I-slow Issue: Problems and improvements with respect to performance of generated code. T-libs Relevant to the library team, which will review and decide on the PR/issue.

Comments

@lincot
Copy link

lincot commented Sep 28, 2023

An alternative implementation of String::push, which reserves space if needed and then performs push_unchecked, gives a significant performance improvement.

current push:

test bench_push_1_byte  ... bench:      10,927 ns/iter (+/- 147) = 915 MB/s
test bench_push_2_bytes ... bench:      49,662 ns/iter (+/- 373) = 402 MB/s
test bench_push_3_bytes ... bench:      49,780 ns/iter (+/- 102) = 602 MB/s
test bench_push_4_bytes ... bench:      40,556 ns/iter (+/- 349) = 986 MB/s

new push:

test bench_push_1_byte  ... bench:      10,366 ns/iter (+/- 155) = 964 MB/s
test bench_push_2_bytes ... bench:      14,446 ns/iter (+/- 137) = 1384 MB/s
test bench_push_3_bytes ... bench:      19,238 ns/iter (+/- 265) = 1559 MB/s
test bench_push_4_bytes ... bench:      23,922 ns/iter (+/- 213) = 1672 MB/s
bench.rs
#![no_std]
#![feature(test)]

extern crate alloc;
extern crate test;
use alloc::string::String;
use test::{black_box, Bencher};

// #[inline]
// fn push(s: &mut String, ch: char) {
//     s.push(ch);
// }

#[inline]
fn push(s: &mut String, ch: char) {
    if s.capacity() - s.len() < ch.len_utf8() {
        s.reserve(ch.len_utf8());
    }
    unsafe { s.push_unchecked(ch) };
}

trait PushUnchecked<T> {
    unsafe fn push_unchecked(&mut self, value: T);
}

impl PushUnchecked<char> for String {
    #[inline]
    unsafe fn push_unchecked(&mut self, ch: char) {
        let len = self.len();
        let ptr = self.as_mut_vec().as_mut_ptr().add(len);
        let count = ch.len_utf8();
        debug_assert!(len + count <= self.capacity());
        match count {
            1 => {
                core::ptr::write(ptr, ch as u8);
            }
            2 => {
                core::ptr::write(ptr, (ch as u32 >> 6 & 0x1F) as u8 | 0b1100_0000);
                core::ptr::write(ptr.add(1), (ch as u32 & 0x3F) as u8 | 0b1000_0000);
            }
            3 => {
                core::ptr::write(ptr, (ch as u32 >> 12 & 0x0F) as u8 | 0b1110_0000);
                core::ptr::write(ptr.add(1), (ch as u32 >> 6 & 0x3F) as u8 | 0b1000_0000);
                core::ptr::write(ptr.add(2), (ch as u32 & 0x3F) as u8 | 0b1000_0000);
            }
            4 => {
                core::ptr::write(ptr, (ch as u32 >> 18 & 0x07) as u8 | 0b1111_0000);
                core::ptr::write(ptr.add(1), (ch as u32 >> 12 & 0x3F) as u8 | 0b1000_0000);
                core::ptr::write(ptr.add(2), (ch as u32 >> 6 & 0x3F) as u8 | 0b1000_0000);
                core::ptr::write(ptr.add(3), (ch as u32 & 0x3F) as u8 | 0b1000_0000);
            }
            _ => core::hint::unreachable_unchecked(),
        }
        self.as_mut_vec().set_len(len + count);
    }
}

const ITERATIONS: u64 = 10_000;

#[bench]
fn bench_push_1_byte(bencher: &mut Bencher) {
    const CHAR: char = '0';
    assert_eq!(CHAR.len_utf8(), 1);
    bencher.bytes = ITERATIONS;
    bencher.iter(|| {
        let mut s = String::new();
        for _ in 0..black_box(ITERATIONS) {
            push(&mut s, black_box(CHAR));
        }
        s
    });
}

#[bench]
fn bench_push_2_bytes(bencher: &mut Bencher) {
    const CHAR: char = 'д';
    assert_eq!(CHAR.len_utf8(), 2);
    bencher.bytes = 2 * ITERATIONS;
    bencher.iter(|| {
        let mut s = String::new();
        for _ in 0..black_box(ITERATIONS) {
            push(&mut s, black_box(CHAR));
        }
        s
    });
}

#[bench]
fn bench_push_3_bytes(bencher: &mut Bencher) {
    const CHAR: char = '❗';
    assert_eq!(CHAR.len_utf8(), 3);
    bencher.bytes = 3 * ITERATIONS;
    bencher.iter(|| {
        let mut s = String::new();
        for _ in 0..black_box(ITERATIONS) {
            push(&mut s, black_box(CHAR));
        }
        s
    });
}

#[bench]
fn bench_push_4_bytes(bencher: &mut Bencher) {
    const CHAR: char = '🤨';
    assert_eq!(CHAR.len_utf8(), 4);
    bencher.bytes = 4 * ITERATIONS;
    bencher.iter(|| {
        let mut s = String::new();
        for _ in 0..black_box(ITERATIONS) {
            push(&mut s, black_box(CHAR));
        }
        s
    });
}

It's worth noting that the current String::push implementation encodes characters into a temporary zero-initialized buffer. The zeroing alone is an unnecessary instruction, but this method is used in several places, notably String::insert and the default Write::write_char.

@rustbot rustbot added the needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. label Sep 28, 2023
@the8472
Copy link
Member

the8472 commented Sep 28, 2023

Since you already wrote the code, make a PR? Though you might want to look into using vec.spare_capacity_mut() + slice::get_unchecked, which will add some debug asserts compared to the using direct pointer writes.

@the8472 the8472 added I-slow Issue: Problems and improvements with respect to performance of generated code. T-libs Relevant to the library team, which will review and decide on the PR/issue. A-str Area: str and String and removed needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. labels Sep 28, 2023
@lincot
Copy link
Author

lincot commented Sep 28, 2023

@the8472 this way? AFAIK get_unchecked_mut doesn't perform debug assertions. We already have a debug_assert! for capacity, though. As to the PR, do we want the new push_unchecked function duplicating char::encode_utf8 code?

@the8472
Copy link
Member

the8472 commented Sep 28, 2023

the debug asserts

assert_unsafe_precondition!(
"slice::get_unchecked requires that the index is within the slice",
[T](this: usize, slice: *const [T]) => this < slice.len()
);

@the8472
Copy link
Member

the8472 commented Sep 28, 2023

I have no opinion on push_unchecked, you can just make it a private function if you don't have a reason to expose it.

@lincot
Copy link
Author

lincot commented Sep 29, 2023

@the8472 it seems that the debug assert only applies if you compile core from source.

As for push_unchecked, I would like to see it in the library as I commonly use it after initializing strings with capacity.

I managed to make it identical to the get_unchecked_mut approach by reusing char::encode_utf8 with a hint.

@the8472
Copy link
Member

the8472 commented Sep 29, 2023

If you want to add API you may want to open an ACP first.
The internal optimization only needs a PR.

@lincot
Copy link
Author

lincot commented Oct 3, 2023

Although the get_unchecked_mut and char::encode_utf8 push_unchecked implementations give identical results when not inlined and with String::push, I have found that they behave very differently in other applications, the latter being slower. Also, the hint is too hacky.

Another solution is to add a new function encode_utf8_raw_unchecked to be called by push_unchecked and encode_utf8_raw. It is identical to the first approach in both applications and is almost the same for char::encode_utf8.

@lincot
Copy link
Author

lincot commented Oct 26, 2023

Another thing to note is that this version, unlike the current one, is unable to elide the capacity check in a simple ASCII push loop with a pre-allocated string. Both versions fail on non-ASCII and on long strings, though.

It seems that the current version is able to elide the check because it uses RawVec::reserve_for_push (which has #[inline(never)]) instead of the usual RawVec::reserve in the case of ASCII input. Interestingly, if we replace the reserve call with a panic, it gets elided even in the case of long non-ASCII strings.

If we put the reserve call behind #[inline(never)], it works for a short non-ASCII string, but not for a long one. RawVec::reserve itself performs a check and calls a cold reallocation function when needed. Locally I observed that if #[inline(never)] is applied instead of #[cold] or together with it, it only has a partial effect similar to putting String::reserve behind #[inline(never)].

Despite the issue, my version still outperforms the original in a benchmark with pre-allocated strings.

current push:

test bench_push_1_byte_prealloc           ... bench:      10,170 ns/iter (+/- 85) = 983 MB/s
test bench_push_1_byte_prealloc_blackbox  ... bench:      11,129 ns/iter (+/- 46) = 898 MB/s
test bench_push_2_bytes_prealloc          ... bench:       4,742 ns/iter (+/- 5) = 4217 MB/s
test bench_push_2_bytes_prealloc_blackbox ... bench:      49,587 ns/iter (+/- 41) = 403 MB/s
test bench_push_3_bytes_prealloc          ... bench:       4,744 ns/iter (+/- 5) = 6323 MB/s
test bench_push_3_bytes_prealloc_blackbox ... bench:      49,675 ns/iter (+/- 41) = 603 MB/s
test bench_push_4_bytes_prealloc          ... bench:       4,745 ns/iter (+/- 5) = 8429 MB/s
test bench_push_4_bytes_prealloc_blackbox ... bench:      42,509 ns/iter (+/- 29) = 940 MB/s

new push:

test bench_push_1_byte_prealloc           ... bench:       4,753 ns/iter (+/- 51) = 2103 MB/s
test bench_push_1_byte_prealloc_blackbox  ... bench:      12,419 ns/iter (+/- 83) = 805 MB/s
test bench_push_2_bytes_prealloc          ... bench:       4,750 ns/iter (+/- 27) = 4210 MB/s
test bench_push_2_bytes_prealloc_blackbox ... bench:      11,863 ns/iter (+/- 86) = 1685 MB/s
test bench_push_3_bytes_prealloc          ... bench:       4,748 ns/iter (+/- 26) = 6318 MB/s
test bench_push_3_bytes_prealloc_blackbox ... bench:      16,597 ns/iter (+/- 86) = 1807 MB/s
test bench_push_4_bytes_prealloc          ... bench:       4,751 ns/iter (+/- 27) = 8419 MB/s
test bench_push_4_bytes_prealloc_blackbox ... bench:      21,330 ns/iter (+/- 79) = 1875 MB/s

(you can see a small regression in bench_push_1_byte_prealloc_blackbox, it doesn't happen if the code isn't compiled for the native CPU or if std is compiled from source)

however, both versions work much worse than push_unchecked:

test bench_push_1_byte_prealloc           ... bench:         118 ns/iter (+/- 1) = 84745 MB/s
test bench_push_1_byte_prealloc_blackbox  ... bench:      10,041 ns/iter (+/- 18) = 995 MB/s
test bench_push_2_bytes_prealloc          ... bench:         328 ns/iter (+/- 0) = 60975 MB/s
test bench_push_2_bytes_prealloc_blackbox ... bench:       8,292 ns/iter (+/- 26) = 2411 MB/s
test bench_push_3_bytes_prealloc          ... bench:         491 ns/iter (+/- 2) = 61099 MB/s
test bench_push_3_bytes_prealloc_blackbox ... bench:      11,849 ns/iter (+/- 35) = 2531 MB/s
test bench_push_4_bytes_prealloc          ... bench:         689 ns/iter (+/- 4) = 58055 MB/s
test bench_push_4_bytes_prealloc_blackbox ... bench:      16,566 ns/iter (+/- 33) = 2414 MB/s
bench.rs
#![no_std]
#![feature(test)]

extern crate alloc;
extern crate test;
use alloc::string::String;
use test::{black_box, Bencher};

// #[inline]
// fn push(s: &mut String, ch: char) {
//     s.push(ch);
// }

#[inline]
fn push(s: &mut String, ch: char) {
    s.reserve(ch.len_utf8());
    unsafe { s.push_unchecked(ch) };
}

const TAG_CONT: u8 = 0b1000_0000;
const TAG_TWO_B: u8 = 0b1100_0000;
const TAG_THREE_B: u8 = 0b1110_0000;
const TAG_FOUR_B: u8 = 0b1111_0000;
const MAX_ONE_B: u32 = 0x80;
const MAX_TWO_B: u32 = 0x800;
const MAX_THREE_B: u32 = 0x10000;

#[inline]
const fn len_utf8(code: u32) -> usize {
    if code < MAX_ONE_B {
        1
    } else if code < MAX_TWO_B {
        2
    } else if code < MAX_THREE_B {
        3
    } else {
        4
    }
}

#[inline]
unsafe fn encode_utf8_raw_unchecked(code: u32, dst: &mut [u8]) -> &mut [u8] {
    let len = len_utf8(code);
    match len {
        1 => {
            *dst.get_unchecked_mut(0) = code as u8;
        }
        2 => {
            *dst.get_unchecked_mut(0) = (code >> 6 & 0x1F) as u8 | TAG_TWO_B;
            *dst.get_unchecked_mut(1) = (code & 0x3F) as u8 | TAG_CONT;
        }
        3 => {
            *dst.get_unchecked_mut(0) = (code >> 12 & 0x0F) as u8 | TAG_THREE_B;
            *dst.get_unchecked_mut(1) = (code >> 6 & 0x3F) as u8 | TAG_CONT;
            *dst.get_unchecked_mut(2) = (code & 0x3F) as u8 | TAG_CONT;
        }
        4 => {
            *dst.get_unchecked_mut(0) = (code >> 18 & 0x07) as u8 | TAG_FOUR_B;
            *dst.get_unchecked_mut(1) = (code >> 12 & 0x3F) as u8 | TAG_CONT;
            *dst.get_unchecked_mut(2) = (code >> 6 & 0x3F) as u8 | TAG_CONT;
            *dst.get_unchecked_mut(3) = (code & 0x3F) as u8 | TAG_CONT;
        }
        _ => core::hint::unreachable_unchecked(),
    };
    dst.get_unchecked_mut(..len)
}

trait PushUnchecked<T> {
    unsafe fn push_unchecked(&mut self, value: T);
}

impl PushUnchecked<char> for String {
    #[inline]
    unsafe fn push_unchecked(&mut self, ch: char) {
        let len = self.len();
        let count = ch.len_utf8();
        let spare = core::slice::from_raw_parts_mut(
            self.as_mut_vec().as_mut_ptr().add(len),
            self.capacity() - len,
        );
        encode_utf8_raw_unchecked(ch as u32, spare);
        self.as_mut_vec().set_len(len + count);
    }
}

const ITERATIONS: u64 = 10_000;

#[bench]
fn bench_push_1_byte_prealloc(bencher: &mut Bencher) {
    const CHAR: char = '0';
    assert_eq!(CHAR.len_utf8(), 1);
    bencher.bytes = ITERATIONS;
    bencher.iter(|| {
        let mut s = String::with_capacity(ITERATIONS as _);
        for _ in 0..ITERATIONS {
            push(&mut s, CHAR);
        }
        s
        // vec![CHAR; ITERATIONS as _]
    });
}

#[bench]
fn bench_push_2_bytes_prealloc(bencher: &mut Bencher) {
    const CHAR: char = 'д';
    assert_eq!(CHAR.len_utf8(), 2);
    bencher.bytes = 2 * ITERATIONS;
    bencher.iter(|| {
        let mut s = String::with_capacity((2 * ITERATIONS) as _);
        for _ in 0..ITERATIONS {
            push(&mut s, CHAR);
        }
        s
        // vec![CHAR; ITERATIONS as _]
    });
}

#[bench]
fn bench_push_3_bytes_prealloc(bencher: &mut Bencher) {
    const CHAR: char = '❗';
    assert_eq!(CHAR.len_utf8(), 3);
    bencher.bytes = 3 * ITERATIONS;
    bencher.iter(|| {
        let mut s = String::with_capacity((3 * ITERATIONS) as _);
        for _ in 0..ITERATIONS {
            push(&mut s, CHAR);
        }
        s
        // vec![CHAR; ITERATIONS as _]
    });
}

#[bench]
fn bench_push_4_bytes_prealloc(bencher: &mut Bencher) {
    const CHAR: char = '🤨';
    assert_eq!(CHAR.len_utf8(), 4);
    bencher.bytes = 4 * ITERATIONS;
    bencher.iter(|| {
        let mut s = String::with_capacity((4 * ITERATIONS) as _);
        for _ in 0..ITERATIONS {
            push(&mut s, CHAR);
        }
        s
        // vec![CHAR; ITERATIONS as _]
    });
}

#[bench]
fn bench_push_1_byte_prealloc_blackbox(bencher: &mut Bencher) {
    const CHAR: char = '0';
    assert_eq!(CHAR.len_utf8(), 1);
    bencher.bytes = ITERATIONS;
    bencher.iter(|| {
        let mut s = String::with_capacity(ITERATIONS as _);
        for _ in 0..black_box(ITERATIONS) {
            push(&mut s, black_box(CHAR));
        }
        s
    });
}

#[bench]
fn bench_push_2_bytes_prealloc_blackbox(bencher: &mut Bencher) {
    const CHAR: char = 'д';
    assert_eq!(CHAR.len_utf8(), 2);
    bencher.bytes = 2 * ITERATIONS;
    bencher.iter(|| {
        let mut s = String::with_capacity((2 * ITERATIONS) as _);
        for _ in 0..black_box(ITERATIONS) {
            push(&mut s, black_box(CHAR));
        }
        s
    });
}

#[bench]
fn bench_push_3_bytes_prealloc_blackbox(bencher: &mut Bencher) {
    const CHAR: char = '❗';
    assert_eq!(CHAR.len_utf8(), 3);
    bencher.bytes = 3 * ITERATIONS;
    bencher.iter(|| {
        let mut s = String::with_capacity((3 * ITERATIONS) as _);
        for _ in 0..black_box(ITERATIONS) {
            push(&mut s, black_box(CHAR));
        }
        s
    });
}

#[bench]
fn bench_push_4_bytes_prealloc_blackbox(bencher: &mut Bencher) {
    const CHAR: char = '🤨';
    assert_eq!(CHAR.len_utf8(), 4);
    bencher.bytes = 4 * ITERATIONS;
    bencher.iter(|| {
        let mut s = String::with_capacity((4 * ITERATIONS) as _);
        for _ in 0..black_box(ITERATIONS) {
            push(&mut s, black_box(CHAR));
        }
        s
    });
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-str Area: str and String I-slow Issue: Problems and improvements with respect to performance of generated code. T-libs Relevant to the library team, which will review and decide on the PR/issue.
Projects
None yet
Development

No branches or pull requests

3 participants