
Only box the function once when creating threads. #31621

Closed
wants to merge 1 commit

Conversation

@rphmeier (Contributor)

I felt like non-capturing thread functions really shouldn't require any heap allocations.
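A quick sketch of the motivation (my own illustration, not code from the PR): a closure that captures nothing is a zero-sized type, so there is no captured state that actually needs to live on the heap.

```rust
fn main() {
    // A non-capturing closure is a zero-sized type (ZST): there is no
    // captured environment that would require a heap allocation.
    let f = || 1 + 1;
    assert_eq!(std::mem::size_of_val(&f), 0);
}
```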

@rust-highfive (Collaborator)

Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @brson (or someone else) soon.

If any changes to this PR are deemed necessary, please add them as extra commits. This ensures that the reviewer can see what has changed since they last reviewed the code. Due to the way GitHub handles out-of-date commits, this should also make it reasonably obvious what issues have or haven't been addressed. Large or tricky changes may require several passes of review and changes.

Please see the contribution instructions for more information.

@alexcrichton (Member)

Personally, I don't feel this is necessarily worth the cost. The farther down we push the generics, the more code we have to monomorphize for any vanilla call to thread::spawn, which makes "hello world"-style programs take that much longer to compile. In the past I've seen thread::spawn push compile times from 10ms to 500ms, which is certainly quite noticeable!

Without performance numbers one way or another (and I would be quite skeptical that this could show up in any profile), I'd prefer to stick to concrete types instead of generics wherever possible.

@rphmeier (Contributor, Author)

I had anticipated that as a possible downside, and the cost of any allocation would definitely be dwarfed by spawning the thread. In that sense, this change is unsubstantiated. However, I personally like to have very fine-grained control over how memory is used in my programs, and it's always a little disheartening to look into std and see hidden allocations like this, even when they don't really mean much.

Another approach would be to violate parametricity in thread::spawn, with a different path for ZSTs. The only way to make that work would be to use "raw" in the ZST version to pass the vtable as the thread args, which feels very hacky.

@arielb1 (Contributor) commented Feb 13, 2016

Box::new(zst) does not actually allocate anyway.
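A sketch of what that means in practice (my own example, relying on the standard allocator's treatment of zero-sized layouts): boxing a ZST produces a well-aligned dangling pointer rather than calling the allocator.

```rust
fn main() {
    struct Zst; // a zero-sized type

    // Boxing a ZST performs no heap allocation; the Box holds a
    // non-null, well-aligned dangling pointer instead.
    let b = Box::new(Zst);
    let p = Box::into_raw(b);
    assert!(!p.is_null());
    // Reconstituting and dropping it performs no deallocation either.
    unsafe { drop(Box::from_raw(p)) };
}
```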

@rphmeier (Contributor, Author)

@arielb1 Yes, but the old code boxes the box, which does allocate.

The core of this issue is that you need a pointer-width argument to pass to the system thread-creation function. You can't get that from a *mut Trait, so the old code boxed it again to get a *mut Box<FnBox()>.

What I meant by using raw in the ZST path is to deconstruct the first Box into a TraitObject, pass only the pointer-width vtable, and rebuild it on the other side. This is definitely hacky, but would probably work.
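The pointer-width problem can be seen directly from the sizes involved (a sketch, not code from the PR; modern dyn syntax used in place of FnBox):

```rust
use std::mem::size_of;

fn main() {
    // A boxed trait object is a fat pointer (data pointer + vtable
    // pointer), so it cannot fit in a single *mut c_void argument.
    assert_eq!(size_of::<Box<dyn FnOnce()>>(), 2 * size_of::<usize>());
    // Boxing it again yields a thin, pointer-width handle, which is why
    // the old code paid for a second allocation.
    assert_eq!(size_of::<Box<Box<dyn FnOnce()>>>(), size_of::<usize>());
}
```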

@arielb1 (Contributor) commented Feb 13, 2016

Ah right. I would make new a shim, like so:

    pub unsafe fn new<F: FnOnce()>(stack: usize, p: F) -> io::Result<Thread> {
        let p = box p;
        match Self::new_inner(stack, &*p as *const F as *const libc::c_void, thread_start::<F>) {
            Ok(o) => { mem::forget(p); Ok(o) }
            Err(e) => Err(e)
        }

        extern fn thread_start<F: FnOnce()>(main: *mut libc::c_void)
            -> *mut libc::c_void {
            unsafe {
                let main = Box::from_raw(main as *mut F);
                start_thread(main);
            }
            ptr::null_mut()
        }
    }

    unsafe fn new_inner<'a>(stack: usize, p: *const libc::c_void,
                            f: extern "C" fn(*mut libc::c_void) -> *mut libc::c_void)
                            -> io::Result<Thread> {
        // ..
    }

There's a potential use-after-free if new_inner panics after spawning the new thread, but the use-after-free is already present today: there's a

        assert_eq!(libc::pthread_attr_destroy(&mut attr), 0);

line - if it panics, we get a double-free of p. This is a place where #[noexcept] would be useful.

This way we also avoid creating a monomorphized vtable.

@rphmeier (Contributor, Author)

Making new a shim is probably a good idea -- it'll reduce the volume of monomorphized code.

w.r.t. the destruction of the thread attributes, I really doubt that would ever fail if the pthreads implementation is compliant.

@alexcrichton (Member)

> In that sense, this change is unsubstantiated

Having concrete data showing that we pay for this change without showing how the tradeoff is worth it seems bad to me though?

@rphmeier (Contributor, Author)

@alexcrichton I'll try to drum up a benchmark to see how much this really hurts compile times. It's not as though we don't already do some monomorphization for thread::spawn, so I wouldn't expect there to be a huge increase.

@rphmeier (Contributor, Author)

I did a

    time make check-stage1-std NO_REBUILD=1 NO_BENCH=1

for each of:

  • master (HEAD = 0c4d81f)
  • thread_box_once
  • thread_box_once_shim (a local branch)

which yielded the following times after two runs:

    Branch                 Run 1      Run 2
    master                 2m2.351s   1m59.618s
    thread_box_once        2m11.934s  2m13.587s
    thread_box_once_shim   2m5.279s   2m4.556s

The std tests use thread::spawn extensively in the thread, net, and sync modules, so I figure this is as good a test case as any. I couldn't figure out how to measure just the compile times without running the tests (something like cargo test --no-run), so these numbers roughly represent the aggregate of the compile-time regression and the thread::spawn performance increase (which is probably ~0).

I consider this level of regression acceptable, especially considering the sheer volume of thread::spawn calls in the tests, 199 in libstd alone (git grep thread::spawn | awk '/libstd/ { print $0 }' | wc -l).

@rphmeier rphmeier force-pushed the thread_box_once branch 2 times, most recently from b4e9419 to 2543cd7 Compare February 15, 2016 04:31
@arielb1 (Contributor) commented Feb 15, 2016

@rphmeier

I figure that you might want to manually inline start_thread (the stack overflow handler) into thread_start - this will avoid monomorphizing the <[closure] as FnBox> vtable and possibly speed up compile times even more.

@rphmeier (Contributor, Author)

@arielb1 I tried doing the inlined variety here: rphmeier@9fd687b

It seems to have slower compile times than the previous shim, though not as bad as my first attempt.

I do wonder whether the boxing is necessary at all with this restructuring of the code. Theoretically, you could just pass a pointer to p on the stack frame and use it the same way. This requires additional synchronization via atomics so that you don't mem::forget(p) or return from Thread::new() before the data has been read on the other thread. In my local proof-of-concept, this passed the tests, but test compile + run time took slightly longer than the boxing version.

@arielb1 (Contributor) commented Feb 15, 2016

That's... a bit odd, but go figure. What measurement method were you using?

@rphmeier (Contributor, Author)

Just a time make check-stage1-std NO_REBUILD=1 NO_BENCH=1

If you have any other suggestions for benchmarking, I'd be glad to hear them. I don't know very much about the Rust build system, so I'm sure there are better ways to perform this test.

@arielb1 (Contributor) commented Feb 15, 2016

You can cut out the middleman

$ rustc ../src/libstd/lib.rs --test -Z time-passes

in opt mode or in no-opt mode.

@alexcrichton (Member)

I was curious what this did to compile times in a more extreme case, as the std tests may not give great insight into compiling this function specifically. The test case I had was spawning 1000 threads (each returning a new integer) and then just dropping all the handles (no join). An optimized compile of this program went from 9:06.92 to 14:18.06, a 36% slowdown.

Have you measured any runtime benefit to this? I'm not super eager to make the standard library more generic without any runtime benefit.

@rphmeier (Contributor, Author)

I did the same test as you're proposing, but with only 200 threads as opposed to 1000, since 1000 ran on my laptop for about half an hour in LLVM with no signs of finishing. With 200 threads, I measured a 1.8% compilation slowdown with my changes (47.277s to 48.148s). Running rustc -Z time-passes indicated that for N=1000, it got hung up in LLVM, which probably means that there's a non-linear-complexity algorithm leading to the massive spike we're observing. Regardless, I don't think it's related to my changes.

Actually running the benchmark:

master:

running 1 test
test bench_run ... bench:   5,535,172 ns/iter (+/- 4,100,966)

thread_box_once:

running 1 test
test bench_run ... bench:   5,436,470 ns/iter (+/- 1,974,170)

So it seems that the runtimes are just about even. The 1.8% slowdown amounted to a 0.871s increase in compile times, or about 5ms per thread::spawn call. I'm not going to dismiss it, but I think that this is definitely a tolerable increase.

@arielb1 (Contributor) commented Feb 17, 2016

Maybe try a strategically placed inline(never)?

@alexcrichton (Member)

> I don't think it's related to my changes.

What I mean to point out is that making these functions more generic comes at a cost, and that cost is inlining more code downstream. This is minimized here, but the cost is not zero.

> So it seems that the runtimes are just about even.

While I agree with the conclusion, the method here doesn't look like it's producing accurate results. The variance in those measurements is massive, so we can't really gain any information from them. I suspect, however, that it's all but guaranteed that there is no perf difference from one more allocation when spawning a thread.

@bors (Contributor) commented Mar 23, 2016

☔ The latest upstream changes (presumably #32390) made this pull request unmergeable. Please resolve the merge conflicts.

@alexcrichton (Member)

Closing due to inactivity, but feel free to resubmit with a rebase!
