
EMA based statistically adaptive thread pool design #108

Conversation

Member

@vertexclique vertexclique commented Aug 24, 2019

This is an adaptive thread pool implementation based on exponential moving averages (it doesn't rely on idle time).
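For readers unfamiliar with the technique, here is a minimal sketch of the EMA recurrence itself. The coefficient and the sample values are illustrative only; the PR's actual constants and sampling logic live in src/task/blocking.rs.

// Minimal sketch of an exponential moving average tracking task arrival rate.
// EMA_COEFFICIENT below is illustrative, not the PR's value.
const EMA_COEFFICIENT: f64 = 2.0 / (10.0 + 1.0); // smooth over roughly 10 samples

/// Fold a new frequency sample into the running average. Recent samples
/// carry exponentially more weight than old ones, so the estimate adapts
/// quickly to bursts without relying on idle time.
fn update_ema(previous: f64, sample: f64) -> f64 {
    EMA_COEFFICIENT * sample + (1.0 - EMA_COEFFICIENT) * previous
}

fn main() {
    let mut ema = 0.0;
    for sample in [10.0, 12.0, 80.0, 75.0] {
        ema = update_ema(ema, sample);
        println!("sample = {sample:>5.1}, EMA = {ema:6.2}");
    }
}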

Here are the benchmarks:

The new statistical EMA approach:

  • CPU utilization ~97%
running 1 test
test blocking ... bench:  39,116,408 ns/iter (+/- 5,686,213)

The current approach:

  • CPU utilization ~88%
running 1 test
test blocking ... bench: 254,204,382 ns/iter (+/- 214,099,973)

This approach yields a ~6.5x speedup (254.2 ms down to 39.1 ms per iteration).

On top of that, this PR intrinsically solves #65, since it is not nice to manually include modules at the benchmark level.

Documentation

blocking doc comment

@yoshuawuyts
Contributor

I'm by no means qualified to review this code, but the results are looking very cool! Awesome work @vertexclique! 🎉

@ghost ghost requested a review from spacejam August 26, 2019 07:05
@vertexclique vertexclique force-pushed the ema-based-statistically-adaptive-thread-pool branch from 81837e7 to ecfce3a on August 26, 2019 11:44
@spacejam
Contributor

Thanks for continuing to boil this down! These are really nice results that address the main runaway thread growth issue in the naive original implementation.

I'm working through some more correctness proofs on paper, trying to find more ways to break the large number of Relaxed atomic accesses added in this PR.

@vertexclique
Member Author

vertexclique commented Aug 26, 2019

Thanks, Tyler. Very nice to hear that. If you are doing formal verification, please share it with me too.

@spacejam
Contributor

Hehehe, the reason I'm using paper for a couple of things is that I'm a little uncertain about translating Relaxed stuff into TLA+ (maybe using sets of all reorderable code and permuting them?).

Basically, the only thing I'm looking for, and desk-checking backwards from, is anywhere a completely random value impacts runtime correctness.

@vertexclique
Member Author

More to improve (aka TO-DO):

  • Adaptive scale-down (we can do this sometime later; it is harder and needs more sync points than the above)
  • Compartmentalize the pool creation into so-called execution contexts so they satisfy the needs of More flexible reactor/executor API #79. This PR also helps with that in an indirect way. (Atomics can be moved for that issue, no worries.)

@spacejam
Contributor

spacejam commented Aug 26, 2019

If more synchronization needs to occur, it probably isn't a bad time to take a step back from the 100% atomics-based approach and either use ROWEX when adaptation is required (spawning readers work lock-free; adapting writers must claim an exclusive mutex, reducing the concurrent interleaving space) or just use a Mutex so that the implementation is obviously correct. Thinking about all of these atomic accesses is quite labor-intensive and error-prone.

Mutexes are quite fast compared to anything that needs to do IO, and I doubt that using one would have any performance impact on the workloads this module is involved with.
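A minimal sketch of what that Mutex-based alternative could look like, assuming illustrative field names rather than async-std's real internals: every scaling statistic lives behind one lock, so the adaptation math is serialized and easy to verify.

use std::sync::Mutex;

// Hypothetical pool statistics; field names are made up for illustration.
struct PoolStats {
    frequency: u64,       // tasks observed in the current sampling window
    ema: f64,             // exponential moving average of the arrival rate
    active_threads: usize,
}

static STATS: Mutex<PoolStats> = Mutex::new(PoolStats {
    frequency: 0,
    ema: 0.0,
    active_threads: 0,
});

fn record_task() {
    // Only cheap arithmetic runs with the lock held: no IO and no blocking
    // calls, so contention stays negligible next to the spawned work itself.
    let mut stats = STATS.lock().unwrap();
    stats.frequency += 1;
}

fn main() {
    record_task();
    let stats = STATS.lock().unwrap();
    println!(
        "freq={} ema={} threads={}",
        stats.frequency, stats.ema, stats.active_threads
    );
}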

@vertexclique
Member Author

vertexclique commented Aug 26, 2019

I think the todos can go into other PRs. Mutexes can be useful (if I understand you correctly, you are talking about the scaling-down part), but I don't think they should be used for this code since there is no critical-section code here; I don't like abusing lock exclusion in places where it isn't needed (but I am not fully against it).

I am rooting for taking this PR as it is for the blocking pool right now. Then we can think about that later during the extraction process (which will also cover the async thread pool, I think).

@spacejam
Contributor

I guess what I'm mostly getting at is that some potentially very subtle information flows can happen when this math operates in an entirely concurrent way on global variables via Relaxed atomics (or even SeqCst atomics). This is expensive for a human to reason about overall (it's also expensive/unreasonable for the formal verification tooling I know about to provide meaningful validations of).

@spacejam
Contributor

There is almost never a real need for exclusion, but it makes PRs like this understandable without staring at them for hours :P

@spacejam
Contributor

In general, please change all of the atomics to use barriers that prevent potentially risky reorderings. Note that in the original, relaxed atomics were used with a clear explanation of why randomness in their use path violated no correctness criteria. This code is much, much more complex, and relaxed atomics make it quite difficult to reason about.

Mutexes will make this very easy for me to review and merge, and will not impact performance, so I recommend that you replace all of the atomic stuff with them, but I totally get that it's not as fun!
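As an aside, here is a minimal sketch of the kind of reordering hazard being discussed; the variable names are hypothetical, not the PR's. With Relaxed stores, a reader can observe an updated pool size paired with a stale average (or vice versa); a Release/Acquire pair rules that out.

use std::sync::atomic::{AtomicUsize, Ordering};

// Hypothetical shared state: pool size plus the EMA packed into usize bits.
static POOL_SIZE: AtomicUsize = AtomicUsize::new(0);
static EMA_BITS: AtomicUsize = AtomicUsize::new(0);

fn publish_relaxed(size: usize, ema_bits: usize) {
    EMA_BITS.store(ema_bits, Ordering::Relaxed);
    // No ordering with the store above: another thread may see them
    // become visible in either order.
    POOL_SIZE.store(size, Ordering::Relaxed);
}

fn publish_ordered(size: usize, ema_bits: usize) {
    EMA_BITS.store(ema_bits, Ordering::Relaxed);
    // Release guarantees the EMA_BITS store is visible to any thread whose
    // Acquire load of POOL_SIZE observes this value.
    POOL_SIZE.store(size, Ordering::Release);
}

fn read_ordered() -> (usize, usize) {
    let size = POOL_SIZE.load(Ordering::Acquire);
    (size, EMA_BITS.load(Ordering::Relaxed))
}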

@spacejam
Contributor

Even with the Relaxed atomics removed, I have some concerns about the long-term maintenance cost of this code due to the large amount of unserialized atomic access and math without clear documentation. Please document the math more clearly for people like me who are less familiar with it, and please document the atomic access patterns, explaining the important correctness criteria at each dependency site, so that future contributors don't introduce bugs.

Since bugs have been found in review, it would also make me feel better if a concurrency test were added that exercises the interleavings more.

You may think that this is a bit draconian, but lock-free programming is not easy for humans to get right, and this code is not obviously correct to me yet.

This needs a bit more work before it's safe from a long-term maintenance perspective. The alternative is to wrap all atomic variables other than FREQUENCY in a Mutex so that the complex math on them is serialized. Please try this out if you're uninterested in adding concurrency tests. A Mutex unlock takes less than 16 ns, and as long as you only do this math on the variables inside the Mutex, and don't do any IO or run a blocking function while holding it, it will be really, really fast.

So, please make this code obviously correct by either using a simple Mutex around all of the variables other than FREQUENCY or by adding concurrent interleaving tests that will cause assumptions we're making to pop out.

@vertexclique
Member Author

r? @spacejam

// on the request rate.
//
// It uses a frequency-based calculation to define work, utilizing the average processing rate.
fn scale_pool() {
Contributor

In another context there was some discussion around adding tracing (using log's trace! macro) to async-std. This function seems like a good place for that, too.
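A hedged sketch of what that suggestion could look like, assuming the log crate as a dependency; the stat-reading helpers are stand-ins, not async-std internals:

// Hypothetical trace line at the top of scale_pool(); helper functions
// below are illustrative stubs, not real async-std code.
fn current_frequency() -> u64 { 0 }
fn current_ema() -> f64 { 0.0 }

fn scale_pool() {
    log::trace!(
        "scale_pool: frequency={}, ema={:.2}",
        current_frequency(),
        current_ema()
    );
    // ... adaptation logic would follow ...
}

fn main() {
    scale_pool();
}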


log::trace or the new tracing library? wondering about how much overhead each would add

Contributor

@killercup killercup Aug 28, 2019

I just know that the idea was thrown around, and it was probably an opt-in feature, too. @stjepang or @yoshuawuyts would know more :)

Contributor

log now has kv support, which we're already using in other places too. tracing can pick these up too, so using log is probably the best choice here overall (:


I think we should get the kv support in tracing-log soon-ish? However, I don't believe that log supports the span-like macros, so async-std might need its own internal span-like macro that dispatches conditionally to async-log or tracing.

I haven't considered how the differences between async-log's span and tracing's span can be bridged, but I think it's feasible if tracing's span!().enter() is used to mimic async-log's nested span block.

Contributor

@yoshuawuyts yoshuawuyts Sep 1, 2019

@davidbarsky we could probably use async-log::span for this. Though there are still a few open issues on the repo, the interface wouldn't change.

From convos with Eliza during RustConf there's definitely a desire to bridge the two approaches, and we mostly need to figure out how to go about it. async-rs/async-log#7 is what I've currently got (repo incoming soon), but I need to check how well that works (:

@davidbarsky davidbarsky Sep 1, 2019

@yoshuawuyts Yep! I think we're on the same page about wanting to bridge the two libraries.

I'm spitballing one mechanism for bridging that gap: using mutually exclusive feature flags in a build.rs to dispatch to either async-log or tracing. This would entail using an async-std-specific span macro until the gaps (if any exist) between async-std's span and tracing's span are closed. I'm personally excited to dig into rust-lang/log#353!

I'm sorry if what I communicated wasn't clear!

@vertexclique
Member Author

So, to sum up, all the changes that I pushed:

  • The values moved around are now behind a Mutex.
  • The values are now calculated cleanly, and I documented much more explicitly how things work, including the formula itself.
  • Fixed “Blocking pool doesn't have backpressure on OSX” #126 (an OS-specific problem, hard to hit but possible).
  • No significant performance degradation. For the curious:
running 2 tests
test blocking        ... bench:  48,959,842 ns/iter (+/- 46,653,507)
test blocking_single ... bench:       3,322 ns/iter (+/- 8,452)
  • I think I covered most of the cases with tests. They are all skipped by default since they are exhaustive, but for the curious cats, here is how you can run them:
$ cargo test longhauling_task_join -- --ignored --exact --nocapture
$ cargo test slow_join_interrupted -- --ignored --exact --nocapture
$ cargo test slow_join -- --ignored --exact --nocapture
  • NOTE: The benchmarks cover the common cases; the tests cover the possible edge cases.

Contributor

@killercup killercup left a comment

Thanks for writing these docs! I left some inline comments to clarify some of it. In general I think this is a really neat, concise (just 300 LOC!) implementation. With a bit more docs this could be used as reference material for understanding the paper and task scheduling, but I don't think we need to focus on that before landing this PR :)

// Should be less than 200_000 ns.
// The previous implementation panics when this test runs.
assert!(elapsed < 200_000);
}
Contributor

These tests are generally non-reproducible, and it's unclear whether they are valuable. They tend to fail on my fairly modern 2018 i7 MBP.

Contributor

When I said concurrency tests before, I meant adding randomized delays to operations that involve cross-thread coordination via atomics (or higher-level primitives that invoke them internally). These might be OK as a benchmark, so you can compare results on your own machine before and after changes, but I don't think they are useful as tests.
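A hypothetical sketch of that style of test: random short delays injected around shared-state updates so more interleavings get exercised. The delay source and the invariant checked here are illustrative, not async-std's real internals.

use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::{Duration, SystemTime, UNIX_EPOCH};

// Cheap pseudo-random jitter; a real test might use the rand crate instead.
fn jitter() {
    let nanos = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .unwrap()
        .subsec_nanos() % 1_000;
    thread::sleep(Duration::from_nanos(nanos as u64));
}

fn main() {
    let counter = Arc::new(AtomicUsize::new(0));
    let handles: Vec<_> = (0..8)
        .map(|_| {
            let counter = Arc::clone(&counter);
            thread::spawn(move || {
                for _ in 0..1_000 {
                    jitter(); // shake out orderings a plain loop never hits
                    counter.fetch_add(1, Ordering::Relaxed);
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    // The invariant must hold under every interleaving the jitter produces.
    assert_eq!(counter.load(Ordering::Relaxed), 8 * 1_000);
}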

Member Author

@vertexclique vertexclique Aug 30, 2019

Please try running the tests with master's blocking.rs; they will panic. Also, please take a look at the referenced issue.

Contributor

They panic on this branch on a fairly modern machine, so I'm not sure they are useful on machines other than yours. In general, timing-based performance tests are an antipattern, as they have big reproducibility issues. If you want to do this kind of testing, it's better to use something like cachegrind to count CPU instructions executed.
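For reference, a cachegrind run over the existing bench binary might look like the following; the binary name's hash suffix is a placeholder that varies per build:

$ cargo bench --no-run
$ valgrind --tool=cachegrind ./target/release/deps/blocking-<hash>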

Contributor

Because they don't actually assert on anything now (which was what caused them to fail on my laptop), I think they are better as benchmarks, right? I don't think it makes sense to have tests that can't fail.

Member Author

“I don't think it makes sense to have tests that can't fail.”

Please read my comments carefully. You can see what it tests in action:
https://asciinema.org/a/R6b3UcFo0xfNo1sCPOPJrB04T

“When I said concurrency tests before, I meant adding randomized delays to operations that involve cross-thread coordination via atomics (or higher-level primitives that invoke them internally).”

Unexposed architecture shouldn't be tested in a white-box fashion.

Contributor

Where are the assertions that fail? The asciinema is too long to watch.

Possible concurrent interleavings are exposed architecture, and this kind of test has resulted in a huge number of bugs being discovered in various lock-free projects I've been involved in. Concurrency is really hard to get right sometimes, and the original PR with more unserialized atomic accesses would have had bugs jump out pretty quickly with this approach.

@vertexclique vertexclique force-pushed the ema-based-statistically-adaptive-thread-pool branch from 0377429 to e934a5e on August 30, 2019 20:25
/// # Arguments
///
/// * `freq_queue` - Sliding window of frequency samples
#[inline]
Contributor

@spacejam spacejam Sep 2, 2019

In general, humans are bad at guessing whether something should be inlined. Usually in Rust it's better to leave these attributes out unless some code is going to be executed in a super hot path and you've done the work to ensure the improvement is measurable.

Member Author

This comment doesn't contribute to the overall discussion and review; I found it quite out of context, and straight from the book. I already ran the benchmarks with and without it and decided to leave it inlined. Below is my latest run, repeated after your comment (a rerun shouldn't even be needed, given how the compiler inlines and unrolls):

Mean of 10 consecutive runs of blocking benchmark:

43,057,207.9 ns without inlining
41,816,626.1 ns with inlining

That is also better for users of this library indirectly.

Contributor

@spacejam spacejam Sep 2, 2019

I'm a bit skeptical that change is related to the inline here because this is a function that gets called 5 times per second, and that measurement skew is ~1ms. For a workload that lasts ~40ms, it's not clear to me how measuring things that happen on the order of once every 200ms is relevant.

I'm asking you to simplify your proposed code because we are going to have to keep it working over time, and it's important not to add bits that add complexity for humans without clear benefits.

@vertexclique vertexclique deleted the ema-based-statistically-adaptive-thread-pool branch September 11, 2019 15:04
@vertexclique
Member Author

vertexclique commented Sep 11, 2019

Please track the progress of #181

@vertexclique vertexclique restored the ema-based-statistically-adaptive-thread-pool branch September 20, 2019 11:28