Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix weird behavior when panicking inside synchronization primitives #58042

Closed
wants to merge 1 commit into from

Conversation

jethrogb
Copy link
Contributor

@jethrogb jethrogb commented Feb 1, 2019

Currently, when the stack is unwound while an MPSC channel receiver is blocking on a receive, this will result in the execution of unreachable!() or a deadlock. Various platforms may trigger unwinding while a thread is blocked, due to an error condition.

This PR changes the MPSC drop logic so that unwinding works properly. It also adds tests for all MPSC variants. It can be tricky to trigger unwinding while a thread is blocked. Here I've used pthread_cancel on Linux.

Fixes fortanix/rust-sgx#86

@rust-highfive
Copy link
Collaborator

r? @alexcrichton

(rust_highfive has picked a reviewer for you, use r? to override)

@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Feb 1, 2019
@jethrogb
Copy link
Contributor Author

jethrogb commented Feb 1, 2019

I haven't yet fixed the stream and shared implementations, I wanted to get some feedback on this PR first.

@rust-highfive
Copy link
Collaborator

The job x86_64-gnu-llvm-6.0 of your PR failed on Travis (raw log). Through arcane magic we have determined that the following fragments from the build log may contain information about the problem.

Click to expand the log.
travis_time:end:01643448:start=1549008525142296323,finish=1549008597642748465,duration=72500452142
$ git checkout -qf FETCH_HEAD
travis_fold:end:git.checkout

Encrypted environment variables have been removed for security reasons.
See https://docs.travis-ci.com/user/pull-requests/#pull-requests-and-security-restrictions
$ export SCCACHE_BUCKET=rust-lang-ci-sccache2
$ export SCCACHE_REGION=us-west-1
Setting environment variables from .travis.yml
$ export IMAGE=x86_64-gnu-llvm-6.0
---
[01:10:27] FF..............................................
[01:10:27] thread 'main' panicked at 'Some tests failed', src/tools/compiletest/src/main.rs:502:22
[01:10:27] failures:
[01:10:27] 
[01:10:27] ---- [run-fail] run-fail/mpsc-recv-unwind/stream.rs stdout ----
[01:10:27] 
[01:10:27] error: Error: expected failure status (Signal(6)) but received status ExitStatus(ExitStatus(256)).
[01:10:27] status: exit code: 1
[01:10:27] command: "/checkout/obj/build/x86_64-unknown-linux-gnu/test/run-fail/mpsc-recv-unwind/stream/a"
[01:10:27] ------------------------------------------
[01:10:27] Deadlock detected
[01:10:27] 
[01:10:27] ------------------------------------------
[01:10:27] ------------------------------------------
[01:10:27] stderr:
[01:10:27] ------------------------------------------
[01:10:27] 
[01:10:27] ------------------------------------------
[01:10:27] 
[01:10:27] thread '[run-fail] run-fail/mpsc-recv-unwind/stream.rs' panicked at 'explicit panic', src/tools/compiletest/src/runtest.rs:3311:9
[01:10:27] 
[01:10:27] ---- [run-fail] run-fail/mpsc-recv-unwind/shared.rs stdout ----
[01:10:27] 
[01:10:27] error: Error: expected failure status (Signal(6)) but received status ExitStatus(ExitStatus(256)).
[01:10:27] status: exit code: 1
[01:10:27] command: "/checkout/obj/build/x86_64-unknown-linux-gnu/test/run-fail/mpsc-recv-unwind/shared/a"
[01:10:27] ------------------------------------------
[01:10:27] Deadlock detected
[01:10:27] 
[01:10:27] ------------------------------------------
[01:10:27] ------------------------------------------
[01:10:27] stderr:
[01:10:27] ------------------------------------------
[01:10:27] 
[01:10:27] ------------------------------------------
[01:10:27] 
[01:10:27] thread '[run-fail] run-fail/mpsc-recv-unwind/shared.rs' panicked at 'explicit panic', src/tools/compiletest/src/runtest.rs:3311:9
[01:10:27] 
[01:10:27] failures:
[01:10:27] failures:
[01:10:27]     [run-fail] run-fail/mpsc-recv-unwind/shared.rs
[01:10:27]     [run-fail] run-fail/mpsc-recv-unwind/stream.rs
[01:10:27] test result: FAILED. 145 passed; 2 failed; 1 ignored; 0 measured; 0 filtered out
[01:10:27] 
[01:10:27] 
[01:10:27] 
[01:10:27] 
[01:10:27] command did not execute successfully: "/checkout/obj/build/x86_64-unknown-linux-gnu/stage0-tools-bin/compiletest" "--compile-lib-path" "/checkout/obj/build/x86_64-unknown-linux-gnu/stage2/lib" "--run-lib-path" "/checkout/obj/build/x86_64-unknown-linux-gnu/stage2/lib/rustlib/x86_64-unknown-linux-gnu/lib" "--rustc-path" "/checkout/obj/build/x86_64-unknown-linux-gnu/stage2/bin/rustc" "--src-base" "/checkout/src/test/run-fail" "--build-base" "/checkout/obj/build/x86_64-unknown-linux-gnu/test/run-fail" "--stage-id" "stage2-x86_64-unknown-linux-gnu" "--mode" "run-fail" "--target" "x86_64-unknown-linux-gnu" "--host" "x86_64-unknown-linux-gnu" "--llvm-filecheck" "/usr/lib/llvm-6.0/bin/FileCheck" "--host-rustcflags" "-Crpath -O -Zunstable-options  -Lnative=/checkout/obj/build/x86_64-unknown-linux-gnu/native/rust-test-helpers" "--target-rustcflags" "-Crpath -O -Zunstable-options  -Lnative=/checkout/obj/build/x86_64-unknown-linux-gnu/native/rust-test-helpers" "--docck-python" "/usr/bin/python2.7" "--lldb-python" "/usr/bin/python2.7" "--gdb" "/usr/bin/gdb" "--quiet" "--llvm-version" "6.0.0\n" "--system-llvm" "--cc" "" "--cxx" "" "--cflags" "" "--llvm-components" "" "--llvm-cxxflags" "" "--adb-path" "adb" "--adb-test-dir" "/data/tmp/work" "--android-cross-path" "" "--color" "always"
[01:10:27] 
[01:10:27] 
[01:10:27] failed to run: /checkout/obj/build/bootstrap/debug/bootstrap test
[01:10:27] Build completed unsuccessfully in 0:11:04
[01:10:27] Build completed unsuccessfully in 0:11:04
[01:10:27] make: *** [check] Error 1
[01:10:27] Makefile:48: recipe for target 'check' failed
The command "stamp sh -x -c "$RUN_SCRIPT"" exited with 2.
travis_time:start:0103c0bb
$ date && (curl -fs --head https://google.com | grep ^Date: | sed 's/Date: //g' || true)
Fri Feb  1 09:20:35 UTC 2019
---
travis_time:end:1ba81ae2:start=1549012837125012095,finish=1549012837130425587,duration=5413492
travis_fold:end:after_failure.3
travis_fold:start:after_failure.4
travis_time:start:1b5084f8
$ ln -s . checkout && for CORE in obj/cores/core.*; do EXE=$(echo $CORE | sed 's|oBRT, Aborted.
#0  0x00007fce80c35428 in ?? ()
[Current thread is 1 (LWP 24071)]
#0  0x00007fce80c35428 in ?? ()
#1  0x00007fce80c3702a in ?? ()
#2  0x0000000000000020 in ?? ()
#3  0x0000000000000000 in ?? ()

I'm a bot! I can only do what humans tell me to, so if this was not helpful or you have suggestions for improvements, please ping or otherwise contact @TimNN. (Feature Requests)

@alexcrichton
Copy link
Member

Sorry I don't really understand what's motivating this PR, this seems like something that surely would have been caught in the half a decade these types have been in use. Is this something like the platform is causing a panic at a location that wasn't expected? If so, where?

@jethrogb
Copy link
Contributor Author

jethrogb commented Feb 4, 2019

Sorry I don't really understand what's motivating this PR

The 4 test cases that I've added are motivating this. Run them today and you'll run into unreachable! or deadlocks. Ideally, the tests would be run-pass tests, but I couldn't figure out how to write them that way, so instead I'm checking for a specific failure condition that indicates everything as expected.

this seems like something that surely would have been caught in the half a decade these types have been in use.

I guess not. Like I said, it can be tricky to trigger unwinding while a thread is blocked.

Is this something like the platform is causing a panic at a location that wasn't expected? If so, where?

Yes, inside a call to mpsc::Receiver::recv

@bors
Copy link
Contributor

bors commented Feb 9, 2019

☔ The latest upstream changes (presumably #58316) made this pull request unmergeable. Please resolve the merge conflicts.

@alexcrichton
Copy link
Member

Thanks for the explanations, but I don't think this is something that we'll want to support in the standard library. Thread cancellation seems like it's an additional features to the channels in the standard library, and I think it'd be fine to document that they're not compatible but otherwise feature development of channels is encouraged to happen externally in crates like crossbeam-channel for this use case.

@jethrogb
Copy link
Contributor Author

I don't think you've understood the heart of the issue yet. The problem is not specific to thread cancellation. The issue occurs when when the stack is unwound while an MPSC channel receiver is blocking on a receiver. Thread cancellation is just an easy way to showcase this on Linux, but there are other ways to trigger unwinding at that point.

@alexcrichton
Copy link
Member

Can you show an example test case that deadlocks without using thread cancellation?

@jethrogb
Copy link
Contributor Author

According to the Linux man page for pthread_cond_timedwait, it may return with EINTR. I haven't been able to figure out how to trigger that, but here I've simulated that by just implementing pthread_cond_timedwait. This program never exits.

#![feature(rustc_private)] // for libc, you can also use crates.io libc instead.

extern crate libc;

#[no_mangle]
pub unsafe extern "C" fn pthread_cond_timedwait(
    _cond: *mut libc::pthread_cond_t,
    _mutex: *mut libc::pthread_mutex_t,
    _abstime: *const libc::timespec
) -> libc::c_int {
    *libc::__errno_location() = libc::EINTR;
    return 1;
}

fn main() {
    let (s, r) = std::sync::mpsc::channel();
    s.send(()).unwrap();
    r.recv().unwrap();
    s.send(()).unwrap();
    r.recv().unwrap();
    r.recv_timeout(std::time::Duration::from_millis(1)).unwrap()
}

@jethrogb
Copy link
Contributor Author

If unwinding were supported in Wasm, I'm pretty this would hang right now after printing the panic message:

#[cfg(test)]
extern crate wasm_bindgen_test;
#[cfg(test)]
use wasm_bindgen_test::*;

#[cfg(test)]
#[wasm_bindgen_test]
pub fn recv() {
    let (tx, rx) = std::sync::mpsc::channel::<()>();
    let _ = tx.clone();
    let _ = rx.recv();
}

@alexcrichton
Copy link
Member

EINTR should be handled by the condvar implementation and it's a bug if it isn't, and wasm is a special case that I don't think really counts for this because it doesn't have threads anyway

@jethrogb
Copy link
Contributor Author

Note to triage: I will return to this PR March 4.

@Dylan-DPC-zz
Copy link

ping from triage @jethrogb it is march 4 today ;) you have a conflict to resolve

@Dylan-DPC-zz Dylan-DPC-zz added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Mar 4, 2019
@jethrogb
Copy link
Contributor Author

jethrogb commented Mar 4, 2019

@Dylan-DPC lol the conflict is the least of my problems on this PR :)

@Dylan-DPC-zz
Copy link

@jethrogb i know :P

@jethrogb
Copy link
Contributor Author

jethrogb commented Mar 4, 2019

In private chat message, Alex suggested I ping @Amanieu @aturon @BurntSushi @dtolnay @Kimundi @KodrAus @sfackler @SimonSapin @withoutboats

This discussion is also relevant for #58461 as these issues address similar problems.

I think std should always behave reasonably in the face of uncommon errors triggered by the platform. In my opinion, "reasonable behavior" does not include deadlocks, actually calling unreachable!, poisoning locks incorrectly, or double panics. The code in sync/ and sys/ is littered with unwrap and assert!, many of those could trigger the "unreasonable" behavior I described in these PRs.

So, these PRs consists of two parts:

  1. A fix for the dubious behavior caused by uncommon platform errors.
  2. A test for these fixes. The proposed test cases are pretty weird, but that's just to be able to test these kinds of failures consistently. Just because we can't trigger a particular failure easily without going through hoops doesn't mean these things can't happen. A future change to the Linux kernel could just start returning an error code from a syscall that we weren't expecting before. People might be using Rust with LD_PRELOAD wrappers for some kind of instrumentation. New platforms added to std might error in previously unexpected cases. Et cetera.

NB. I'm sure these problems will come up again when replacing some of the primitives in std with parking_lot and crossbeam-channel.

@alexcrichton
Copy link
Member

To be clear I recommended the possibility looping in other folks, not necessarily pinging every historical and current libs team member by name.

@jethrogb
Copy link
Contributor Author

jethrogb commented Mar 4, 2019

I'm sorry if I pinged people who are not actually on the team right now, I got the list from https://www.rust-lang.org/governance/teams/library.

@Amanieu
Copy link
Member

Amanieu commented Mar 4, 2019

If the underlying platform is violating its own ABI, either due to updates or because the user is deliberately messing with us (LD_PRELOAD), then I don't think that we can reasonably be expected to function correctly.

A panic message followed by either a crash or a deadlock should be a pretty clear message to the user that says "YOUR PLATFORM IS BROKEN".

@Amanieu
Copy link
Member

Amanieu commented Mar 4, 2019

According to the Linux man page for pthread_cond_timedwait, it may return with EINTR.

My manpage (and the POSIX spec) specifically say that pthread_cond_timedwait will never return EINTR.

@jethrogb
Copy link
Contributor Author

jethrogb commented Mar 4, 2019

Note that I've provided 3 different test cases only one of which has to do with the "platform is violating its own ABI". I really think you should all focus less on the test cases and more on the general issue.

The Ubuntu 16.04 man page definitely says EINTR may happen.

@jethrogb
Copy link
Contributor Author

jethrogb commented Mar 4, 2019

Another way a panic might occur in “unexpected” places is if the process is using Linux seccomp.

I must say that for a language that has a big focus on correct error handling, I'm somewhat surprised by the pushback on doing exactly that.

@alexcrichton
Copy link
Member

I personally agree with @Amanieu that if fundamental APIs are changed we can't really do anything about that and crashing as soon as we can is the best bet. For example if pthread_create actually spawned two threads, I wouldn't expect us to gracefully handle that and shut down one of them while the other continues to run user code.

If actual real-world APIs disagree with the spec I think it's fine to handle that, and as I've said a number of times on #58461 as well it's fine to handle EINTR for condvars, we'd just loop again or return that we timed out. If this comes up with seccomp as well that's a new error to handle and we can add that in.


As a meta point the discussion is being strewn across this PR and #58461 and keeping track of them isn't the easiest to do. Can one of these PRs be closed and merged with the other or something like that?

@jethrogb jethrogb changed the title Fix reaching unreachable code or deadlocks when unwinding in MPSC Fix weird behavior when panicking inside synchronization primitives Mar 14, 2019
@jethrogb
Copy link
Contributor Author

jethrogb commented Mar 14, 2019

I've closed #58461

I'm only interested in solving the general problem of making the synchronization primitives robust against panicking. I'm not interested in auditing all code in sync/ and sys/ for possible error conditions that were missed. Maybe once Rust gets a way to enforce "no panics" as a function attribute that can be done. If you don't think there's a general problem to be solved, then I can't make any progress on this PR.

@Dylan-DPC-zz
Copy link

ping from triage @jethrogb any updates?

@jethrogb
Copy link
Contributor Author

Should be S-waiting-for-review

@Centril Centril added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. labels Mar 30, 2019
@Centril
Copy link
Contributor

Centril commented Mar 30, 2019

Ping from triage @alexcrichton

@alexcrichton
Copy link
Member

I'm going to close this PR because those on the libs team who have voiced their thoughts on this PR have indicated that they would not take this strategy. Otherwise it looks like this PR isn't making much progress.

We're always interested in fixes to primitives if necessary, but I don't think we're going to place too many safeguards in place to have the standard library work as correctly as possible in the face of a system that is misbehaving. Rather we'll make sure an error message gets out and nothing segfaults at minimum.

@jethrogb
Copy link
Contributor Author

jethrogb commented Apr 1, 2019

Deadlocks are not in the set of acceptable behaviors you just described ("an error message gets out and nothing segfaults").

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
S-waiting-on-review Status: Awaiting review from the assignee but also interested parties.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants