Document/police/simplify locking hierarchy in Rust runtime #779

daviddrysdale · 2020-03-27T14:33:46Z

There are various granular locks in the Rust runtime, and several places where multiple locks are acquired at the same time. The graph of locks-held is currently acyclic, but I'm concerned that it would be easy to accidentally change that and set ourselves up for a deadlock.

At the moment the graph seems to be:

runtime.channels.channels -> runtime.channels.readers
runtime.channels.channels -> runtime.channels.writers
runtime.channels.channels -> channel.waiting_threads
runtime.channels.channels -> channel.messages
channel.messages -> runtime.nodes
runtime.channels.channels -> runtime.nodes
runtime.nodes -> node.handles

There are also various places where the scope of locks could be significantly reduced, which should also help with this.
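As a minimal sketch of what reducing lock scope could look like (the `Runtime` struct, its field types and `attach_label` here are hypothetical stand-ins, not the real runtime code): copy the needed data out under the first lock in a single statement, so the guard is gone before the second lock is taken and the two locks are never held together.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

// Hypothetical stand-ins for the runtime's channel and node tables.
struct Runtime {
    channels: Mutex<HashMap<u64, String>>,
    nodes: Mutex<HashMap<u64, Vec<String>>>,
}

impl Runtime {
    // Copies the label out under the `channels` lock, then takes the `nodes`
    // lock separately. Returns false if the channel does not exist.
    fn attach_label(&self, channel_id: u64, node_id: u64) -> bool {
        // The guard is a temporary here, so it is dropped at the end of this
        // `let` statement; only the cloned label survives.
        let label = self.channels.lock().unwrap().get(&channel_id).cloned();
        match label {
            Some(label) => {
                let mut nodes = self.nodes.lock().unwrap();
                nodes.entry(node_id).or_default().push(label);
                true
            }
            None => false,
        }
    }
}

fn main() {
    let rt = Runtime {
        channels: Mutex::new(HashMap::new()),
        nodes: Mutex::new(HashMap::new()),
    };
    rt.channels.lock().unwrap().insert(1, "log".to_string());
    assert!(rt.attach_label(1, 7));
    assert!(!rt.attach_label(2, 7));
    let nodes = rt.nodes.lock().unwrap();
    assert_eq!(nodes.get(&7).map(Vec::len), Some(1));
}
```

The cost is an extra clone and the possibility that the channel changes between the two critical sections, so this only fits places where a snapshot is good enough.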


daviddrysdale · 2020-04-01T10:33:54Z

Suggestions:

Remove let bindings of Mutex guards where possible, to reduce scope of locking. (In Reduce scope and number of locks in Rust runtime #798)
Drop the NodeInfo.handles lock, as it's always held in combination with runtime.node_infos. (In Reduce scope and number of locks in Rust runtime #798)
Drop the Channel.waiting_threads mutex, as it's always held in combination with runtime.channels.channels.
Combine the runtime.node_infos and runtime.node_instances locks.
Combine the runtime.channels.channels, runtime.channels.readers and runtime.channels.writers locks into a single lock.

Net, that would drop the number of distinct locks from 8 to 3:

runtime.node_infos
~~runtime.node_instances~~
runtime.channels.channels
~~runtime.channels.readers~~
~~runtime.channels.writers~~
channel.messages
~~channel.waiting_threads~~
~~node_info.handles~~
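To illustrate the last suggestion, here is a rough sketch of folding the three channel-related maps behind one lock. The `ChannelTable` and `ChannelMapping` types, their fields and the ID scheme are assumptions made up for this example; the point is only that a multi-map update becomes one critical section with no internal lock order to get wrong.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

// Hypothetical merged table: the three maps that previously had a lock each.
#[derive(Default)]
struct ChannelTable {
    channels: HashMap<u64, Vec<Vec<u8>>>, // channel id -> queued messages
    readers: HashMap<u64, u64>,           // reader handle -> channel id
    writers: HashMap<u64, u64>,           // writer handle -> channel id
    next_id: u64,
}

#[derive(Default)]
struct ChannelMapping {
    table: Mutex<ChannelTable>,
}

impl ChannelMapping {
    // Creates a channel plus one reader and one writer handle while holding
    // the single lock, and returns (reader, writer).
    fn new_channel(&self) -> (u64, u64) {
        let mut table = self.table.lock().unwrap();
        let channel_id = table.next_id;
        let (reader, writer) = (channel_id + 1, channel_id + 2);
        table.next_id += 3;
        table.channels.insert(channel_id, Vec::new());
        table.readers.insert(reader, channel_id);
        table.writers.insert(writer, channel_id);
        (reader, writer)
    }

    // True if both handles refer to the same live channel.
    fn same_channel(&self, reader: u64, writer: u64) -> bool {
        let table = self.table.lock().unwrap();
        match (table.readers.get(&reader), table.writers.get(&writer)) {
            (Some(r), Some(w)) => r == w && table.channels.contains_key(r),
            _ => false,
        }
    }
}

fn main() {
    let mapping = ChannelMapping::default();
    let (r1, w1) = mapping.new_channel();
    let (r2, _w2) = mapping.new_channel();
    assert!(mapping.same_channel(r1, w1));
    assert!(!mapping.same_channel(r2, w1));
}
```

The trade-off is coarser locking: operations that only touched one of the maps now contend with each other.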

daviddrysdale · 2020-04-30T06:16:14Z

One new discovery: there's also another lock hidden under the covers when doing output (std::io::stdio::Stdout::lock), so it's not a good idea to acquire or hold any of our own locks while producing logging output.

So this:

impl std::fmt::Debug for Channel {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        write!(
            f,
            "Channel {{ #msgs={}, #readers={}, #writers={}, label={:?} }}",
            self.messages.read().unwrap().len(),
            self.readers.load(SeqCst),
            self.writers.load(SeqCst),
            self.label,
        )
    }
}

would be better as something like:

impl std::fmt::Debug for Channel {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        // Don't hold a lock while performing output.
        let msg_count = { self.messages.read().unwrap().len() };
        write!(
            f,
            "Channel {{ id={}, #msgs={}, #readers={}, #writers={}, label={:?} }}",
            self.id,
            msg_count,
            self.reader_count.load(SeqCst),
            self.writer_count.load(SeqCst),
            self.label,
        )
    }
}

tiziano88 · 2020-04-30T06:21:58Z

Are the braces on the RHS necessary? I thought that since it's just extracting an int, which is Copy, the lock is only held until the end of the statement.

daviddrysdale · 2020-04-30T06:42:22Z

I think you're right, but there seem to be some interesting gotchas with Rust temporary lifetimes that make me want to be extra careful.

For example, from these docs I wonder if part of the original problem is triggered because the write! is the final expression of the function body.
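A small standalone experiment along these lines (the `write_lock_is_free` helper is made up for the test, it is not runtime code): a guard created as a temporary inside a larger expression lives until the end of the enclosing statement, so it is still held while the outer call runs, whereas hoisting it into its own `let` releases it at the semicolon, with or without extra braces.

```rust
use std::sync::RwLock;

// Returns true if the write lock can be taken right now, i.e. no read guard
// is alive. `_count` exists only so a caller can compute it inline.
fn write_lock_is_free(_count: usize, lock: &RwLock<Vec<u8>>) -> bool {
    lock.try_write().is_ok()
}

fn main() {
    let messages = RwLock::new(vec![1u8, 2, 3]);

    // The read guard is a temporary in the argument list, so it stays alive
    // for the whole statement, including the call itself.
    let free_inline = write_lock_is_free(messages.read().unwrap().len(), &messages);
    assert!(!free_inline);

    // Hoisted into its own `let`: the temporary guard is dropped at the `;`,
    // and only the copied `usize` remains.
    let count = messages.read().unwrap().len();
    let free_hoisted = write_lock_is_free(count, &messages);
    assert!(free_hoisted);
}
```

This matches the shape of the `Debug` impl above: passing `self.messages.read().unwrap().len()` directly as a `write!` argument keeps the guard alive for the duration of the formatting call.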

daviddrysdale · 2020-05-08T05:25:13Z

Current state is that we're down to 4 locks:

Runtime::node_infos (RwLock)
Runtime::aux_servers (Mutex)
Channel::messages (RwLock)
Channel::waiting_threads (Mutex)

Each lock is taken alone in normal runtime operation, but there are a few places that use the combination Runtime::node_infos -> Channel::messages (during shutdown or runtime-wide introspection).
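For reference, a simplified sketch of that one permitted combination, always taken outer lock first. The struct layouts and the `pending_messages` method are assumptions for illustration only; the real `Runtime`, `NodeInfo` and `Channel` types carry more state.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

// Simplified, hypothetical shapes for the types named above.
struct Channel {
    messages: RwLock<Vec<Vec<u8>>>,
}

struct NodeInfo {
    channels: Vec<Arc<Channel>>,
}

struct Runtime {
    node_infos: RwLock<HashMap<u64, NodeInfo>>,
}

impl Runtime {
    // Runtime-wide introspection: takes `node_infos` first, then each
    // channel's `messages` lock in turn, and never the other way round.
    fn pending_messages(&self) -> usize {
        let node_infos = self.node_infos.read().unwrap();
        let total: usize = node_infos
            .values()
            .flat_map(|info| info.channels.iter())
            // Each inner guard is a temporary dropped at the end of the closure.
            .map(|channel| channel.messages.read().unwrap().len())
            .sum();
        total
    }
}

fn main() {
    let busy = Arc::new(Channel { messages: RwLock::new(vec![vec![1], vec![2]]) });
    let quiet = Arc::new(Channel { messages: RwLock::new(vec![vec![3]]) });
    let mut infos = HashMap::new();
    infos.insert(1, NodeInfo { channels: vec![busy] });
    infos.insert(2, NodeInfo { channels: vec![quiet] });
    let runtime = Runtime { node_infos: RwLock::new(infos) };
    assert_eq!(runtime.pending_messages(), 3);
}
```

As long as no code path takes `Runtime::node_infos` while already holding a `Channel::messages` guard, this pair cannot form a cycle.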

So I think we can close this now.

daviddrysdale · 2020-05-08T05:28:16Z

Also, our CI now (#915) runs both integration tests and a key unit test with TSAN, which should help prevent problems creeping in in future.

daviddrysdale mentioned this issue Mar 30, 2020

Possible data race in Rust runtime #780

Closed

daviddrysdale changed the title ~~Document/police locking hierarchy in Rust runtime~~ Document/police/simplify locking hierarchy in Rust runtime Mar 30, 2020

This was referenced Mar 31, 2020

Split node initialization and start #789

Merged

Reduce scope and number of locks in Rust runtime #798

Merged

daviddrysdale self-assigned this Apr 21, 2020

daviddrysdale added the lang/Rust label Apr 30, 2020

daviddrysdale closed this as completed May 8, 2020
