Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Document/police/simplify locking hierarchy in Rust runtime #779

Closed
daviddrysdale opened this issue Mar 27, 2020 · 6 comments
Closed

Document/police/simplify locking hierarchy in Rust runtime #779

daviddrysdale opened this issue Mar 27, 2020 · 6 comments
Assignees

Comments

@daviddrysdale
Copy link
Contributor

There are various granular locks in the Rust runtime, and several places where multiple locks are acquired at the same time. The graph of locks-held is currently acyclic, but I'm concerned that it would be easy to accidentally change that and set ourselves up for a deadlock.

At the moment the graph seems to be:

runtime.channels.channels -> runtime.channels.readers
runtime.channels.channels -> runtime.channels.writers
runtime.channels.channels -> channel.waiting_threads
runtime.channels.channels -> channel.messages
channel.messages -> runtime.nodes
runtime.channels.channels -> runtime.nodes
runtime.nodes -> node.handles

There are also various places where the scope of locks could be significantly reduced, which should also help with this.

@daviddrysdale daviddrysdale changed the title Document/police locking hierarchy in Rust runtime Document/police/simplify locking hierarchy in Rust runtime Mar 30, 2020
@daviddrysdale
Copy link
Contributor Author

Suggestions:

Net, that would drop the number of distinct locks from 8 to 3:

  • runtime.node_infos
  • runtime.node_instances
  • runtime.channels.channels
  • runtime.channels.readers
  • runtime.channels.writers
  • channel.messages
  • channel.waiting_threads
  • node_info.handles

@daviddrysdale daviddrysdale self-assigned this Apr 21, 2020
@daviddrysdale
Copy link
Contributor Author

One new discovery: there's also another lock hidden under the covers when doing output (std::io::stdio::Stdout::lock), so it's not a good idea to do anything with locks in logging output.

So:

impl std::fmt::Debug for Channel {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        write!(
            f,
            "Channel {{ #msgs={}, #readers={}, #writers={}, label={:?} }}",
            self.messages.read().unwrap().len(),
            self.readers.load(SeqCst),
            self.writers.load(SeqCst),
            self.label,
        )
    }
}

would be better as something like:

impl std::fmt::Debug for Channel {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        // Don't hold a lock while performing output.
        let msg_count = { self.messages.read().unwrap().len() };
        write!(
            f,
            "Channel {{ id={}, #msgs={}, #readers={}, #writers={}, label={:?} }}",
            self.id,
            msg_count,
            self.reader_count.load(SeqCst),
            self.writer_count.load(SeqCst),
            self.label,
        )
    }
}

@tiziano88
Copy link
Collaborator

Are the braces on the RHS necessary? I thought since it's just extracting an int, which is Copy, the scope of the lock is just until the end of line.

@daviddrysdale
Copy link
Contributor Author

I think you're right, but there seem to be some interesting gotchas with Rust temporary lifetimes that make me want to be extra careful.

For example, from these docs I wonder if part of the original problem is triggered because the write! is the final expression of the function body.

@daviddrysdale
Copy link
Contributor Author

Current state is that we're down to 4 locks:

  • Runtime::node_infos (RwLock)
  • Runtime::aux_servers (Mutex)
  • Channel::messages (RwLock)
  • Channel::waiting_threads (Mutex)

Each lock is taken alone in normal runtime operation, but there are a few places that use the combination Runtime::node_infos -> Channel::messages (during shutdown or runtime-wide introspection).

So I think we can close this now.

@daviddrysdale
Copy link
Contributor Author

Also, our CI now (#915) runs both integration tests and a key unit test with TSAN, which should help prevent problems creeping in in future.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants