Split node initialization and start #789
Conversation
LGTM. Out of curiosity, is the ABI `node_create` gaining the ability to start with multiple initial handles, or are the multiple initial handles in `node_start_instance` going to be internal only?

Interesting point, I was only thinking of it as internal only, but it may make sense to expose as part of
```rust
node_infos: RwLock<HashMap<NodeId, NodeInfo>>,

/// Currently running node instances, so that [`Runtime::stop`] can terminate all of them.
node_instances: Mutex<HashMap<NodeId, Box<dyn crate::node::Node>>>,
```
These two fields look like they should be covered by the same `Mutex`.

(In C++ this would probably be a base class that holds data (≈ `NodeInfo`), but which still has pure virtual functions to be overridden (≈ `node::Node`). Not completely sure what the Rust equivalent would be – maybe to extend the `node::Node` trait to cover behaviour encompassed by what's in `NodeInfo`, and then each implementor of the trait has-a `NodeInfo` for that.)
I am not sure I see the problem with the current split, do you have an example of something that may go wrong in this case? (Perhaps I'll even add a test to make sure it does not happen, if it's doable.)

Anyway, I suspect what you are suggesting is actually similar to the previous version, in which `Node` (now `NodeInfo`) had an `instance` field itself, but to me, separating them this way makes `Runtime::stop` cleaner (assuming it's not incorrect).
Having small granular locks for things that may need to be updated in a synchronized fashion implies a locking hierarchy, and there's nothing here that documents such a locking hierarchy.

The obvious example is if there's (say) a current chunk of code that adds a new Node, and which does so by first adding to `node_infos` and then adding to `node_instances`. If someone later adds different code that happens to add things to `node_instances` before `node_infos`, then there's a deadlock.

If all related state – here, everything that's indexed by `NodeId` – is under the same lock, then there's no trap for unwary maintainers in the future.
But the point of this PR is exactly to decouple those two fields, so that they can be updated independently (no need to ever hold a lock over both). I have added a helper method `add_node_instance` to make things clearer now, and added more comments. PTAL.
```diff
@@ -29,6 +29,10 @@ mod wasm;
 /// Nodes must not do any work until the [`Node::start`] method is invoked.
 pub trait Node: Send + Sync {
     /// Starts executing the node.
+    ///
+    /// This method may block, but node implementations should make sure that they create separate
+    /// background threads to execute actual work, and wait on them to terminate when [`Node::stop`]
```
I didn't really understand this comment. Why would `start()` block if the work has to be done in a separate thread?
You are right, that was quite confusing, I hope I clarified now, PTAL.
```diff
@@ -223,12 +220,12 @@ impl Runtime {
         return Ok(());
     }

-    let nodes = self.nodes.read().unwrap();
+    let node_infos = self.node_infos.read().unwrap();
```
(pre-existing) This lock scope extends to the end of the method, which is needed because of the use of the `NodeInfo` that's owned by the `node_infos` field. I wonder if a `with_node_info(fn)` accessor might make this pattern clearer/safer?
What would we provide as a function in that case? If you are talking just about extracting `tracked_handles`, I don't think that would improve things, we would still either return a reference, or have to clone it. I am probably missing your point though, could you clarify?
I was thinking about something analogous to the `ChannelMapping::with_channel` method; the function would just be the code in the rest of the block, but with the scope of lock & reference ownership clearer:

```rust
fn with_node_info<U, F: FnOnce(&NodeInfo) -> U>(&self, node_id: NodeId, f: F) -> U {
    let node_infos = self.node_infos.read().unwrap();
    let node_info = node_infos
        .get(&node_id)
        .expect("Invalid node_id passed into with_node_info");
    f(node_info)
}

/// Validate the [`NodeId`] has access to [`Handle`], returning `Err(OakStatus::ErrBadHandle)`
/// if access is not allowed.
fn validate_handle_access(&self, node_id: NodeId, handle: Handle) -> Result<(), OakStatus> {
    // Allow RUNTIME_NODE_ID access to all handles.
    if node_id == RUNTIME_NODE_ID {
        return Ok(());
    }
    self.with_node_info(node_id, |node_info| {
        let tracked_handles = node_info.handles.lock().unwrap();
        // Check the handle exists in the handles associated with a node, otherwise
        // return ErrBadHandle.
        if tracked_handles.contains(&handle) {
            Ok(())
        } else {
            error!(
                "validate_handle_access: handle {:?} not found in node {:?}",
                handle, node_id
            );
            Err(OakStatus::ErrBadHandle)
        }
    })
}
```
Probably for another day though (maybe under #779)
This seems like it is just re-implementing lexical scoping? I agree with the sentiment, but I think we should instead be more conscious of how we scope variables (especially lock guards), possibly by adding explicit anonymous blocks to delimit scopes (with appropriate comments). But happy to keep discussing separately.
@daviddrysdale what do you think about going even further and turning even e.g. `let tracked_handles = node_info.handles.lock().unwrap();` into closure-passing style (e.g. `node_info.with_handles(|handles| handles.contains(handle))`)? Some consistency around locking patterns might be nice, but I'm not sure if it's worth going this far.

I'm also interested in whether scoping locks under lifetimes could be more ergonomic, something like:

```rust
pub fn with_all_channels<'a>(
    &'a self,
) -> std::sync::RwLockReadGuard<'a, HashMap<ChannelId, Channel>> {
    self.channels.read().unwrap()
}

// This would need something like MappedRwLockReadGuard from lock_api:
// https://docs.rs/lock_api/0.3.3/lock_api/struct.MappedRwLockReadGuard.html
pub fn with_channel<'a>(
    &'a self,
    channel_id: ChannelId,
) -> MappedRwLockReadGuard<'a, Result<Channel, OakStatus>> {
    self.channels.read().unwrap().map(|channels| {
        channels.get(&channel_id).ok_or(OakStatus::ErrBadHandle)
    })
}
```

Although this would reintroduce lock scopes being overly large if not careful. (edit: lexical scoping)
(force-pushed from dc73c44 to 57bcd7c)
```rust
self.track_handles_in_node(node_reference, initial_handles);

// Try to start the node instance, and return any errors.
let result = node_instance.start();
```
The general advice for writing clean code is that concurrency is a separate concern and usually deserves to be handled in a separate class (a dedicated data structure). I am not sure if this advice is applicable to Rust code, but in any other programming language I've used, I'd write a wrapper class to hold `node_infos` and `node_instances`, with high-level methods for `get`, `insert`, `remove`, etc. All the concurrency logic (including any necessary syncing between `node_infos` and `node_instances`) would be handled in that class. The only problem I currently see with this approach is if `node_instance.start()` must absolutely be called in between updating `node_infos` and updating `node_instances`. Is that the case?
The main point of this PR is for the node instance to be started once the relevant data structures (`NodeInfo` in this case) are available to the runtime. I agree with the desire to encapsulate any synchronization in high-level accessors, but I'm not entirely sure how to go about it here, perhaps let's discuss separately? If you have any idea in mind and would like to prototype it in a PR, that would be great!
Sure. Let's discuss it separately. What I am thinking of is just a bit of refactoring that can be done separately, after you are done with this PR. And I think it might address some of the comments from @daviddrysdale too (but I am not a true Rustacean yet, so I could be wrong).
```rust
let result = node_instance.start();

// Regardless of the result of `Node::start`, insert the now running instance to the list of
// running instances, so thatt `Node::stop` will be called on it eventually.
```
Suggested change:

```diff
-// running instances, so thatt `Node::stop` will be called on it eventually.
+// running instances, so that `Node::stop` will be called on it eventually.
```
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
This also allows writing tests for the newly created `node_start_instance` function, extracted from `node_create`. Fixes project-oak#788

(force-pushed from 57bcd7c to 470d845)
Thanks for the comments!
No need for the discussions on lock scoping and granularity to hold this up
This also allows writing tests for the newly created `node_start_instance` function, extracted from `node_create`. Fixes #788