
Split node initialization and start #789

Merged
merged 1 commit into project-oak:master on Mar 31, 2020

Conversation

tiziano88 (Collaborator):

This also allows writing tests for the newly created
`node_start_instance` function, extracted from `node_create`.

Fixes #788

Checklist

  • Pull request affects core Oak functionality (e.g. runtime, SDK, ABI)
    • I have written tests that cover the code changes.
  • Pull request includes prototype/experimental work that is under
    construction.

@blaxill blaxill (Contributor) left a comment:

LGTM. Out of curiosity, is the ABI node_create gaining the ability to start with multiple initial handles, or are the multiple initial handles in node_start_instance going to be internal only?

@tiziano88 (Collaborator, Author):

> LGTM. Out of curiosity, is the ABI node_create gaining the ability to start with multiple initial handles, or are the multiple initial handles in node_start_instance going to be internal only?

Interesting point: I was only thinking of it as internal only, but it may actually make sense to expose it as part of node_create too! Let's look into that separately.

node_infos: RwLock<HashMap<NodeId, NodeInfo>>,

/// Currently running node instances, so that [`Runtime::stop`] can terminate all of them.
node_instances: Mutex<HashMap<NodeId, Box<dyn crate::node::Node>>>,
Contributor:

These two fields look like they should be covered by the same Mutex.

(In C++ this would probably be a base class that holds data (≈ NodeInfo), but which still has pure virtual functions to be overridden (≈ node::Node). Not completely sure what the Rust equivalent would be – maybe to extend the node::Node trait to cover behaviour encompassed by what's in NodeInfo, and then each implementor of the trait has-a NodeInfo for that.)

Collaborator Author:

I am not sure I see the problem with the current split, do you have an example of something that may go wrong in this case? (perhaps I'll even add a test to make sure it does not happen, if it's doable)

Anyway, I suspect what you are suggesting is actually similar to the previous version, in which Node (now NodeInfo) had an instance field itself; but to me, separating them this way makes Runtime::stop cleaner (assuming it's not incorrect).

Contributor:

Having small granular locks for things that may need to be updated in a synchronized fashion implies a locking hierarchy, and there's nothing here that documents such a locking hierarchy.

The obvious example is if there's (say) a current chunk of code that adds a new Node, and which does so by first adding to node_infos then adds to node_instances. If someone later adds different code that happens to add things to node_instances before node_infos, then there's a deadlock.

If all related state – here, everything that's indexed by NodeId – is under the same lock, then there's no trap for unwary maintainers in the future.

Collaborator Author:

But the point of this PR is exactly to decouple those two fields, so that they can be updated independently (no need to ever hold a lock over both). I have added a helper method add_node_instance to make things clearer, and added more comments. PTAL.

@@ -29,6 +29,10 @@ mod wasm;
/// Nodes must not do any work until the [`Node::start`] method is invoked.
pub trait Node: Send + Sync {
/// Starts executing the node.
///
/// This method may block, but node implementations should make sure that they create separate
/// background threads to execute actual work, and wait on them to terminate when [`Node::stop`]
Contributor:

I didn't really understand this comment. Why would start() block if the work has to be done in a separate thread?

Collaborator Author:

You are right, that was quite confusing, I hope I clarified now, PTAL.

@@ -223,12 +220,12 @@ impl Runtime {
return Ok(());
}

-let nodes = self.nodes.read().unwrap();
+let node_infos = self.node_infos.read().unwrap();
Contributor:

(pre-existing) This lock scope extends to the end of the method, which is needed because of the use of the NodeInfo that's owned by the node_infos field. I wonder if a with_node_info(fn) accessor might make this pattern clearer/safer?

Collaborator Author:

What would we provide as a function in that case? If you are talking just about extracting tracked_handles, I don't think that would improve things: we would still either return a reference or have to clone it. I am probably missing your point though; could you clarify?

Contributor:

I was thinking about something analogous to the ChannelMapping::with_channel method; the function would just be the code in the rest of the block, but with the scope of lock & reference ownership clearer:

    fn with_node_info<U, F: FnOnce(&NodeInfo) -> U>(&self, node_id: NodeId, f: F) -> U {
        let node_infos = self.node_infos.read().unwrap();
        let node_info = node_infos
            .get(&node_id)
            .expect("Invalid node_id passed into with_node_info");
        f(node_info)
    }

    /// Validate the [`NodeId`] has access to [`Handle`], returning `Err(OakStatus::ErrBadHandle)`
    /// if access is not allowed.
    fn validate_handle_access(&self, node_id: NodeId, handle: Handle) -> Result<(), OakStatus> {
        // Allow RUNTIME_NODE_ID access to all handles.
        if node_id == RUNTIME_NODE_ID {
            return Ok(());
        }

        self.with_node_info(node_id, |node_info| {
            let tracked_handles = node_info.handles.lock().unwrap();

            // Check the handle exists in the handles associated with a node, otherwise
            // return ErrBadHandle.
            if tracked_handles.contains(&handle) {
                Ok(())
            } else {
                error!(
                    "validate_handle_access: handle {:?} not found in node {:?}",
                    handle, node_id
                );
                Err(OakStatus::ErrBadHandle)
            }
        })
    }

Probably for another day though (maybe under #779)

Collaborator Author:

This seems like it is just re-implementing lexical scoping? I agree with the sentiment, but I think we should instead be more conscious of how we scope variables (especially lock guards), possibly by adding explicit anonymous blocks to delimit scopes (with appropriate comments). But happy to keep discussing separately.

@blaxill blaxill (Contributor), Mar 31, 2020:

@daviddrysdale what do you think about going even further and turning even e.g. let tracked_handles = node_info.handles.lock().unwrap(); into closure-passing style (e.g. node_info.with_handles(|handles| { handles.contains(handle) }))? Some consistency around locking patterns might be nice, but I'm not sure if it's worth going this far.

I'm also interested if scoping locks under lifetimes could be more ergonomic, something like:

    pub fn with_all_channels<'a>(
        &'a self,
    ) -> std::sync::RwLockReadGuard<'a, HashMap<ChannelId, Channel>> {
        self.channels.read().unwrap()
    }

    // this would need something like MappedRwLockReadGuard from lock_api: https://docs.rs/lock_api/0.3.3/lock_api/struct.MappedRwLockReadGuard.html
    pub fn with_channel<'a>(
        &'a self,
        channel_id: ChannelId,
    ) -> MappedRwLockReadGuard<'a, Result<Channel, OakStatus>> {
        self.channels.read().unwrap().map(|channels| {
            channels.get(&channel_id).ok_or(OakStatus::ErrBadHandle)
        })
    }

Although this would reintroduce lock scopes being overly large if not careful. (edit: lexical scoping)

self.track_handles_in_node(node_reference, initial_handles);

// Try to start the node instance, and return any errors.
let result = node_instance.start();
Contributor:

The general advice for writing clean code is that concurrency is a separate concern and usually deserves to be handled in a separate class (a dedicated data structure). I am not sure whether this advice is applicable to Rust code, but in any other programming language I've used, I'd write a wrapper class to hold node_infos and node_instances with high-level methods for get, insert, remove, etc. All the concurrency logic (including any necessary syncing between node_infos and node_instances) would be handled in that class. The only problem I currently see with this approach is if node_instance.start() must absolutely be called in between updating node_infos and updating node_instances. Is that the case?

Collaborator Author:

The main point of this PR is that the node instance is only started once the relevant data structures (NodeInfo in this case) are available to the runtime. I agree with the desire to encapsulate any synchronization in high-level accessors, but I'm not entirely sure how to go about it here; perhaps let's discuss separately? If you have an idea in mind and would like to prototype it in a PR, that would be great!

Contributor:

Sure. Let's discuss it separately. What I am thinking of is just a bit of refactoring that can be done separately, after you are done with this PR. And I think it might address some of the comments from @daviddrysdale too (but I am not a true Rustacean yet, so I could be wrong).

let result = node_instance.start();

// Regardless of the result of `Node::start`, insert the now running instance to the list of
// running instances, so thatt `Node::stop` will be called on it eventually.
Contributor:

Suggested change:
-// running instances, so thatt `Node::stop` will be called on it eventually.
+// running instances, so that `Node::stop` will be called on it eventually.

Collaborator Author:

Done.

@tiziano88 tiziano88 (Collaborator, Author) left a comment:

Thanks for the comments!


@daviddrysdale daviddrysdale (Contributor) left a comment:

No need for the discussions on lock scoping and granularity to hold this up.

@tiziano88 tiziano88 merged commit d742461 into project-oak:master Mar 31, 2020
@tiziano88 tiziano88 deleted the tzn_create_node_run branch March 31, 2020 18:45
Successfully merging this pull request may close these issues.

race condition when starting newly created nodes
5 participants