
Split node initialization and start #789

Merged
merged 1 commit into project-oak:master on Mar 31, 2020

Conversation

tiziano88 (Collaborator):

This also allows writing tests for the newly created
`node_start_instance` function, extracted from `node_create`.

Fixes #788

Checklist

  • Pull request affects core Oak functionality (e.g. runtime, SDK, ABI)
    • I have written tests that cover the code changes.
  • Pull request includes prototype/experimental work that is under
    construction.

@blaxill blaxill (Contributor) left a comment:

LGTM. Out of curiosity, is the ABI node_create gaining the ability to start with multiple initial handles, or are the multiple initial handles in node_start_instance going to be internal only?

@tiziano88 (Collaborator, Author):

> LGTM. Out of curiosity, is the ABI node_create gaining the ability to start with multiple initial handles, or are the multiple initial handles in node_start_instance going to be internal only?

Interesting point: I was only thinking of it as internal only, but it may actually make sense to expose it as part of node_create too! Let's look into that separately.

node_infos: RwLock<HashMap<NodeId, NodeInfo>>,

/// Currently running node instances, so that [`Runtime::stop`] can terminate all of them.
node_instances: Mutex<HashMap<NodeId, Box<dyn crate::node::Node>>>,
Contributor:

These two fields look like they should be covered by the same Mutex.

(In C++ this would probably be a base class that holds data (≈ NodeInfo), but which still has pure virtual functions to be overridden (≈ node::Node). Not completely sure what the Rust equivalent would be – maybe to extend the node::Node trait to cover behaviour encompassed by what's in NodeInfo, and then each implementor of the trait has-a NodeInfo for that.)

Collaborator Author:

I am not sure I see the problem with the current split, do you have an example of something that may go wrong in this case? (perhaps I'll even add a test to make sure it does not happen, if it's doable)

Anyway, I suspect what you are suggesting is actually similar to the previous version, in which Node (now NodeInfo) had an instance field itself; but to me, separating them this way makes Runtime::stop cleaner (assuming it's not incorrect).

Contributor:

Having small granular locks for things that may need to be updated in a synchronized fashion implies a locking hierarchy, and there's nothing here that documents such a locking hierarchy.

The obvious example is if there's (say) a current chunk of code that adds a new Node, and which does so by first adding to node_infos then adds to node_instances. If someone later adds different code that happens to add things to node_instances before node_infos, then there's a deadlock.

If all related state – here, everything that's indexed by NodeId – is under the same lock, then there's no trap for unwary maintainers in the future.

Collaborator Author:

But the point of this PR is exactly to decouple those two fields, so that they can be updated independently (no need to ever hold a lock over both). I have added a helper method add_node_instance to make things clearer, and added more comments. PTAL.

@@ -29,6 +29,10 @@ mod wasm;
/// Nodes must not do any work until the [`Node::start`] method is invoked.
pub trait Node: Send + Sync {
/// Starts executing the node.
///
/// This method may block, but node implementations should make sure that they create separate
/// background threads to execute actual work, and wait on them to terminate when [`Node::stop`]
Contributor:

I didn't really understand this comment. Why would start() block if the work has to be done in a separate thread?

Collaborator Author:

You are right, that was quite confusing, I hope I clarified now, PTAL.

@@ -223,12 +220,12 @@ impl Runtime {
return Ok(());
}

-let nodes = self.nodes.read().unwrap();
+let node_infos = self.node_infos.read().unwrap();
Contributor:

(pre-existing) This lock scope extends to the end of the method, which is needed because of the use of the NodeInfo that's owned by the node_infos field. I wonder if a with_node_info(fn) accessor might make this pattern clearer/safer?

Collaborator Author:

What would we provide as a function in that case? If you are talking just about extracting tracked_handles, I don't think that would improve things: we would still either return a reference or have to clone it. I am probably missing your point though; could you clarify?

Contributor:

I was thinking about something analogous to the ChannelMapping::with_channel method; the function would just be the code in the rest of the block, but with the scope of lock & reference ownership clearer:

    fn with_node_info<U, F: FnOnce(&NodeInfo) -> U>(&self, node_id: NodeId, f: F) -> U {
        let node_infos = self.node_infos.read().unwrap();
        let node_info = node_infos
            .get(&node_id)
            .expect("Invalid node_id passed into with_node_info");
        f(node_info)
    }

    /// Validate the [`NodeId`] has access to [`Handle`], returning `Err(OakStatus::ErrBadHandle)`
    /// if access is not allowed.
    fn validate_handle_access(&self, node_id: NodeId, handle: Handle) -> Result<(), OakStatus> {
        // Allow RUNTIME_NODE_ID access to all handles.
        if node_id == RUNTIME_NODE_ID {
            return Ok(());
        }

        self.with_node_info(node_id, |node_info| {
            let tracked_handles = node_info.handles.lock().unwrap();

            // Check the handle exists in the handles associated with a node, otherwise
            // return ErrBadHandle.
            if tracked_handles.contains(&handle) {
                Ok(())
            } else {
                error!(
                    "validate_handle_access: handle {:?} not found in node {:?}",
                    handle, node_id
                );
                Err(OakStatus::ErrBadHandle)
            }
        })
    }

Probably for another day though (maybe under #779)

Collaborator Author:

This seems like it is just re-implementing lexical scoping? I agree with the sentiment, but I think we should instead be more conscious of how we scope variables (especially lock guards), possibly by adding explicit anonymous blocks to delimit scopes (with appropriate comments). But happy to keep discussing separately.

@blaxill blaxill (Contributor), Mar 31, 2020:

@daviddrysdale what do you think about going even further and turning even e.g. let tracked_handles = node_info.handles.lock().unwrap(); into closure-passing style (e.g. node_info.with_handles(|handles| { handles.contains(handle) }))? Some consistency around locking patterns might be nice, but I'm not sure if it's worth going this far.

I'm also interested if scoping locks under lifetimes could be more ergonomic, something like:

    pub fn with_all_channels<'a>(
        &'a self,
    ) -> std::sync::RwLockReadGuard<'a, HashMap<ChannelId, Channel>> {
        self.channels.read().unwrap()
    }

    // this would need something like MappedRwLockReadGuard from lock_api: https://docs.rs/lock_api/0.3.3/lock_api/struct.MappedRwLockReadGuard.html
    pub fn with_channel<'a>(
        &'a self,
        channel_id: ChannelId,
    ) -> MappedRwLockReadGuard<'a, Result<Channel, OakStatus>> {
        self.channels.read().unwrap().map(|channels| {
            channels.get(&channel_id).ok_or(OakStatus::ErrBadHandle)
        })
    }

Although this would reintroduce lock scopes being overly large if not careful. (edit: lexical scoping)

self.track_handles_in_node(node_reference, initial_handles);

// Try to start the node instance, and return any errors.
let result = node_instance.start();
Contributor:

The general advice for writing clean code is that concurrency is a separate concern and usually deserves to be handled in a separate class (a dedicated data structure). I am not sure whether this advice is applicable to Rust code, but in any other programming language I've used, I'd write a wrapper class to hold node_infos and node_instances with high-level methods for get, insert, remove, etc. All the concurrency logic (including any necessary syncing between node_infos and node_instances) would be handled in that class. The only problem I currently see with this approach is if node_instance.start() must absolutely be called in between updating node_infos and updating node_instances. Is that the case?

Collaborator Author:

The main point of this PR is that the node instance is only started once the relevant data structures (NodeInfo in this case) are available to the runtime. I agree with the desire to encapsulate any synchronization in high-level accessors, but I'm not entirely sure how to go about it here; perhaps let's discuss separately? If you have an idea in mind and would like to prototype it in a PR, that would be great!

Contributor:

Sure. Let's discuss it separately. What I am thinking of is just a bit of refactoring that can be done separately, after you are done with this PR. And I think it might address some of the comments from @daviddrysdale too (but I am not a true Rustacean yet, so I could be wrong).

let result = node_instance.start();

// Regardless of the result of `Node::start`, insert the now running instance to the list of
// running instances, so thatt `Node::stop` will be called on it eventually.
Contributor:

Suggested change:
-// running instances, so thatt `Node::stop` will be called on it eventually.
+// running instances, so that `Node::stop` will be called on it eventually.

Collaborator Author:

Done.

@tiziano88 tiziano88 (Collaborator, Author) left a comment:

Thanks for the comments!


@daviddrysdale daviddrysdale (Contributor) left a comment:

No need for the discussions on lock scoping and granularity to hold this up.

@tiziano88 tiziano88 merged commit d742461 into project-oak:master Mar 31, 2020
@tiziano88 tiziano88 deleted the tzn_create_node_run branch March 31, 2020 18:45
Successfully merging this pull request may close these issues.

race condition when starting newly created nodes
5 participants