wasi-io: Reimplement wasi-io/poll using a Pollable trait #7812

Open
wants to merge 29 commits into
base: main

Conversation

badeend
Contributor

@badeend badeend commented Jan 24, 2024

Prior discussion: https://bytecodealliance.zulipchat.com/#narrow/stream/217126-wasmtime/topic/Change.20Subscribe.20trait

  • Renamed the existing Pollable struct to PollableResource
  • Reimplemented wasi-io/poll. This introduces a new Pollable trait which is lower level, doesn't require heap allocations to poll, has mutable access to the WasiView, and can be used as a standalone resource without a parent. The Subscribe trait is kept intact, but this is now a utility interface, implemented in terms of Pollable.
  • Eliminate the (now) unnecessary surrogate parent resource of clock pollables
  • Added ResourceTable take & restore as a general purpose replacement for iter_entries. That one was used only by the old poll implementation.

Additionally:

… Preview2 resources. Removed the _mut suffixes to align with WasiHttpView.
…er they were preopened or opened using open_at. This fixes build errors regarding overlapping mutable lifetimes introduced in the previous commit.
…is lower level, doesn't require heap allocations to poll, has mutable access to the WasiView, and can be used as a standalone resource without a parent. The Subscribe trait is kept intact, but this is now a utility interface, implemented in terms of Pollable.
…` implementation. And its is now superseded by take&restore
@badeend badeend requested review from a team as code owners January 24, 2024 20:15
@badeend badeend requested review from fitzgen and removed request for a team January 24, 2024 20:15
@badeend badeend mentioned this pull request Jan 24, 2024
@github-actions github-actions bot added wasi Issues pertaining to WASI wasmtime:api Related to the API of the `wasmtime` crate itself labels Jan 24, 2024

Subscribe to Label Action

cc @peterhuene

This issue or pull request has been labeled: "wasi", "wasmtime:api"

Thus the following users have been cc'd because of the following labels:

  • peterhuene: wasmtime:api

To subscribe or unsubscribe from this label, edit the .github/subscribe-to-label.json configuration file.

Learn more.

@fitzgen fitzgen requested review from sunfishcode and removed request for fitzgen January 24, 2024 22:01
@pchickey pchickey self-requested a review January 24, 2024 22:51
Contributor

@pchickey pchickey left a comment

Thanks, this is excellent work. The new internal interface is definitely superior to the old one, and I appreciate the new tests as well. The lease system for resource table is a much better interface than iter_entries.

I believe that all of the Slot and SlotIdentity implementation makes sense and is correct (especially because, surprisingly to me, the only unsafe is in unsafe impl Send), but I wanted to tag @alexcrichton to double-check that part because I do not feel super confident in my ability to assess that code. If he is happy with that, this can land.

/// value, this function traps.
/// This function traps if either:
/// - the list is empty, or:
/// - the list contains more elements than can be indexed with a `u32` value.
Contributor

I agree we need this change in the wits, let's just be sure to upstream these to the spec repo as well. We will come up with some process for how we keep the docs evolving and improving while assuring that the interface itself doesn't change.

Contributor Author

Ok(table.push_child(pollable, &resource)?)
}

/// A host representation of the `wasi:io/poll.pollable` resource.
pub struct PollableResource {
Contributor

This is a very bikesheddy suggestion, so feel free to disregard it, but is BoxPollable a better name for this? That way Resource<PollableResource> isn't repeating the word.

Contributor Author

The design has changed in the meantime. Now it's not just a box anymore, so I went with PollableHandle.

@pchickey pchickey requested review from alexcrichton and removed request for sunfishcode January 25, 2024 00:08
Member

@alexcrichton alexcrichton left a comment

Thanks for this! I've left a few comments but it's getting a bit later here so I'm going to head out. I want to read more about the Lease<T> though as that looks quite subtle and I need to think more about it.

Comment on lines 40 to 44
impl<T: Subscribe> Pollable for T {
    fn poll_ready(&mut self, cx: &mut Context<'_>, _view: &mut dyn WasiView) -> Poll<()> {
        self.ready().as_mut().poll(cx)
    }
}
Member

This I think is actually subtly incorrect because it drops the future after the call to poll which signals that the future should be cancelled rather than keeping it alive until the whole poll is done. I think that means that we may not be guaranteed to get wakeups from cancelled futures, although we might get those for now given how the code is currently constructed. I think that this'll need to be a bit fancier with the adapter instead of having a blanket impl
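To illustrate the cancellation concern, here is a minimal standalone sketch (not code from this PR). `PendingIo` and `poll_once_then_drop` are hypothetical names; the future only records when it is dropped. Building a fresh future, polling it once, and letting it go out of scope drops it while it is still pending, which is how Rust futures observe cancellation.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// A no-op waker so the example can poll by hand without a runtime.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Hypothetical stand-in for a pending I/O operation. It records when it is
/// dropped, which is the only cancellation signal a future receives.
struct PendingIo {
    cancelled: Arc<AtomicBool>,
}

impl Future for PendingIo {
    type Output = ();

    fn poll(self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<()> {
        // Never ready. A real implementation would register `_cx.waker()` here.
        Poll::Pending
    }
}

impl Drop for PendingIo {
    fn drop(&mut self) {
        self.cancelled.store(true, Ordering::SeqCst);
    }
}

/// Mimics the shape of the blanket impl: build a fresh future, poll it once,
/// drop it. Returns whether the future had been cancelled afterwards.
pub fn poll_once_then_drop() -> bool {
    let cancelled = Arc::new(AtomicBool::new(false));
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);

    {
        let mut fut: Pin<Box<dyn Future<Output = ()>>> = Box::pin(PendingIo {
            cancelled: cancelled.clone(),
        });
        assert!(fut.as_mut().poll(&mut cx).is_pending());
    } // `fut` is dropped here while still pending.

    cancelled.load(Ordering::SeqCst)
}
```

Calling `poll_once_then_drop()` reports that the future was dropped before it completed, so any waker registered during that single poll belongs to a future that no longer exists.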

Contributor Author

The tests were failing precisely because of that. To fix it, I ended up roughly back at the Subscribe design, but this time returning a custom WasiFuture type that has an additional WasiView parameter on its poll method. To prevent scope creep of this PR, I kept Subscribe alive for now (as PollableAsync). But in the long run, I don't think there's much value in having them both, as Subscribe can be trivially converted to Pollable from:

#[async_trait::async_trait]
impl PollableAsync for HostFutureIncomingResponse {
    async fn ready(&mut self) {
        if let Self::Pending(handle) = self {
            *self = Self::Ready(handle.await);
        }
    }
}

to:

impl Pollable for HostFutureIncomingResponse {
    fn ready<'a>(&'a mut self) -> Pin<Box<dyn WasiFuture<Output = ()> + Send + 'a>> {
        Box::pin(async {
            if let Self::Pending(handle) = self {
                *self = Self::Ready(handle.await);
            }
        })
    }
}

One obvious downside is the added visual noise.

crates/wasi/tests/all/sync.rs (outdated, resolved)
crates/wasi/src/preview2/poll.rs (outdated, resolved)
Member

@alexcrichton alexcrichton left a comment

Ok I've gotten a chance now to take a closer look at Lease<T> and the changes to ResourceTable. Given this new API design for the pollable trait something along these lines is required (e.g. iter_children can't work any more). The specific implementation here I think has a drawback where it's very "panicky" if you get it wrong. Most other aspects of WASI are "error-y" in that they try to return traps if anything is gotten wrong. This I think is pretty important for not accidentally becoming a DoS vector for embeddings. For example a panicking Drop implementation means that an early-return in an embedder function might accidentally take down the whole process where a wasm trap would only take down a single instance.

Given that, my main thought on this is that this should ideally not ever panic and instead should switch to returning errors where possible. I might also recommend going a little bit further perhaps with a scheme such as:

  • Leave TableEntry::entry as Box<dyn Any>.
  • Change take to returning Box<T>. This would replace entry with something like Box::new(Tombstone) which is a private type to this module.
  • Change restore to taking Box<T> and Resource<T>.

That way it's largely up to embedders to "get everything right" but they'd already be required to do so with this current API. Additionally any failed downcasts can additionally add a check for Tombstone to perhaps return a more precise error other than ResourceTableError::WrongType with a new variant such as TakenValue.
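The scheme above could look roughly like the following sketch. It is a simplified stand-in, not wasmtime's actual `ResourceTable`: entries are addressed by a plain `usize` instead of `Resource<T>`, there is no parent/child tracking, and the `SlotOccupied` variant is a hypothetical addition for restoring into a slot that was never taken.

```rust
use std::any::Any;

/// Private marker left in a slot while its value is taken out.
struct Tombstone;

#[derive(Debug, PartialEq)]
pub enum TableError {
    NotPresent,
    WrongType,
    /// The entry exists but is currently taken.
    TakenValue,
    /// Hypothetical: `restore` was called on a slot that still holds a value.
    SlotOccupied,
}

/// Simplified table: one `Box<dyn Any + Send>` per index.
#[derive(Default)]
pub struct Table {
    entries: Vec<Box<dyn Any + Send>>,
}

impl Table {
    pub fn push<T: Any + Send>(&mut self, value: T) -> usize {
        self.entries.push(Box::new(value));
        self.entries.len() - 1
    }

    /// Moves the value out and leaves a tombstone. Returns errors, never panics.
    pub fn take<T: Any + Send>(&mut self, index: usize) -> Result<Box<T>, TableError> {
        let slot = self.entries.get_mut(index).ok_or(TableError::NotPresent)?;
        if slot.is::<Tombstone>() {
            return Err(TableError::TakenValue);
        }
        if !slot.is::<T>() {
            return Err(TableError::WrongType);
        }
        let tombstone: Box<dyn Any + Send> = Box::new(Tombstone);
        let taken = std::mem::replace(slot, tombstone);
        taken.downcast::<T>().map_err(|_| TableError::WrongType)
    }

    /// Puts a previously taken value back into its slot.
    pub fn restore<T: Any + Send>(
        &mut self,
        index: usize,
        value: Box<T>,
    ) -> Result<(), TableError> {
        let slot = self.entries.get_mut(index).ok_or(TableError::NotPresent)?;
        if !slot.is::<Tombstone>() {
            return Err(TableError::SlotOccupied);
        }
        let boxed: Box<dyn Any + Send> = value;
        *slot = boxed;
        Ok(())
    }
}
```

In this sketch a second `take` on the same index returns `TableError::TakenValue` rather than `WrongType`, and forgetting to `restore` leaves a tombstone behind instead of panicking in a `Drop` implementation.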

@badeend
Contributor Author

badeend commented Jan 26, 2024

Thanks for the feedback.
I agree on the "panicky" point. I'll add an error type and remove the panics.

One thing that's not in this PR, but I assume will most likely be added at some point, are untyped take_any and restore_any variants. The drawback of reverting the Lease & SlotIdentity design is that the restore(_any) API becomes (even) easier to misuse, because the consumer can then restore any value at the index of a previously differently typed entry. I'm worried about the developer experience of this, as the corruption would happen silently and the place where they encounter the WrongType errors could be miles away from where the problem actually is.
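A small sketch of that failure mode, using a single untyped slot rather than the real table (all names here are hypothetical): the mismatched restore is accepted without complaint, and the problem only shows up on a later typed lookup.

```rust
use std::any::Any;

/// Simulates an untyped take/restore pair on a single slot.
/// Returns `(restore_accepted, later_typed_lookup_ok)`.
pub fn silent_type_swap() -> (bool, bool) {
    let original: Box<dyn Any> = Box::new(String::from("tcp-socket"));
    let mut slot: Option<Box<dyn Any>> = Some(original);

    // Hypothetical `take_any`: the caller removes the entry...
    let taken = slot.take();
    let was_string = taken.map_or(false, |b| b.is::<String>());

    // ...and a buggy caller "restores" a value of a different type.
    // Nothing here can notice the mismatch, so the restore is accepted.
    let wrong: Box<dyn Any> = Box::new(7u32);
    slot = Some(wrong);
    let restore_accepted = was_string && slot.is_some();

    // Much later, unrelated code expects the original type and fails.
    let later_typed_lookup_ok = slot.map_or(false, |b| b.is::<String>());
    (restore_accepted, later_typed_lookup_ok)
}
```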

Anyway, I'm fine with your suggestions. I just wanted to make sure the trade-offs are known.

@alexcrichton
Member

Hm ok, for that I think you have a good point about accidentally messing up these APIs. I think this may still be surmountable perhaps with some trickery, but I'd also need to see the usage of take_any and restore to know better. Want to discuss on a future PR with that implemented or hash it out here? (I'm fine either way)

@badeend
Contributor Author

badeend commented Feb 1, 2024

I chose to go with a hybrid approach. For the public API, I changed it to what you suggested. Internally, I removed Lease & SlotIdentity. But I kept Slot to perform the resource type check. Also, as part of the updated design (see above) I needed take_any and restore_any so I included those as well.

@alexcrichton
Member

Reading over this and seeing how this all turned out, I'm personally starting to get second thoughts on this. We're effectively reimplementing our own Future trait and as I'm sure you've seen we start implementing our own primitive functions (e.g. poll_fn) as well as we can't use standard things like async fn or #[async_trait]. I'm a bit worried that the direction this is taking us is straying off the path of maintainability for async support as things get more advanced over time.

Now that's all easy to say but this PR is still solving a concrete problem which is letting implementations access resources while polling, so I don't think simply closing this PR is an option. That being said after having read over this I wonder if there's perhaps an alternative implementation route that we can take.

Originally when designing Future-the-trait we ran into this issue of situations wanting to pass more context along through the poll method but the context doesn't survive longer than a single call to poll. To do that we ended up creating task-local variables which are like thread locals but instead stick with a task. That doesn't solve the immediate problem at hand though since you want mutable access, not just readable access.

To solve the mutability problem I realized that the take/restore bits look like Option<T> and so they've already got runtime state associated with them. One alternative would be to use a RefCell<T> instead and effectively repurpose that runtime state. That would enable acquiring &mut T from &ResourceTable so long as it's done "correctly" which is basically already the situation we have today (make sure you restore after you take).

How would you feel about something like that? We could still preserve get_mut as a method which has no runtime overhead (apart from storage space) but I'm imagining that a borrow_mut() method would be added. I'll note that get would have to go away in this world and be replaced with borrow() as a consequence, which likely affects code we have today.
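As a rough sketch of the RefCell idea (a simplified stand-in, not wasmtime's `ResourceTable`; it uses plain `usize` indices and a hypothetical `AlreadyBorrowed` error variant), `get_mut` keeps working through `&mut self` with no runtime check, while `borrow_mut` hands out `&mut T` access from `&self` and reports a double borrow as an error. `RefMut::filter_map` needs Rust 1.63 or newer.

```rust
use std::any::Any;
use std::cell::{RefCell, RefMut};

#[derive(Debug, PartialEq)]
pub enum TableError {
    NotPresent,
    WrongType,
    /// Hypothetical variant: the entry is already mutably borrowed.
    AlreadyBorrowed,
}

#[derive(Default)]
pub struct Table {
    entries: Vec<RefCell<Box<dyn Any>>>,
}

impl Table {
    pub fn push<T: Any>(&mut self, value: T) -> usize {
        let boxed: Box<dyn Any> = Box::new(value);
        self.entries.push(RefCell::new(boxed));
        self.entries.len() - 1
    }

    /// No runtime borrow check: `&mut self` already proves exclusive access.
    pub fn get_mut<T: Any>(&mut self, index: usize) -> Result<&mut T, TableError> {
        self.entries
            .get_mut(index)
            .ok_or(TableError::NotPresent)?
            .get_mut()
            .downcast_mut::<T>()
            .ok_or(TableError::WrongType)
    }

    /// Mutable access through `&self`, checked at runtime by the `RefCell`.
    pub fn borrow_mut<T: Any>(&self, index: usize) -> Result<RefMut<'_, T>, TableError> {
        let cell = self.entries.get(index).ok_or(TableError::NotPresent)?;
        let guard = cell
            .try_borrow_mut()
            .map_err(|_| TableError::AlreadyBorrowed)?;
        // Narrow the guard from the boxed `dyn Any` down to the concrete `T`.
        RefMut::filter_map(guard, |b| b.downcast_mut::<T>())
            .map_err(|_| TableError::WrongType)
    }
}
```

Holding one guard from `borrow_mut` and asking for a second on the same index yields `TableError::AlreadyBorrowed`, which plays the same role as forgetting to restore after a take.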

While RefCell is unlikely to win any award for being the most ergonomic thing in the world this feels like it might provide a better tradeoff because we wouldn't fall off the well-trodden-path of Rust async into custom traits and such. I would want to make sure it works for your use case though.

I also realize though that you've probably already put in a great deal of work to this PR with 2 versions now so I'm hesitant to ask for a third. I'd be happy to help sketch this out and do some of the refactoring work to see if I feel like it's going to pay off.

@badeend
Contributor Author

badeend commented Feb 5, 2024

I think you mean changing

async fn ready(&mut self);

to

async fn ready(&mut self, table: &ResourceTable);

right?

That doesn't work because &ResourceTable is not Send as ResourceTable is not Sync.

@alexcrichton
Member

Good point, yes, I'm more-or-less saying we should do that. (either that or use a task-local but I think that still captures &T).

Mind trying to make ResourceTable implement Sync? I think that's probably the addition of a few trait bounds in its internal trait objects. I think everything we put in there is already Sync, although if things aren't currently Sync that'll pose a larger problem.
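A compile-time sketch of why the trait-object bounds matter (the two table types are hypothetical stand-ins, not the real `ResourceTable`): `&T` is `Send` exactly when `T` is `Sync`, so a table of `Box<dyn Any + Send>` cannot be shared by reference across threads, while adding `+ Sync` to the trait object makes it possible.

```rust
use std::any::Any;

fn assert_sync<T: Sync>() {}
fn assert_send<T: Send>() {}

/// Entries are only `Send`, so this table is `Send` but not `Sync`.
#[allow(dead_code)]
pub struct SendOnlyTable {
    entries: Vec<Box<dyn Any + Send>>,
}

/// Entries are `Send + Sync`, so this table is `Sync` as well.
#[allow(dead_code)]
pub struct SyncTable {
    entries: Vec<Box<dyn Any + Send + Sync>>,
}

/// Every line in here is checked by the compiler; nothing happens at runtime.
pub fn compile_time_checks() -> bool {
    assert_send::<SendOnlyTable>();
    assert_sync::<SyncTable>();
    assert_send::<&SyncTable>();
    // Uncommenting the next line fails to compile, because
    // `&SendOnlyTable: Send` would require `SendOnlyTable: Sync`:
    // assert_send::<&SendOnlyTable>();
    true
}
```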

@badeend
Contributor Author

badeend commented Feb 11, 2024

I understand your concerns, yet I'd rather not go for round three right now, which would include reverting #7802. So instead, I've changed the questionable types to be private to the poll.rs module. That way, all the iffy-ness is contained to just a single file that we can iterate on later. From the outside nothing significant has changed, except that now I can use poll_ready_fn, which is what I personally was after.

Hope that's OK for you

@alexcrichton
Member

Wanted to say I have not forgotten about this, I have been looking for time to write up something longer-form, which I hope to get to by tomorrow. Is this blocking anything though that it would be prudent to land now rather than later? If so I think it's good to go as-is, but otherwise I'd like to take some more time to write up longer-form thoughts.

@badeend
Contributor Author

badeend commented Feb 13, 2024

There's no immediate rush from my side, so feel free to take your time.

@alexcrichton
Member

Ok thanks again for your patience here, very much appreciated!

I've gotten some time to think and work on this. I was leaning towards merging this, but then I realized that I'd prefer to avoid a situation where we land this and then later revert most of it towards a different strategy. In that sense I wanted, time permitting, to take a moment and figure out if alternative strategies would work. I'm getting a growing sense of unease with this direction as it's more-or-less a custom Future trait and is something we'd ideally avoid.

So assuming that the main goal of this PR is to get access to ResourceTable during async fn ready I originally suggested the borrow/borrow_mut idea above using RefCell. I tried implementing that and it turns out it doesn't work. With that idea, ResourceTable contains RefCell and async fn ready would close over &ResourceTable (e.g. it's a new function argument). In such a situation it means that the returned future, which must be Send, closes over &ResourceTable. That type is not Send because it requires ResourceTable: Sync which is not satisfied with RefCell. So that cans the idea of using RefCell.

After talking a bit more with @pchickey, however, I'm growing more fond of the idea of using RwLock<T> here instead of RefCell<T>. Not for the actual blocking aspect but instead only for the "it's Sync" aspect. To that end I implemented this on a branch and got tests passing with it. The changes are:

  • Replace ResourceTable::get with ResourceTable::{borrow, borrow_mut} that return RwLock{Read,Write}Guard<T>.
  • Remove table methods returning Any
  • Add &ResourceTable as an argument to the ready async function.
  • Replace most usage of table().get() with table().get_mut() (avoids locks)
  • Use u32 indices in Pollable's make_future internals instead of Any
  • Rewrite headers in wasi-http to avoid needing Any by representing headers as Resource<Resource<hyper::HeaderMap>>

The major consequences of this decision, however, are:

  • ResourceTable::{borrow, borrow_mut} require atomic manipulations. No blocking, but it's atomics for something that's not contended 99.9% of the time.
  • The std::sync::RwLock type cannot be used because std::sync::RwLock{Read,Write}Guard is not Send. I temporarily added a tokio dependency to wasmtime-the-crate and used tokio::sync::RwLock instead. Long-term I would like to avoid a tokio dep in the wasmtime crate.
  • There are a few minor cleanups still to be had in terms of threading a few more errors in a few more places.
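A dependency-free sketch of the "lock only for Sync" idea follows. It deliberately differs from the branch described above: it uses `std::sync::RwLock` with the non-blocking `try_write`, and because std's guards are not `Send` and stable std has no way to narrow a guard to a concrete type, it exposes a closure-based `with_mut` (a hypothetical name) instead of returning a guard. The `Table` type, `usize` indices, and the `AlreadyBorrowed` variant are assumptions as well; a poisoned lock is also reported as `AlreadyBorrowed` here for brevity.

```rust
use std::any::Any;
use std::sync::RwLock;

#[derive(Debug, PartialEq)]
pub enum TableError {
    NotPresent,
    WrongType,
    AlreadyBorrowed,
}

#[derive(Default)]
pub struct Table {
    entries: Vec<RwLock<Box<dyn Any + Send + Sync>>>,
}

fn assert_send<T: Send>() {}

impl Table {
    pub fn push<T: Any + Send + Sync>(&mut self, value: T) -> usize {
        let boxed: Box<dyn Any + Send + Sync> = Box::new(value);
        self.entries.push(RwLock::new(boxed));
        self.entries.len() - 1
    }

    /// Runs `f` with mutable access to entry `index`, reached through `&self`.
    /// Never blocks: a conflicting borrow is reported as an error.
    pub fn with_mut<T: Any, R>(
        &self,
        index: usize,
        f: impl FnOnce(&mut T) -> R,
    ) -> Result<R, TableError> {
        let lock = self.entries.get(index).ok_or(TableError::NotPresent)?;
        let mut guard = lock
            .try_write()
            .map_err(|_| TableError::AlreadyBorrowed)?;
        let value = guard.downcast_mut::<T>().ok_or(TableError::WrongType)?;
        Ok(f(value))
    }
}

/// Shows the "cannot borrow mutably twice" footgun surfacing as an error.
pub fn demo() -> (Result<(), TableError>, u32) {
    // Unlike a `RefCell`-based table, `&Table` is `Send` (checked at compile time).
    assert_send::<&Table>();

    let mut table = Table::default();
    let idx = table.push(41u32);
    let nested = table
        .with_mut(idx, |n: &mut u32| {
            *n += 1;
            // A second mutable borrow of the same entry fails instead of blocking.
            table.with_mut(idx, |_: &mut u32| ())
        })
        .unwrap();
    let value = table.with_mut(idx, |n: &mut u32| *n).unwrap();
    (nested, value)
}
```

Here `demo()` returns the error from the nested borrow together with the updated value, so the double borrow is observable without a panic or a deadlock.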

Personally I'm inclined to take a route that looks like this, namely threading arguments through async fn rather than threading arguments through fn poll. This is a foundational change to how things work though, especially around a new footgun of not being able to borrow_mut twice. In that sense I'd like to get feedback along the lines of:

  • @pchickey does this all sound reasonable enough to you?
  • @badeend does this still solve your original use case, and if so what do you think about this approach vs the poll approach?

@alexcrichton
Member

also cc @elliottt since you've touched a lot of WASI internals and you probably want to take a look too

Labels
wasi Issues pertaining to WASI wasmtime:api Related to the API of the `wasmtime` crate itself
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Clarify poll with empty list
3 participants