
Suggestions for MLS state machine #303

Merged · 10 commits into nmolnar/mls-readme · Nov 2, 2023

Conversation

richardhuaaa (Contributor) commented on Nov 1, 2023:

I'll try to keep these suggestion PRs small. For this one:

  1. Serialize welcome messages to post_commit_data field on intent rather than separate table
  2. Remove some group sync steps (we can handle conflicts anyway)
  3. EDIT: Drop pending proposals for now
  4. Describe how payloads can be synced in a non-duplicative way

Other things worth considering:

  1. Making some of the steps generic rather than specific to intent type (a lot of it is duplicative/identical)
  2. Whether to make commit processing strictly sequential (commit must be fully done before next one begins) or concurrent. If we make it strictly sequential, we need to think about how to handle concurrent syncs (as the application also needs to sync incoming messages, either via pull or subscription)
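A minimal sketch of suggestion 1: serializing welcome messages into a `post_commit_data` field on the intent row instead of a separate table. All names here (`StoredIntent`, the length-prefixed blob encoding) are illustrative assumptions, not the actual xmtp_mls schema.

```rust
/// One row in a hypothetical `intents` table.
struct StoredIntent {
    id: i64,
    /// Welcome messages serialized as length-prefixed byte blobs, written
    /// when the commit is published and consumed after it is confirmed.
    post_commit_data: Option<Vec<u8>>,
}

/// Pack several welcome payloads into a single blob (u32 length prefixes).
fn serialize_welcomes(welcomes: &[Vec<u8>]) -> Vec<u8> {
    let mut out = Vec::new();
    for w in welcomes {
        out.extend_from_slice(&(w.len() as u32).to_be_bytes());
        out.extend_from_slice(w);
    }
    out
}

/// Unpack the blob back into individual welcome payloads.
fn deserialize_welcomes(blob: &[u8]) -> Vec<Vec<u8>> {
    let mut out = Vec::new();
    let mut i = 0;
    while i + 4 <= blob.len() {
        let len =
            u32::from_be_bytes([blob[i], blob[i + 1], blob[i + 2], blob[i + 3]]) as usize;
        i += 4;
        out.push(blob[i..i + len].to_vec());
        i += len;
    }
    out
}

fn main() {
    let welcomes = vec![b"welcome-alice".to_vec(), b"welcome-bob".to_vec()];
    let intent = StoredIntent {
        id: 1,
        post_commit_data: Some(serialize_welcomes(&welcomes)),
    };
    // Round-trips: the welcomes can be recovered after the commit is confirmed.
    let restored = deserialize_welcomes(intent.post_commit_data.as_ref().unwrap());
    assert_eq!(restored, welcomes);
}
```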

richardhuaaa (Author) commented on Nov 1, 2023:

AFAIK, MLS lets you define the logic for which proposals are accepted or rejected. Another possibility could be to mandate that each client only commits its own proposals. OpenMLS has a `remove_pending_proposal` method that we could use instead of committing pending proposals.
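A sketch of the "only commit your own proposals" idea above. `Group` and its fields are stand-ins, not the real OpenMLS `MlsGroup` API; in OpenMLS the removal would go through `remove_pending_proposal` (exact signature not reproduced here).

```rust
#[derive(Debug, Clone, PartialEq)]
struct PendingProposal {
    proposer: String,
    payload: Vec<u8>,
}

/// Stand-in for an MLS group with a pending-proposal store.
struct Group {
    own_member_id: String,
    pending_proposals: Vec<PendingProposal>,
}

impl Group {
    /// Drop every pending proposal this client did not create, so a later
    /// commit covers only our own proposals.
    fn remove_foreign_proposals(&mut self) {
        let own = self.own_member_id.clone();
        self.pending_proposals.retain(|p| p.proposer == own);
    }
}

fn main() {
    let mut group = Group {
        own_member_id: "alice".to_string(),
        pending_proposals: vec![
            PendingProposal { proposer: "alice".into(), payload: vec![1] },
            PendingProposal { proposer: "bob".into(), payload: vec![2] },
        ],
    };
    group.remove_foreign_proposals();
    // Only alice's own proposal survives to be committed.
    assert_eq!(group.pending_proposals.len(), 1);
    assert_eq!(group.pending_proposals[0].proposer, "alice");
}
```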

Review thread on xmtp_mls/README.md (outdated):
> 1. Sequentially process each payload. For each payload, update the `last_message_timestamp_ns` and apply any corresponding database writes for that payload in a single transaction
> 1. When the sync is complete, release the lock by setting `lock_until_ns` to 0
>
> This flow will be similar regardless of whether the sync happens via a poll-based or subscription-based mechanism. For a subscription-based mechanism, the lock will be obtained at the start of the subscription and extended on a heartbeat interval until the subscription is closed.
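A minimal model of the quoted locking flow, with illustrative field and function names (not the actual xmtp_mls schema): acquire a lease via `lock_until_ns`, advance `last_message_timestamp_ns` per payload, and release by writing 0.

```rust
struct GroupSyncState {
    lock_until_ns: u64,
    last_message_timestamp_ns: u64,
}

struct Payload {
    timestamp_ns: u64,
    data: Vec<u8>,
}

/// Take the sync lock until `now_ns + lease_ns`; fails if another sync
/// still holds an unexpired lock.
fn try_acquire_lock(state: &mut GroupSyncState, now_ns: u64, lease_ns: u64) -> bool {
    if state.lock_until_ns > now_ns {
        return false; // another sync is in progress
    }
    state.lock_until_ns = now_ns + lease_ns;
    true
}

/// Process payloads in order. In the real system, each iteration would be a
/// single DB transaction covering the cursor update and the payload writes.
fn sync(state: &mut GroupSyncState, payloads: &[Payload], now_ns: u64) -> bool {
    if !try_acquire_lock(state, now_ns, 60_000_000_000) {
        return false;
    }
    for p in payloads {
        // ... apply this payload's database writes here ...
        state.last_message_timestamp_ns = p.timestamp_ns;
    }
    state.lock_until_ns = 0; // release
    true
}

fn main() {
    let mut state = GroupSyncState { lock_until_ns: 0, last_message_timestamp_ns: 0 };
    let payloads = vec![
        Payload { timestamp_ns: 100, data: vec![] },
        Payload { timestamp_ns: 200, data: vec![] },
    ];
    assert!(sync(&mut state, &payloads, 1));
    assert_eq!(state.last_message_timestamp_ns, 200);
    assert_eq!(state.lock_until_ns, 0);

    // A concurrent sync attempt while the lock is held fails.
    state.lock_until_ns = 1_000;
    assert!(!sync(&mut state, &payloads, 500));
}
```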
neekolas replied:
In the absence of really robust subscriptions (which I think is going to be quite hard to implement for V1) my recommendation would be to not adjust any locks on the subscription and allow state syncs to happen in parallel.

richardhuaaa (Author) replied:

If subscriptions are not robust, I feel like it causes a lot of downstream problems.

  1. If a commit message is lost, subsequent payloads are unreadable. You can trigger a pull at this point, but it's added complexity.
  2. If we don't lock syncs while subscriptions are in progress, then we have concurrent processing of the same payloads, meaning one side or the other will fail. It becomes unclear whether failures are due to bad payloads/missing epoch messages, or due to parallel processing.
  3. It's not the best dev experience for consumers of the SDK.

I have some suggestions:

  1. We could make subscriptions more robust. We could pass the `last_synced_payload_timestamp` to the server when initiating subscriptions. Alternatively, we could have the server record the previous payload timestamp on each payload, so that the client can detect when it is missing payloads and re-sync.
  2. We could omit subscriptions from the initial MLS milestones and fix them up in later milestones.

neekolas replied on Nov 2, 2023:

> We could omit subscriptions from the initial MLS milestones and fix them up in later milestones.

We can talk to developers and see how much of a deal-breaker that would be. They'd also have to be OK with no push notifications. Together, that sounds like a tough sell, but I could be wrong.

> We could make subscriptions more robust. We could pass the `last_synced_payload_timestamp` to the server when initiating subscriptions. Alternatively, we could have the server record the previous payload timestamp on each payload, so that the client can detect when it is missing payloads and re-sync.

Even if we had the most robust streaming APIs, we would still need to somehow deal with push notifications, which are even less reliable and have less flexibility. The `last_synced_payload_timestamp` isn't really an option for push. Having the server record the previous payload timestamp on each payload would be possible in our `StreamAllMessages` API, which powers push. It would require some changes to our back-end to include it in the message that gets gossiped in Waku. It also requires a lookup on each publish, which is probably an acceptable performance cost.

All options feel like some variation of: "some streamed messages are just a reminder to query the topic and sync from the last timestamp of our contiguous message history". There are degrees of this.

  1. Every subscription message or push notification just tells the client to sync via the query API. We don't even look at the payloads.
  2. Application messages can be processed directly from a stream/push notification if you are in the correct epoch. We inspect the payload, and if it is a Commit/Proposal we abort processing the streamed message (and roll back the database changes that delete the decryption keys) and sync from our last known timestamp via the query API.
  3. Any streamed message or push notification where the `previous_payload_timestamp` lines up with our local state gets processed immediately. If we haven't processed the message with that `previous_payload_timestamp`, we do a sync from the network starting at the last synced payload timestamp.

The third option likely gives us the fewest queries to the network, depending on the percentage of out-of-order messages received via streaming. But it is a bit of a pain to roll out: we'd need to update not just our nodes and Envelope proto schema but also the example-notification-server, and anyone using MLS would have to update their notification-receiving code to handle the new field. Maybe it's worth the cost, but we should take the cost into consideration.
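The third option above can be sketched as a small classifier on incoming streamed payloads. Field names like `previous_payload_timestamp_ns` are illustrative assumptions, not an existing Envelope schema.

```rust
#[derive(Debug, PartialEq)]
enum StreamAction {
    ProcessNow,
    SyncFromLastKnown,
}

struct StreamedPayload {
    timestamp_ns: u64,
    /// Timestamp of the payload immediately before this one on the topic.
    previous_payload_timestamp_ns: u64,
}

/// Process immediately only when the payload's predecessor matches local
/// state; otherwise the client has a gap and must sync via the query API.
fn classify(last_synced_ns: u64, payload: &StreamedPayload) -> StreamAction {
    if payload.previous_payload_timestamp_ns == last_synced_ns {
        StreamAction::ProcessNow
    } else {
        StreamAction::SyncFromLastKnown
    }
}

fn main() {
    // Contiguous: the previous payload is exactly what we last synced.
    let p1 = StreamedPayload { timestamp_ns: 200, previous_payload_timestamp_ns: 100 };
    assert_eq!(classify(100, &p1), StreamAction::ProcessNow);

    // Gap detected: the payload before this one was never seen locally.
    let p2 = StreamedPayload { timestamp_ns: 400, previous_payload_timestamp_ns: 300 };
    assert_eq!(classify(200, &p2), StreamAction::SyncFromLastKnown);
}
```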

Contributor replied:

I do like the third option. I hadn't considered adding the `previous_payload_timestamp` to all our streamed envelopes.

Given that all options look like option 1 in some cases, I wonder if we start with that to unblock everything and give us streaming methods quite easily. Then we can layer on option 2 or 3 depending on how we feel about the scope of changes.

richardhuaaa (Author) replied:

That's a really good framing, thanks for writing it up. Okay, this makes sense to me!

Review thread on xmtp_mls/README.md (outdated, resolved).
neekolas commented on Nov 2, 2023:

> OpenMLS has a `remove_pending_proposal` method that we could use instead of committing pending proposals.

That would allow us to check the box of "supporting proposals" in the short term, without actually having to support them. We'd have to think through what it would look like to change this later. Maybe having a single member of the group on a newer version that didn't remove the proposals, but instead created a commit, would be fine. Would just require a bit of investigation.

neekolas commented on Nov 2, 2023:

> Whether to make commit processing strictly sequential (commit must be fully done before next one begins) or concurrent. If we make it strictly sequential, we need to think about how to handle concurrent syncs (as the application also needs to sync incoming messages, either via pull or subscription)

The simplest option is to always make processing sequential; it gives us the same guarantees no matter where the message came from. Push notifications and subscriptions that contain a commit message would just trigger a sync using the Query API, which would ensure all previous commits were applied first. Application messages received via push or subscription are probably safe to attempt processing immediately (although they may fail if you had somehow missed a prior commit). Failed processing could be remedied with a sync-by-query.

It feels like anything else breaks the total ordering assumption baked into MLS.

We are going to need to design a solution to the total ordering problem for decentralization, but we've already decided to table that for a later iteration of the protocol.
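The sequential rule above can be sketched as a small dispatcher: commits always fall back to a sync-by-query, application messages may be tried immediately with a sync as the failure remedy. The types and outcomes are illustrative stand-ins, not the xmtp_mls API.

```rust
enum Payload {
    Commit(Vec<u8>),
    Application(Vec<u8>),
}

#[derive(Debug, PartialEq)]
enum Outcome {
    Processed,
    SyncTriggered,
}

/// Stand-in for MLS decryption, which can fail if a prior commit was missed.
fn try_process_application(_msg: &[u8], in_current_epoch: bool) -> Result<(), ()> {
    if in_current_epoch { Ok(()) } else { Err(()) }
}

fn handle_streamed(payload: &Payload, in_current_epoch: bool) -> Outcome {
    match payload {
        // Commits are never applied out of band: sync-by-query guarantees
        // all earlier commits are applied first, preserving total ordering.
        Payload::Commit(_) => Outcome::SyncTriggered,
        Payload::Application(msg) => match try_process_application(msg, in_current_epoch) {
            Ok(()) => Outcome::Processed,
            // Failed decryption (e.g. a missed commit) is remedied by a sync.
            Err(()) => Outcome::SyncTriggered,
        },
    }
}

fn main() {
    assert_eq!(handle_streamed(&Payload::Commit(vec![]), true), Outcome::SyncTriggered);
    assert_eq!(handle_streamed(&Payload::Application(vec![1]), true), Outcome::Processed);
    assert_eq!(handle_streamed(&Payload::Application(vec![1]), false), Outcome::SyncTriggered);
}
```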

richardhuaaa commented on Nov 2, 2023:

> The simplest option is to always make processing sequential; it gives us the same guarantees no matter where the message came from. Push notifications and subscriptions that contain a commit message would just trigger a sync using the Query API, which would ensure all previous commits were applied first.

I think I understand: we can't process the same payload twice, so is the suggestion here to have an application sync trigger commit processing first? But what happens if an application sync is in progress at the time you make a commit, or halfway through the commit process?

neekolas commented on Nov 2, 2023:

> But what happens if an application sync is in progress at the time you make a commit, or halfway through the commit process?

We could have some sort of locking mechanism that prevents this; these updates should be relatively fast, and simultaneous syncs should be infrequent. But I don't think we need that. If a message fails to decrypt because it's already been processed and the keys have been discarded, we just move on to the next message. As long as both processes are going through messages sequentially and talking to the same DB, everything should just work, right?

The system is going to need to be very resilient to decryption failures since anyone can publish junk to the group topic, plus the fact that your own messages are unreadable.
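The skip-and-continue resilience described above, as a minimal sketch: a payload that fails to decrypt (already processed, junk published to the topic, or our own outbound message) is skipped rather than aborting the whole sync. The decryptor here is a stand-in for MLS decryption.

```rust
/// Process payloads sequentially, tolerating individual decryption failures.
fn process_all(
    payloads: &[Vec<u8>],
    decrypt: impl Fn(&[u8]) -> Result<String, ()>,
) -> Vec<String> {
    let mut messages = Vec::new();
    for p in payloads {
        match decrypt(p) {
            Ok(msg) => messages.push(msg),
            // Keys already discarded, junk payload, or our own message:
            // skip it and continue with the next payload.
            Err(()) => continue,
        }
    }
    messages
}

fn main() {
    // Toy decryptor: payloads starting with 0 are "undecryptable".
    let decrypt = |p: &[u8]| {
        if p.first() == Some(&0) { Err(()) } else { Ok(format!("msg:{}", p.len())) }
    };
    let payloads = vec![vec![1, 2], vec![0, 9], vec![3]];
    let msgs = process_all(&payloads, decrypt);
    // The bad payload in the middle is skipped; the rest are processed.
    assert_eq!(msgs, vec!["msg:2".to_string(), "msg:1".to_string()]);
}
```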

richardhuaaa commented:

This was a really helpful discussion. I'll try to succinctly describe the various edge cases we discussed, at least so that I understand how to implement the thing, but with that in mind I think I'm happy with this!

richardhuaaa merged commit 79c5f3e into nmolnar/mls-readme on Nov 2, 2023 (16 checks passed), and deleted the rich/state-machine-suggestions branch on November 2, 2023 at 23:20.