Suggestions for MLS state machine #303

Merged (10 commits) on Nov 2, 2023

Changes to `xmtp_mls/README.md` (31 additions, 21 deletions):

```sql
CREATE TABLE group_intents (
"kind" INT NOT NULL,
"group_id" BLOB NOT NULL,
-- Some sort of serializable blob that can be used to re-try the message if the first attempt failed due to conflict
"publish_data" BLOB NOT NULL,
-- Data needed after applying a commit, such as welcome messages
"post_commit_data" BLOB NOT NULL,
-- INTENT_STATE,
"state" INT NOT NULL,
    -- The hash of the encrypted, concrete form of the message, if it was published.
    -- (remaining columns elided from this excerpt)
);

CREATE INDEX group_intents_group_id_id ON group_intents(group_id, id);

CREATE TABLE topic_refresh_state (
"topic" TEXT PRIMARY KEY NOT NULL,
"last_message_timestamp_ns" BIGINT NOT NULL,
    -- Only allow one sync at a time per topic. This value is 0 when no sync is happening
-- All locks should be cleared on cold start
"lock_until_ns" BIGINT NOT NULL
);

```
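The `state` column on `group_intents` references an `INTENT_STATE` enum whose definition is elided from this excerpt. The following is a hedged sketch of a plausible lifecycle, in Python for illustration: only `TO_SEND` appears verbatim in this document, so the remaining variant names and the transition table are assumptions.

```python
from enum import Enum


class IntentState(Enum):
    """Plausible values for the INTENT_STATE column on group_intents.

    Only TO_SEND appears verbatim in this document; the other
    variant names are illustrative assumptions.
    """

    TO_SEND = 0    # created locally, not yet published
    PUBLISHED = 1  # payload sent; awaiting confirmation via sync
    COMMITTED = 2  # confirmed by a sync with no epoch conflict
    ERROR = 3      # permanently failed (assumption)


# Allowed transitions, including the conflict path back to TO_SEND
TRANSITIONS = {
    IntentState.TO_SEND: {IntentState.PUBLISHED},
    IntentState.PUBLISHED: {IntentState.COMMITTED, IntentState.TO_SEND, IntentState.ERROR},
    IntentState.COMMITTED: set(),
    IntentState.ERROR: set(),
}


def can_transition(src: IntentState, dst: IntentState) -> bool:
    """Return True if an intent may move from src to dst."""
    return dst in TRANSITIONS[src]
```

The key property is the `PUBLISHED -> TO_SEND` edge: a publish that loses an epoch race is not an error, it simply resets the intent for another attempt.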

## Enums
The [following diagram](https://app.excalidraw.com/s/4nwb0c8ork7/6pPH1kQDoj3) illustrates the MLS state machine:

![MLS State Machine](../img/mls-state-machine.png "MLS State Machine")

For the first version of MLS in XMTP, all members commit their own proposals immediately, and immediately discard any proposals received from other members. Future versions of XMTP will have more sophisticated logic, such as batching proposals, allowing members to commit proposals from other members, and more sophisticated validation of which proposals are permitted from which members.

### Known missing items from the state machine

- Key updates
- Processing incoming welcome messages
- Tracking group membership at the account/user level
- Permissioning for adding/removing accounts/users
- Mechanism for syncing installations under each account/user

### Add members to a group

Simplified high-level flow for adding members to a group:

1. Create a `group_intent` for adding the members
1. Sync the state of the group with the network
1. Consume Key Packages for all new members
1. Convert the intent into concrete commit and welcome messages for the current epoch
1. Write the welcome messages to the `post_commit_data` field for later
1. Publish commit message
1. Sync the state of the group with the network
1. If no conflicts: Publish welcome messages to new members.
If conflicts: Go back to step 2 and try again (reset the intent's state to `TO_SEND` and clear the `publish_data` and `post_commit_data` fields)
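The steps above can be sketched as a retry loop. This is a minimal illustration, not the crate's actual API: `Intent`, `FakeNetwork`, and all helper names are hypothetical stand-ins, and the simulated network reports one epoch conflict before succeeding.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Intent:
    """Mirrors the group_intents row: state plus the two data blobs."""
    kind: str
    state: str = "TO_SEND"
    publish_data: Optional[bytes] = None
    post_commit_data: Optional[bytes] = None


class FakeNetwork:
    """Hypothetical network that simulates one epoch conflict, then success."""

    def __init__(self):
        self.conflicts_left = 1

    def sync_group(self):
        pass  # pull and process any payloads for the group

    def consume_key_packages(self):
        return [b"kp1", b"kp2"]  # one Key Package per new member

    def build_commit(self, kps):
        return b"commit"  # concrete commit for the current epoch

    def build_welcomes(self, kps):
        return b"welcomes"  # welcome messages for the new members

    def publish(self, data):
        pass

    def sync_group_has_conflict(self):
        if self.conflicts_left:
            self.conflicts_left -= 1
            return True  # another member's commit won the epoch
        return False

    def publish_welcomes(self, data):
        pass


def add_members(intent: Intent, network, max_attempts: int = 5) -> bool:
    """Drive an add-members intent to completion, retrying on conflicts."""
    for _ in range(max_attempts):
        network.sync_group()                                 # sync with the network
        kps = network.consume_key_packages()                 # consume Key Packages
        intent.publish_data = network.build_commit(kps)      # commit for current epoch
        intent.post_commit_data = network.build_welcomes(kps)
        network.publish(intent.publish_data)                 # publish commit
        if network.sync_group_has_conflict():
            # Conflict: reset the intent and retry from the top
            intent.state = "TO_SEND"
            intent.publish_data = None
            intent.post_commit_data = None
            continue
        network.publish_welcomes(intent.post_commit_data)    # welcome new members
        intent.state = "COMMITTED"
        return True
    return False
```

The same loop shape applies to the remove-members and send-message flows below; only the payload-building steps differ.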

### Remove members from a group

Simplified high-level flow for removing members from a group:

1. Create a `group_intent` for removing the members
1. Sync the state of the group with the network
1. Convert the intent into concrete commit for the current epoch
1. Publish commit to the network
1. Sync the state of the group with the network
1. If no conflicts: Done.
If conflicts: Go back to step 2 and try again (reset the intent's state to `TO_SEND` and clear the `publish_data` and `post_commit_data` fields)

### Send a message

Simplified high-level flow for sending a group message:
1. Convert the intent into a concrete message for the current epoch
1. Publish message to the network
1. Sync the state of the group with the network (can be debounced or otherwise only done periodically)
1. If no conflicts: Mark the message as committed.
If conflicts: Go back to step 2 and try again (reset the intent's state to `TO_SEND` and clear the `publish_data` and `post_commit_data` fields)

### Syncing group state

The server maintains an inbound topic for each group, and a single inbound topic for the client's identity. For each topic, the client maintains a `topic_refresh_state` row that stores the `last_message_timestamp_ns` and `lock_until_ns` fields.

1. In a single transaction, validate that `lock_until_ns` is not greater than `now()`, set it to a time in the future, and fetch `last_message_timestamp_ns`
1. Fetch all payloads greater than the timestamp from the server
1. Sequentially process each payload. For each payload, update the `last_message_timestamp_ns` and any corresponding database writes for that payload in a single transaction
1. When the sync is complete, release the lock by setting `lock_until_ns` to 0
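The lock-acquisition and release steps above can be sketched against the `topic_refresh_state` table using SQLite. This is an illustration of the transaction pattern, not the client's actual storage code; the 30-second lease duration is an assumption.

```python
import sqlite3

LEASE_NS = 30_000_000_000  # 30s lock lease; the real duration is an assumption


def try_start_sync(conn: sqlite3.Connection, topic: str, now_ns: int):
    """Atomically acquire the per-topic sync lock and read the cursor.

    Returns last_message_timestamp_ns on success, or None if another
    sync currently holds the lock (lock_until_ns is in the future).
    """
    with conn:  # single transaction: committed on exit, rolled back on error
        cur = conn.execute(
            "UPDATE topic_refresh_state SET lock_until_ns = ? "
            "WHERE topic = ? AND lock_until_ns <= ?",
            (now_ns + LEASE_NS, topic, now_ns),
        )
        if cur.rowcount == 0:
            return None  # lock held by a concurrent sync
        row = conn.execute(
            "SELECT last_message_timestamp_ns FROM topic_refresh_state "
            "WHERE topic = ?",
            (topic,),
        ).fetchone()
        return row[0]


def finish_sync(conn: sqlite3.Connection, topic: str, new_timestamp_ns: int):
    """Persist the new cursor and release the lock by setting it to 0."""
    with conn:
        conn.execute(
            "UPDATE topic_refresh_state "
            "SET last_message_timestamp_ns = ?, lock_until_ns = 0 "
            "WHERE topic = ?",
            (new_timestamp_ns, topic),
        )
```

Because the `UPDATE` both checks and sets `lock_until_ns` in one statement, two concurrent syncs cannot both acquire the lock; the loser sees `rowcount == 0` and backs off.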

This flow is similar regardless of whether the sync happens via a poll-based or a subscription-based mechanism. For a subscription-based mechanism, the lock is obtained at the start of the subscription and extended on a heartbeat interval until the subscription is closed.
Contributor:

In the absence of really robust subscriptions (which I think is going to be quite hard to implement for V1) my recommendation would be to not adjust any locks on the subscription and allow state syncs to happen in parallel.

Contributor Author:

If subscriptions are not robust, I feel like it causes a lot of downstream problems.

  1. If a commit message is lost, subsequent payloads are unreadable. You can trigger a pull at this point, but it's added complexity.
  2. If we don't lock syncs while subscriptions are in progress, then we have concurrent processing of the same payloads, meaning one side or the other will fail. It becomes unclear if failures are due to bad payloads/missing epoch messages, or if they are due to parallel processing.
  3. It's not the best dev experience for consumers of the SDK.

I have some suggestions:

  1. We could make subscriptions more robust. We could pass the last_synced_payload_timestamp to the server when initiating subscriptions. Alternatively, we could have the server record the previous payload timestamp on each payload, so that the client can detect when it is missing payloads and re-sync.
  2. We could omit subscriptions from the initial MLS milestones and fix them up in later milestones.

Contributor (@neekolas), Nov 2, 2023:

> We could omit subscriptions from the initial MLS milestones and fix them up in later milestones.

We can talk to developers and see how much of a deal-breaker that would be. They'd also have to be OK with no push notifications. Together, that sounds like a tough sell, but I could be wrong.

> We could make subscriptions more robust. We could pass the last_synced_payload_timestamp to the server when initiating subscriptions. Alternatively, we could have the server record the previous payload timestamp on each payload, so that the client can detect when it is missing payloads and re-sync.

Even if we had the most robust streaming APIs, we also need to somehow deal with push notifications which are even less reliable and have less flexibility. The last_synced_payload_timestamp isn't really an option for push. Having the server record the previous payload timestamp on each payload would be possible in our StreamAllMessages API which powers push. Would require some changes to our back-end to include it in the message that gets gossiped in Waku. It also requires a lookup on each publish, which is probably an acceptable performance cost.

All options feel like some variation of: "some streamed messages are just a reminder to query the topic and sync from the last timestamp of our contiguous message history". There are degrees of this.

  1. Every subscription message or push notification just tells the client to sync via the query API. We don't even look at the payloads.
  2. Application messages can be processed directly from a stream/push notification if you are in the correct epoch. We inspect the payload, and if it is a Commit/Proposal, we abort processing the streamed message (and roll back the database changes that delete the decryption keys) and sync from our last known timestamp via the query API
  3. Any streamed message or push notification where the previous_payload_timestamp lines up with our local state gets processed immediately. If we haven't processed the message with the previous_payload_timestamp we do a sync from the network from the last synced payload timestamp.

The third option likely gives us the fewest queries to the network depending on the % of out-of-order messages received via streaming. But it is a bit of a pain to roll out. We'd need to not just update our nodes and Envelope proto schema but also the example-notification-server. And anyone using MLS would have to update their notification receiving code to handle the new field. Maybe it's worth the cost, but we should take the cost into consideration.
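Read as code, the third option amounts to a contiguity check on each streamed payload. This is a sketch under the comment's own terminology: the `previous_payload_ts` field mirrors the proposed `previous_payload_timestamp`, and the callback names are hypothetical.

```python
def handle_streamed_payload(
    payload_ts: int,
    previous_payload_ts: int,
    last_synced_ts: int,
    process,    # called with payload_ts when the stream is contiguous
    full_sync,  # called with last_synced_ts when a gap is detected
) -> int:
    """Process a streamed/push payload only if it lines up with local state.

    Returns the new last-synced timestamp: the payload's own timestamp
    when processed in place, or whatever the fallback sync reached.
    """
    if previous_payload_ts == last_synced_ts:
        process(payload_ts)  # contiguous: safe to process immediately
        return payload_ts
    # Gap detected: fall back to a query-API sync from the last known point
    return full_sync(last_synced_ts)
```

Option 1 is the degenerate case where the contiguity check always fails and every notification triggers `full_sync`.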

Contributor:

I do like the third option. I hadn't considered adding the previous_payload_timestamp to all our streamed envelopes.

Given that all options look like option 1 in some cases, I wonder if we start with that to unblock everything and give us streaming methods quite easily. Then we can layer on option 2 or 3 depending on how we feel about the scope of changes.

Contributor Author:

That's a really good framing, thanks for writing it up. Okay, this makes sense to me!


### Updating your list of conversations
