Clarify that arbitrary unicode is allowed in user/room IDs and room aliases #1506

tulir · 2023-05-01T10:23:49Z

Nobody cares enough to implement matrix-org/matrix-spec-proposals#2828 or other spec changes that would disallow arbitrary unicode in the three identifiers, so here's a spec clarification to document how Matrix currently works.

Preview: https://pr1506--matrix-spec-previews.netlify.app

…liases. Signed-off-by: Tulir Asokan <tulir@maunium.net>

anoadragon453

This appears like the right thing to do to me, but I'd like a second opinion.

richvdh · 2023-05-16T13:52:26Z

For the record: whilst I would love to see MSC2828 implemented, we can't actually do that without a room version bump, so, regardless of what happens with MSC2828, we have to document the status quo as it is :(.

richvdh · 2023-05-16T13:47:24Z

content/appendices.md

+##### User IDs over federation
+
+Due to a lack of validation in original Matrix homeserver implementations,
+the localpart of user IDs over federation may contain any valid unicode
+codepoints except `:`. A future spec change may create a new room version
+to disallow such user IDs.


aaaaaaaargh everything is awful

I can't see any difference between this and "Historical user IDs". They are both "IDs you're not supposed to create but might have to deal with because apparently validation is hard"

do they even have to be valid unicode codepoints?

aaaaaaaargh everything is awful

👍

I can't see any difference between this and "Historical user IDs". They are both "IDs you're not supposed to create but might have to deal with because apparently validation is hard"

The only difference is that there are real users who have created historical user IDs on normal clients and servers. Arbitrary unicode ids exist in the wild too, but it has never been possible to create them without modifying the server.

do they even have to be valid unicode codepoints?

I think server implementations try to decode utf-8 sufficiently early so that invalid unicode user IDs explode before being persisted anywhere, but I'm not completely sure.

The only difference is that there are real users who have created historical user IDs on normal clients and servers. Arbitrary unicode ids exist in the wild too, but it has never been possible to create them without modifying the server.

I don't see this as a significant difference for someone implementing a server (or a client). The fact is that they are both cr@p that you have to deal with.

I think server implementations try to decode utf-8 sufficiently early so that invalid unicode user IDs explode before being persisted anywhere, but I'm not completely sure.

ok, that seems fair. Let's assume they do have to be UTF-8 until someone proves otherwise :)

(though, per #365 (comment): "valid UTF-8" is open to definition too...)

richvdh · 2023-05-16T13:49:58Z

content/appendices.md

+The localpart of a room alias may contain any valid unicode codepoints
+except `:`.
+
+Room aliases MUST NOT exceed 255 bytes as UTF-8 (including the `#` sigil
+and the domain).


similar questions: do they really have to be valid UTF-8? It probably depends on where they are used, I suppose...

I think we should assume yes until proven otherwise, when we can update the spec again.

turt2live · 2023-08-09T18:13:18Z

@richvdh I've updated this with permission from tulir - please take another look.

content/appendices.md

kegsay · 2023-08-15T12:50:17Z

changelogs/appendices/newsfragments/1506.clarification

@@ -0,0 +1 @@
+Clarify that arbitrary unicode is allowed in user/room IDs and room aliases.


Strongly disagree on this.

The original vintage 2014 Synapse implementation only allowed non URL quotable characters in user localparts when registering for an account. Unfortunately, federation did not apply the same check, alluded to by other comments here:

Arbitrary unicode ids exist in the wild too, but it has never been possible to create them without modifying the server.

If you are playing silly games (removing validation checks on your server) we should NOT set the precedent that we are just going to accept your games imo, otherwise where do you draw the line? Some comments in this proposal mention about excluding the null byte, but why? Sure, postgres cannot represent it when used with TEXT columns, but hey I am using a modified Synapse which doesn't have this problem, so why are we allowing some silly games but not others?

The result is an inconsistent mess, and it was never designed to be that way. This isn't like other cases where "hey this is what synapse does, in the spec it goes" because in order to get the failure mode of unicode characters you need a malicious and/or buggy actor.

The consequences of making this the rule in the specification, and hence removing these checks in Dendrite, Conduit, et al is an increased risk of homograph attacks. I cannot and will not support increasing the attack surface of Matrix just because a few people back in 2014 removed validation and sent unicode user IDs into a room, which synapse accepted.

It is worth emphasising that room versions ARE NOT a get-out-of-jail-free card here, as user IDs are outside the scope of rooms. For example, the sliding sync proxy recently had an issue with unicode user IDs in device list changes. It's not hard to see how this can also be an issue with the user directory and to-device msgs, both of which sit outside of rooms.

Counter proposal: sorry folks with smiley poos as user localparts, you're going to be broken in the next release of Synapse, and we remove / subsequently ignore events with malformed user IDs aka what we should have done in 2014.

While I agree that this only really happens because of people modifying their servers, the result of enforcing those checks in one implementation and not in another can end up being outright disastrous for innocent users.

For example, if Dendrite enforces these checks but Synapse doesn't, then all it takes is for a Synapse user with an emoji localpart to make a power level change or perform any kind of power action to break the room for Dendrite users probably irreversibly.

Hell, they might not even have to perform a power action, because we might drop auth events (such as the user joining to begin with) or refuse to pull in prev events (from normal timeline events), which in turn causes us to run state resolution with a different set of input events (which can result in a different output state set or a complete state reset) or to accidentally propagate broken state to other servers when they ask us for /state_ids or /state.

This situation 100% sucks but Matrix only functions if implementations agree to handle these things in the same way, otherwise different implementations will never arrive at a consistent state. We already see this happen due to other corner cases and it just ends up feeling terrible for all users involved.

As for room versions, it is true that it's not a great solution to the problem due to the fact that it still leaks into device updates or other areas, but at least it's possible in a future room version to make sure that users with invalid localparts can't join the rooms to begin with. That is a huge step towards stamping out invalid localparts.

kegsay · 2023-08-15T12:54:00Z

content/appendices.md

+Servers SHOULD NOT produce user IDs with localparts outside of the following
+character set, and SHOULD NOT forward such user IDs to clients when referenced
+outside the context of an event. For example, device list updates from "invalid"
+user IDs would be dropped by the receiving server.


This is worse than just removing events by these users, as things will only half work (E2EE will entirely break for them in weird and wonderful ways).

If you set your localpart to a smiley poo, E2EE will break for you. That's fine by me. What I care about is you not being able to DoS a room by setting your localpart to a smiley poo, which is why we have to accept events over federation whose senders are invalid; we don't have to do much else to help such users.

richvdh · 2024-02-27T13:52:46Z

@tulir, @turt2live: would one of you be able to address the questions in my review #1506 (review)?

Co-authored-by: Richard van der Hoff <1389908+richvdh@users.noreply.github.com>

turt2live · 2024-03-03T00:14:07Z

[...], @turt2live: would one of you be able to address the questions in my review #1506 (review)?

Which questions? The link hits a generic review. If there's outstanding comments needed, please ping inline on the threads.

richvdh · 2024-03-04T11:54:33Z

Which questions? The link hits a generic review. If there's outstanding comments needed, please ping inline on the threads.

Looks like @tulir has updated the PR and addressed my questions since I wrote that. Thanks @tulir!

richvdh · 2024-03-12T21:57:14Z

content/appendices.md

+Servers SHOULD NOT produce user IDs with localparts outside of the following
+character set, and SHOULD NOT forward such user IDs to clients when referenced
+outside the context of an event. For example, device list updates from "invalid"
+user IDs would be dropped by the receiving server.


How about this:

Suggested change

Servers SHOULD NOT produce user IDs with localparts outside of the following

character set, and SHOULD NOT forward such user IDs to clients when referenced

outside the context of an event. For example, device list updates from "invalid"

user IDs would be dropped by the receiving server.

User IDs with localparts containing characters outside the range U+0021 to U+007E, or with

an empty localpart, are considered non-compliant. For current room versions, servers must

still accept events using such user IDs over federation; however they SHOULD NOT forward

such user IDs to clients when referenced outside the context of an event. For example,

device list updates from non-compliant user IDs would be dropped by the receiving server.

Clarify that arbitrary unicode is allowed in user/room IDs and room a…

6215072

…liases. Signed-off-by: Tulir Asokan <tulir@maunium.net>

tulir requested a review from a team as a code owner May 1, 2023 10:23

anoadragon453 reviewed May 4, 2023

View reviewed changes

anoadragon453 requested a review from a team May 4, 2023 18:55

clokep mentioned this pull request May 5, 2023

MSC4009: Expanding the Matrix ID grammar to enable E.164 IDs matrix-org/matrix-spec-proposals#4009

Merged

richvdh requested changes May 16, 2023

View reviewed changes

turt2live added 2 commits August 9, 2023 12:04

Merge branch 'main' into clarify-identifier-localparts

ec55c38

Clarify historical ID set further

459634f

turt2live requested a review from richvdh August 9, 2023 18:13

richvdh requested changes Aug 10, 2023

View reviewed changes

neilalexander mentioned this pull request Aug 15, 2023

Disable character range check for user IDs matrix-org/gomatrixserverlib#410

Open

kegsay suggested changes Aug 15, 2023

View reviewed changes

kegsay reviewed Aug 15, 2023

View reviewed changes

tulir and others added 2 commits February 28, 2024 00:59

Apply suggestions from code review

97381fd

Co-authored-by: Richard van der Hoff <1389908+richvdh@users.noreply.github.com>

Clarify allowed control characters

9b75333

richvdh self-requested a review March 4, 2024 11:54

richvdh reviewed Mar 12, 2024

View reviewed changes

BramvdnHeuvel mentioned this pull request Apr 12, 2024

Expose Matrix.User noordstar/elm-matrix-sdk-beta#19

Merged

Kladki mentioned this pull request Jul 8, 2024

Servers don't enforce PDU format #1904

Open

tulir mentioned this pull request Jul 16, 2024

Update grammar on Matrix Ids to be more spec compliant and render error instead of infinite loading in room member list screen element-hq/element-x-android#3206

Merged

10 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Clarify that arbitrary unicode is allowed in user/room IDs and room aliases #1506

Clarify that arbitrary unicode is allowed in user/room IDs and room aliases #1506

tulir commented May 1, 2023 •

edited by github-actions bot

Loading

anoadragon453 left a comment

richvdh commented May 16, 2023

richvdh May 16, 2023

richvdh May 16, 2023

tulir May 16, 2023

richvdh May 16, 2023

richvdh May 16, 2023

richvdh May 16, 2023

richvdh May 16, 2023

turt2live Aug 9, 2023

turt2live commented Aug 9, 2023

kegsay Aug 15, 2023

neilalexander Aug 15, 2023

kegsay Aug 15, 2023

richvdh Mar 12, 2024

richvdh commented Feb 27, 2024

turt2live commented Mar 3, 2024

richvdh commented Mar 4, 2024

richvdh Mar 12, 2024

		@@ -0,0 +1 @@
		Clarify that arbitrary unicode is allowed in user/room IDs and room aliases.

-Servers SHOULD NOT produce user IDs with localparts outside of the following
-character set, and SHOULD NOT forward such user IDs to clients when referenced
-outside the context of an event. For example, device list updates from "invalid"
-user IDs would be dropped by the receiving server.
+User IDs with localparts containing characters outside the range U+0021 to U+007E, or with
+an empty localpart, are considered non-compliant. For current room versions, servers must
+still accept events using such user IDs over federation; however they SHOULD NOT forward
+such user IDs to clients when referenced outside the context of an event. For example,
+device list updates from non-compliant user IDs would be dropped by the receiving server.

Clarify that arbitrary unicode is allowed in user/room IDs and room aliases #1506

Are you sure you want to change the base?

Clarify that arbitrary unicode is allowed in user/room IDs and room aliases #1506

Conversation

tulir commented May 1, 2023 • edited by github-actions bot Loading

anoadragon453 left a comment

Choose a reason for hiding this comment

richvdh commented May 16, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

turt2live commented Aug 9, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

richvdh commented Feb 27, 2024

turt2live commented Mar 3, 2024

richvdh commented Mar 4, 2024

Choose a reason for hiding this comment

tulir commented May 1, 2023 •

edited by github-actions bot

Loading