Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

users mysteriously ejected from HQ #7742

Closed
richvdh opened this issue Jun 25, 2020 · 20 comments
Closed

users mysteriously ejected from HQ #7742

richvdh opened this issue Jun 25, 2020 · 20 comments
Assignees
Labels
z-bug (Deprecated Label)

Comments

@richvdh
Copy link
Member

richvdh commented Jun 25, 2020

last night, at least 3 users (grin, sbug, lieuwe) found themselves no longer part of matrix HQ and had to rejoin. It would be good to dig into what happened

@richvdh
Copy link
Member Author

richvdh commented Jun 25, 2020

we have logs from lieuwe which include the moderately interesting:

rav@fred:~$ grep 'persist_events-2666' /tmp/mozilla_rav0/synapse-#matrix-leave.log
2020-06-24 20:14:01,960 - synapse.state - 495 - INFO - persist_events-2666 - Resolving state for !OGEhHVWSdvArJzumhm:matrix.org with 2 groups
2020-06-24 20:14:01,965 - synapse.state - 518 - INFO - persist_events-2666 - Resolving conflicted state for '!OGEhHVWSdvArJzumhm:matrix.org'
2020-06-24 20:14:11,871 - synapse.storage.persist_events - 421 - INFO - persist_events-2666 - Server no longer in room !OGEhHVWSdvArJzumhm:matrix.org

... which suggests that state resolution resulted in their server deciding they were no longer in the room. We think those logs are in

This sounds very much like #6774.

It would be interesting to try to figure out:

  • whether all three users were affected at the same time (might need to ask lieuwe for confirmation of their timezone)
  • whether other servers also reached the same conclusions at the same time about those users being state-reset out of the room
  • ... why this happened.

@erikjohnston
Copy link
Member

erikjohnston commented Jun 25, 2020

This happened to me too, the logs are somewhat fun (and match lieuw):

2020-06-24 19:13:34,026 - synapse.handlers.federation - 176 - INFO - PUT-851530-$-hIQfrHyUVY0tdT-2cBcGiiXlOMXJPQ932RnqTKfQPg - handling received PDU: <FrozenEventV3 event_id='$-hIQfrHyUVY0tdT-2cBcGiiXlOMXJPQ932RnqTKfQPg', type='m.room.message', state_key='None'>
2020-06-24 19:13:34,034 - synapse.handlers.federation - 256 - INFO - PUT-851530-$-hIQfrHyUVY0tdT-2cBcGiiXlOMXJPQ932RnqTKfQPg - [!OGEhHVWSdvArJzumhm:matrix.org $-hIQfrHyUVY0tdT-2cBcGiiXlOMXJPQ932RnqTKfQPg] Acquiring room lock to fetch 1 missing prev_events: ['$xRWlkdZHyd3w0A6aoS2nfbCYjFT1f_e5Zt3eegLXjqg']
2020-06-24 19:13:34,034 - synapse.handlers.federation - 264 - INFO - PUT-851530-$-hIQfrHyUVY0tdT-2cBcGiiXlOMXJPQ932RnqTKfQPg - [!OGEhHVWSdvArJzumhm:matrix.org $-hIQfrHyUVY0tdT-2cBcGiiXlOMXJPQ932RnqTKfQPg] Acquired room lock to fetch 1 missing prev_events
2020-06-24 19:13:34,043 - synapse.handlers.federation - 440 - INFO - PUT-851530-$-hIQfrHyUVY0tdT-2cBcGiiXlOMXJPQ932RnqTKfQPg - [!OGEhHVWSdvArJzumhm:matrix.org $-hIQfrHyUVY0tdT-2cBcGiiXlOMXJPQ932RnqTKfQPg]: Requesting missing events between ['$3cLuclXCfOk55e-vI9dEkIxPIlZk5kyilh2H9EZKhKY'] and $-hIQfrHyUVY0tdT-2cBcGiiXlOMXJPQ932RnqTKfQPg
2020-06-24 19:13:34,044 - synapse.http.matrixfederationclient - 408 - INFO - PUT-851530-$-hIQfrHyUVY0tdT-2cBcGiiXlOMXJPQ932RnqTKfQPg- - {POST-O-1563794} [tchncs.de] Sending request: POST matrix://tchncs.de/_matrix/federation/v1/get_missing_events/%21OGEhHVWSdvArJzumhm%3Amatrix.org; timeout 60.000000s
2020-06-24 19:13:34,047 - synapse.http.federation.matrix_federation_agent - 250 - INFO - PUT-851530-$-hIQfrHyUVY0tdT-2cBcGiiXlOMXJPQ932RnqTKfQPg- - Connecting to tchncs.de:8448
2020-06-24 19:13:41,310 - synapse.http.matrixfederationclient - 158 - INFO - PUT-851530-$-hIQfrHyUVY0tdT-2cBcGiiXlOMXJPQ932RnqTKfQPg - {POST-O-1563794} [tchncs.de] Completed: 200 OK
2020-06-24 19:13:41,313 - synapse.handlers.federation - 515 - INFO - PUT-851530-$-hIQfrHyUVY0tdT-2cBcGiiXlOMXJPQ932RnqTKfQPg - [!OGEhHVWSdvArJzumhm:matrix.org $-hIQfrHyUVY0tdT-2cBcGiiXlOMXJPQ932RnqTKfQPg]: Got 3 prev_events: [<FrozenEventV3 event_id='$ucWEzHZLdZ0IjojmIxb9MdJ8_Y0jGm6l_5WIMFGxPQg', type='m.room.message', state_key='None'>, <FrozenEventV3 event_id='$7CJFlCOP2ESM1rMVPu1_fpQ28yrv9SFV4uv0xb9m674', type='m.room.message', state_key='None'>, <FrozenEventV3 event_id='$xRWlkdZHyd3w0A6aoS2nfbCYjFT1f_e5Zt3eegLXjqg', type='m.room.message', state_key='None'>]
2020-06-24 19:13:41,314 - synapse.handlers.federation - 528 - INFO - PUT-851530-$-hIQfrHyUVY0tdT-2cBcGiiXlOMXJPQ932RnqTKfQPg - [!OGEhHVWSdvArJzumhm:matrix.org $-hIQfrHyUVY0tdT-2cBcGiiXlOMXJPQ932RnqTKfQPg] Handling received prev_event $ucWEzHZLdZ0IjojmIxb9MdJ8_Y0jGm6l_5WIMFGxPQg
2020-06-24 19:13:41,314 - synapse.handlers.federation - 176 - INFO - PUT-851530-$-hIQfrHyUVY0tdT-2cBcGiiXlOMXJPQ932RnqTKfQPg-$ucWEzHZLdZ0IjojmIxb9MdJ8_Y0jGm6l_5WIMFGxPQg - handling received PDU: <FrozenEventV3 event_id='$ucWEzHZLdZ0IjojmIxb9MdJ8_Y0jGm6l_5WIMFGxPQg', type='m.room.message', state_key='None'>
2020-06-24 19:13:44,772 - synapse.handlers.federation - 335 - INFO - PUT-851530-$-hIQfrHyUVY0tdT-2cBcGiiXlOMXJPQ932RnqTKfQPg-$ucWEzHZLdZ0IjojmIxb9MdJ8_Y0jGm6l_5WIMFGxPQg - Event $ucWEzHZLdZ0IjojmIxb9MdJ8_Y0jGm6l_5WIMFGxPQg is missing prev_events: calculating state for a backwards extremity
2020-06-24 19:13:44,772 - synapse.handlers.federation - 357 - INFO - PUT-851530-$-hIQfrHyUVY0tdT-2cBcGiiXlOMXJPQ932RnqTKfQPg-$ucWEzHZLdZ0IjojmIxb9MdJ8_Y0jGm6l_5WIMFGxPQg - Requesting state at missing prev_event $ucWEzHZLdZ0IjojmIxb9MdJ8_Y0jGm6l_5WIMFGxPQg
2020-06-24 19:13:44,773 - synapse.http.matrixfederationclient - 408 - INFO - PUT-851530-$-hIQfrHyUVY0tdT-2cBcGiiXlOMXJPQ932RnqTKfQPg-$ucWEzHZLdZ0IjojmIxb9MdJ8_Y0jGm6l_5WIMFGxPQg-$z4389liaANgOTs1qmkiII7LDnvXIhLfu2dUJ0z3brjc- - {GET-O-1563795} [tchncs.de] Sending request: GET matrix://tchncs.de/_matrix/federation/v1/state_ids/%21OGEhHVWSdvArJzumhm%3Amatrix.org?event_id=%24z4389liaANgOTs1qmkiII7LDnvXIhLfu2dUJ0z3brjc; timeout 60.000000s
2020-06-24 19:13:47,164 - synapse.http.matrixfederationclient - 461 - INFO - PUT-851530-$-hIQfrHyUVY0tdT-2cBcGiiXlOMXJPQ932RnqTKfQPg-$ucWEzHZLdZ0IjojmIxb9MdJ8_Y0jGm6l_5WIMFGxPQg-$z4389liaANgOTs1qmkiII7LDnvXIhLfu2dUJ0z3brjc- - {GET-O-1563795} [tchncs.de] Got response headers: 404 Not Found
2020-06-24 19:13:47,165 - synapse.http.matrixfederationclient - 536 - WARNING - PUT-851530-$-hIQfrHyUVY0tdT-2cBcGiiXlOMXJPQ932RnqTKfQPg-$ucWEzHZLdZ0IjojmIxb9MdJ8_Y0jGm6l_5WIMFGxPQg-$z4389liaANgOTs1qmkiII7LDnvXIhLfu2dUJ0z3brjc- - {GET-O-1563795} [tchncs.de] Request failed: GET matrix://tchncs.de/_matrix/federation/v1/state_ids/%21OGEhHVWSdvArJzumhm%3Amatrix.org?event_id=%24z4389liaANgOTs1qmkiII7LDnvXIhLfu2dUJ0z3brjc: HttpResponseException("404: b'Not Found'")
2020-06-24 19:13:47,166 - synapse.handlers.federation - 400 - WARNING - PUT-851530-$-hIQfrHyUVY0tdT-2cBcGiiXlOMXJPQ932RnqTKfQPg-$ucWEzHZLdZ0IjojmIxb9MdJ8_Y0jGm6l_5WIMFGxPQg - [!OGEhHVWSdvArJzumhm:matrix.org $ucWEzHZLdZ0IjojmIxb9MdJ8_Y0jGm6l_5WIMFGxPQg] Error attempting to resolve state at missing prev_events
2020-06-24 19:13:47,166 - synapse.handlers.federation - 539 - WARNING - PUT-851530-$-hIQfrHyUVY0tdT-2cBcGiiXlOMXJPQ932RnqTKfQPg-$ucWEzHZLdZ0IjojmIxb9MdJ8_Y0jGm6l_5WIMFGxPQg - [!OGEhHVWSdvArJzumhm:matrix.org $-hIQfrHyUVY0tdT-2cBcGiiXlOMXJPQ932RnqTKfQPg] Received prev_event $ucWEzHZLdZ0IjojmIxb9MdJ8_Y0jGm6l_5WIMFGxPQg failed history check.
2020-06-24 19:13:47,166 - synapse.handlers.federation - 528 - INFO - PUT-851530-$-hIQfrHyUVY0tdT-2cBcGiiXlOMXJPQ932RnqTKfQPg - [!OGEhHVWSdvArJzumhm:matrix.org $-hIQfrHyUVY0tdT-2cBcGiiXlOMXJPQ932RnqTKfQPg] Handling received prev_event $7CJFlCOP2ESM1rMVPu1_fpQ28yrv9SFV4uv0xb9m674
2020-06-24 19:13:47,166 - synapse.handlers.federation - 176 - INFO - PUT-851530-$-hIQfrHyUVY0tdT-2cBcGiiXlOMXJPQ932RnqTKfQPg-$7CJFlCOP2ESM1rMVPu1_fpQ28yrv9SFV4uv0xb9m674 - handling received PDU: <FrozenEventV3 event_id='$7CJFlCOP2ESM1rMVPu1_fpQ28yrv9SFV4uv0xb9m674', type='m.room.message', state_key='None'>
2020-06-24 19:13:47,172 - synapse.handlers.federation - 335 - INFO - PUT-851530-$-hIQfrHyUVY0tdT-2cBcGiiXlOMXJPQ932RnqTKfQPg-$7CJFlCOP2ESM1rMVPu1_fpQ28yrv9SFV4uv0xb9m674 - Event $7CJFlCOP2ESM1rMVPu1_fpQ28yrv9SFV4uv0xb9m674 is missing prev_events: calculating state for a backwards extremity
2020-06-24 19:13:47,173 - synapse.handlers.federation - 357 - INFO - PUT-851530-$-hIQfrHyUVY0tdT-2cBcGiiXlOMXJPQ932RnqTKfQPg-$7CJFlCOP2ESM1rMVPu1_fpQ28yrv9SFV4uv0xb9m674 - Requesting state at missing prev_event $7CJFlCOP2ESM1rMVPu1_fpQ28yrv9SFV4uv0xb9m674
2020-06-24 19:13:47,174 - synapse.http.matrixfederationclient - 408 - INFO - PUT-851530-$-hIQfrHyUVY0tdT-2cBcGiiXlOMXJPQ932RnqTKfQPg-$7CJFlCOP2ESM1rMVPu1_fpQ28yrv9SFV4uv0xb9m674-$ucWEzHZLdZ0IjojmIxb9MdJ8_Y0jGm6l_5WIMFGxPQg- - {GET-O-1563796} [tchncs.de] Sending request: GET matrix://tchncs.de/_matrix/federation/v1/state_ids/%21OGEhHVWSdvArJzumhm%3Amatrix.org?event_id=%24ucWEzHZLdZ0IjojmIxb9MdJ8_Y0jGm6l_5WIMFGxPQg; timeout 60.000000s
2020-06-24 19:13:50,287 - synapse.http.matrixfederationclient - 158 - INFO - PUT-851530-$-hIQfrHyUVY0tdT-2cBcGiiXlOMXJPQ932RnqTKfQPg-$7CJFlCOP2ESM1rMVPu1_fpQ28yrv9SFV4uv0xb9m674-$ucWEzHZLdZ0IjojmIxb9MdJ8_Y0jGm6l_5WIMFGxPQg - {GET-O-1563796} [tchncs.de] Completed: 200 OK
2020-06-24 19:13:50,292 - synapse.http.matrixfederationclient - 408 - INFO - PUT-851530-$-hIQfrHyUVY0tdT-2cBcGiiXlOMXJPQ932RnqTKfQPg-$7CJFlCOP2ESM1rMVPu1_fpQ28yrv9SFV4uv0xb9m674-$ucWEzHZLdZ0IjojmIxb9MdJ8_Y0jGm6l_5WIMFGxPQg-$ucWEzHZLdZ0IjojmIxb9MdJ8_Y0jGm6l_5WIMFGxPQg- - {GET-O-1563798} [tchncs.de] Sending request: GET matrix://tchncs.de/_matrix/federation/v1/event/%24ucWEzHZLdZ0IjojmIxb9MdJ8_Y0jGm6l_5WIMFGxPQg; timeout 60.000000s
2020-06-24 19:13:51,684 - synapse.http.matrixfederationclient - 158 - INFO - PUT-851530-$-hIQfrHyUVY0tdT-2cBcGiiXlOMXJPQ932RnqTKfQPg-$7CJFlCOP2ESM1rMVPu1_fpQ28yrv9SFV4uv0xb9m674-$ucWEzHZLdZ0IjojmIxb9MdJ8_Y0jGm6l_5WIMFGxPQg-$ucWEzHZLdZ0IjojmIxb9MdJ8_Y0jGm6l_5WIMFGxPQg - {GET-O-1563798} [tchncs.de] Completed: 200 OK
2020-06-24 19:13:51,689 - synapse.handlers.federation - 2340 - INFO - PUT-851530-$-hIQfrHyUVY0tdT-2cBcGiiXlOMXJPQ932RnqTKfQPg-$7CJFlCOP2ESM1rMVPu1_fpQ28yrv9SFV4uv0xb9m674-$ucWEzHZLdZ0IjojmIxb9MdJ8_Y0jGm6l_5WIMFGxPQg-$ucWEzHZLdZ0IjojmIxb9MdJ8_Y0jGm6l_5WIMFGxPQg - Skipping auth_event fetch for outlier
2020-06-24 19:13:51,725 - synapse.handlers.federation - 2350 - INFO - PUT-851530-$-hIQfrHyUVY0tdT-2cBcGiiXlOMXJPQ932RnqTKfQPg-$7CJFlCOP2ESM1rMVPu1_fpQ28yrv9SFV4uv0xb9m674 - auth_events refers to events which are not in our calculated auth chain: {'$9s0TWzih_V7KH4TJq4RVK8fwXS6ladAoRF3zjh3j_PQ', '$MaPUZG2fNleIE_gUYDmOy-nj0Zom8mQ5FNM7gEAVI6I', '$Mu9qujTV7no2PtHgEILrPOib5TJ7YMyH92lWIBPB_sg'}
2020-06-24 19:13:51,725 - synapse.state - 406 - INFO - PUT-851530-$-hIQfrHyUVY0tdT-2cBcGiiXlOMXJPQ932RnqTKfQPg-$7CJFlCOP2ESM1rMVPu1_fpQ28yrv9SFV4uv0xb9m674 - Resolving state for !OGEhHVWSdvArJzumhm:matrix.org with 2 groups
2020-06-24 19:13:51,743 - synapse.handlers.federation - 2389 - INFO - PUT-851530-$-hIQfrHyUVY0tdT-2cBcGiiXlOMXJPQ932RnqTKfQPg-$7CJFlCOP2ESM1rMVPu1_fpQ28yrv9SFV4uv0xb9m674 - After state res: updating auth_events with new state {('m.room.create', ''): '$MaPUZG2fNleIE_gUYDmOy-nj0Zom8mQ5FNM7gEAVI6I', ('m.room.member', '@abuse:matrix.org'): '$3Z5K-4YJq0uYYAIplFHm7V8vRZgfjCoRV74N8wVWiWw', ('m.room.power_levels', ''): '$9s0TWzih_V7KH4TJq4RVK8fwXS6ladAoRF3zjh3j_PQ', ('m.room.join_rules', ''): '$gAwT-xGHhlqqg9bnWaP_AyT7j6qjyeWsvCyN2dHYpHw', ('m.room.member', '@travis:t2l.io'): '$KzA3bPapYE2mCW4EpSnlAaY1xJ0LrNJjxsPLq4hmqoQ', ('m.room.member', '@grin:construct.grin.hu'): '$Mu9qujTV7no2PtHgEILrPOib5TJ7YMyH92lWIBPB_sg'}
2020-06-24 19:13:54,366 - synapse.state - 406 - INFO - PUT-851530-$-hIQfrHyUVY0tdT-2cBcGiiXlOMXJPQ932RnqTKfQPg-$7CJFlCOP2ESM1rMVPu1_fpQ28yrv9SFV4uv0xb9m674 - Resolving state for !OGEhHVWSdvArJzumhm:matrix.org with 2 groups
2020-06-24 19:14:21,136 - synapse.handlers.federation - 528 - INFO - PUT-851530-$-hIQfrHyUVY0tdT-2cBcGiiXlOMXJPQ932RnqTKfQPg - [!OGEhHVWSdvArJzumhm:matrix.org $-hIQfrHyUVY0tdT-2cBcGiiXlOMXJPQ932RnqTKfQPg] Handling received prev_event $xRWlkdZHyd3w0A6aoS2nfbCYjFT1f_e5Zt3eegLXjqg
2020-06-24 19:14:21,137 - synapse.handlers.federation - 176 - INFO - PUT-851530-$-hIQfrHyUVY0tdT-2cBcGiiXlOMXJPQ932RnqTKfQPg-$xRWlkdZHyd3w0A6aoS2nfbCYjFT1f_e5Zt3eegLXjqg - handling received PDU: <FrozenEventV3 event_id='$xRWlkdZHyd3w0A6aoS2nfbCYjFT1f_e5Zt3eegLXjqg', type='m.room.message', state_key='None'>
2020-06-24 19:14:21,140 - synapse.handlers.federation - 225 - INFO - PUT-851530-$-hIQfrHyUVY0tdT-2cBcGiiXlOMXJPQ932RnqTKfQPg-$xRWlkdZHyd3w0A6aoS2nfbCYjFT1f_e5Zt3eegLXjqg - [!OGEhHVWSdvArJzumhm:matrix.org $xRWlkdZHyd3w0A6aoS2nfbCYjFT1f_e5Zt3eegLXjqg] Ignoring PDU from tchncs.de as we're not in the room
2020-06-24 19:14:21,144 - synapse.handlers.federation - 318 - WARNING - PUT-851530-$-hIQfrHyUVY0tdT-2cBcGiiXlOMXJPQ932RnqTKfQPg - [!OGEhHVWSdvArJzumhm:matrix.org $-hIQfrHyUVY0tdT-2cBcGiiXlOMXJPQ932RnqTKfQPg] Rejecting: failed to fetch 1 prev events: ['$xRWlkdZHyd3w0A6aoS2nfbCYjFT1f_e5Zt3eegLXjqg']
2020-06-24 19:14:21,144 - synapse.federation.federation_server - 279 - WARNING - PUT-851530-$-hIQfrHyUVY0tdT-2cBcGiiXlOMXJPQ932RnqTKfQPg - Error handling PDU $-hIQfrHyUVY0tdT-2cBcGiiXlOMXJPQ932RnqTKfQPg: ERROR 403: Your server isn't divulging details about prev_events referenced in this event.

$-hIQfrHyUVY0tdT-2cBcGiiXlOMXJPQ932RnqTKfQPg is the event that is triggering this

@erikjohnston
Copy link
Member

On jki.re looks like event $7CJFlCOP2ESM1rMVPu1_fpQ28yrv9SFV4uv0xb9m674 ended up hitting _update_context_for_auth_events, which in turns persists the new state here: https://github.com/matrix-org/synapse/blob/develop/synapse/handlers/federation.py#L2444-L2450. However, the existing context.state_group that is used as the prev_group is empty in the database, causing synapse to think the state of $7CJFlCOP2ESM1rMVPu1_fpQ28yrv9SFV4uv0xb9m674 is a very small subset of the state. Doing state res across that and current state causes a bunch of users to be kicked out of the room.

@clokep clokep added z-bug (Deprecated Label) p1 labels Jun 25, 2020
@erikjohnston
Copy link
Member

One interesting thing here is that if you ask servers for the state of $ucWEzHZLdZ0IjojmIxb9MdJ8_Y0jGm6l_5WIMFGxPQg (which is the prev event for $7CJFlCOP2ESM1rMVPu1_fpQ28yrv9SFV4uv0xb9m674) we get a small set of state:

{
  "pdu_ids": [
    "$MaPUZG2fNleIE_gUYDmOy-nj0Zom8mQ5FNM7gEAVI6I", "$gAwT-xGHhlqqg9bnWaP_AyT7j6qjyeWsvCyN2dHYpHw", "$3Z5K-4YJq0uYYAIplFHm7V8vRZgfjCoRV74N8wVWiWw", "$Mu9qujTV7no2PtHgEILrPOib5TJ7YMyH92lWIBPB_sg", "$KzA3bPapYE2mCW4EpSnlAaY1xJ0LrNJjxsPLq4hmqoQ", "$9s0TWzih_V7KH4TJq4RVK8fwXS6ladAoRF3zjh3j_PQ"
  ],
  "auth_chain_ids": [
    "$MzrJBHz4bPtW5p-m9mlIjk6_FfXSBrVoEqC9qPa78IQ", "$VTSRTz8yrsbhmpMTBDjl3NntWClCvoihwb1eO-yiHqw", "$0v15XhSH1i3UNZgST6wN5VMf4ItOI3CE0_PTsqlD1cE", "$KzA3bPapYE2mCW4EpSnlAaY1xJ0LrNJjxsPLq4hmqoQ", "$gArjO1RTILcA5p8pYpmc0mlGx581NMj90gEG--89DBw", "$3Z5K-4YJq0uYYAIplFHm7V8vRZgfjCoRV74N8wVWiWw", "$2YGE5c9okZ4jTkbdn5xnGAvaIAlYYBQCzp-ckeyCfEA", "$u2XxFBinIjuK-xBS9cgMiLpOcPw5TCpqQFzNC2uEVFU", "$MaPUZG2fNleIE_gUYDmOy-nj0Zom8mQ5FNM7gEAVI6I", "$WvqflEV-JrulPCpJcyG2LYuiNObeiYEeUAP_NgQwvHY", "$fKDKYB2rIc1VilsiYJoDE-ShiLLMA4MWo3cQOH_fz1c", "$yq8UADQMFWuZVR1wcx0_va7SpJTKv8fJXkI4ax0AxIg", "$9s0TWzih_V7KH4TJq4RVK8fwXS6ladAoRF3zjh3j_PQ", "$eMqfjsGLmwAnmcRNELX5mT6FDIkZiYMLnlzlqfklE8k", "$UYWIGiJzKJEDoGas39f_ArFfdU06Ygzs0dXzJrVQ3TY", "$gAwT-xGHhlqqg9bnWaP_AyT7j6qjyeWsvCyN2dHYpHw"
  ]
}

@joepie91
Copy link

joepie91 commented Jun 27, 2020

I (@joepie91:pixie.town) am no longer receiving events for Matrix HQ, but I still appear to be joined from my side, and there's no "you got kicked" screen or anything like that. The last visible event dates from June 24, 20:13:37 local time (Netherlands).

@f0x52
Copy link

f0x52 commented Jun 27, 2020

almost certainly homeserver-wide for pixie.town, and trying to send a message results in
image

@f0x52
Copy link

f0x52 commented Jun 27, 2020

my synapse logs only go back to yesterday, but lists a lot of PDU's being rejected for HQ (as we are no longer in the room),
and weirdly enough a lot of state resolution, even though we're apparently no longer in it

@lieuwex
Copy link

lieuwex commented Jun 28, 2020

might need to ask lieuwe for confirmation of their timezone

My timezone is GMT+2 (Netherlands)
So @joepie91 seems to be affected around the same time.

@f0x52
Copy link

f0x52 commented Jun 28, 2020

I seem to be back in HQ now after:
leaving room from riot (error message saying I'm not in the room)
clearing cache and reloading
#matrix:matrix.org now gives the preview, and asks if I want to join
hit join, get an error after a timeout
$magic happens in the background
and now seem to federate back and forth again, as verified via the freenode bridge

@erikjohnston
Copy link
Member

Yup, it's worth noting that this is exacerbated by the issue that state resets don't get sent down /sync (as we only send down state events in the timeline for incremental sync, so have no way of "rolling back" state). I'm completely failing to find the traffic issue for this though.

@joepie91
Copy link

leaving room from riot (error message saying I'm not in the room)

Worth noting that I didn't explicitly leave the room, but HQ is still fixed for me (on the same homeserver as @f0x52).

@richvdh
Copy link
Member Author

richvdh commented Jun 29, 2020

Yup, it's worth noting that this is exacerbated by the issue that state resets don't get sent down /sync (as we only send down state events in the timeline for incremental sync, so have no way of "rolling back" state). I'm completely failing to find the traffic issue for this though.

matrix-org/matrix-spec#1209

@erikjohnston
Copy link
Member

Saw this again. Room !gGkYcJLTzcNrfERksd:matrix.org, events involved are $6P6qG6eJSfjhIEHnIloMLnV0Zxiv4cxq4iuau6eimtI, $4IB5cJZJn7V97cW8EnTID_Wbt6PPVdHJvjKXiqaED0k, and $vi6_b4U3Z7YjOIrqzfHuX_FkD14QEnkrMh_rsSYtl9k

Ok, so:

  1. Handle $6P correctly at 18:01 (UTC)
  2. Persist $4I in the database as an outlier (it does not appear in the federation inbound logs???) at 18:06:39
  3. Receive $vi at 18:08, do some of the _update_context_for_auth_events stuff, end up persisting with a subset of state (as its prev_group 465032780 points to an empty state, which looks to be an unreferenced state group created for $vi)
  4. Receive $4I at 18:09. Mark as non-outlier. It seems to be persisted with correct state. Remove $6P from current state so its just $vi, which has incorrect state which becomes the current state

Later on, if we try and get the current state for $4IB5cJZJn7V97cW8EnTID_Wbt6PPVdHJvjKXiqaED0k on federation inbound it returns and empty state, while elsewhere it returns the correct state. I'm guessing that the state caches have gone wibbly, probably due to $vi having been given a new state group while processing. Maybe the original state group for $vi got out of sync between federation inbound and db (in the db its empty but exists)? That would then have a knock on effect as when we calculate state for future events we'll base it off that bad state group, and so we end up persisting a delta to that state group that only exists in memory and not in the DB (or other workers)

@richvdh
Copy link
Member Author

richvdh commented Jul 2, 2020

fixed in 96e9afe / synapse 1.15.2.

@richvdh richvdh closed this as completed Jul 2, 2020
@lieuwex
Copy link

lieuwex commented Jul 2, 2020

Since the changelog speaks about "malicious homeserver[s]", were the ejections in Matrix HQ done on purpose?

@Cadair
Copy link
Contributor

Cadair commented Jul 2, 2020

The changelog also states:

We are not aware of either of these vulnerabilities being exploited in the wild

so no.

@joepie91
Copy link

joepie91 commented Jul 2, 2020

@richvdh In the interest of other homeserver implementers not making the same mistake, is there any further documentation on what design/implementation decision caused the issue and how it is now being prevented?

@richvdh
Copy link
Member Author

richvdh commented Jul 2, 2020

in short: if you're going to check that you have the state at a particular point in the DAG, you better check that you actually have the state at that point in the DAG.

When you first join a room, you get the state of that room, which might be 1000 events. We don't need to know the state of the room at each of those events - it is sufficient to know that the exist, that they are correctly signed, and what they contain. We therefore store these events as "outliers" in synapse: ie, we have the event itself, but no information about the state at that point in the DAG. We can also get similar outlier events whenever we end up with new gaps in the DAG.

The code that was fixed was incorrectly assuming that if we had an event in the database, we also knew what the state of the room was at that event; if it was an outlier, when we tried to use that state, we'd get an empty state set, and obviously that's going to go badly.

We think it's a security issue because we think it's possible to engineer that situation. We have no reason to believe that the ejections from HQ were deliberate, though.

@joepie91
Copy link

joepie91 commented Jul 2, 2020

I see, thanks!

@sbug-1bit
Copy link

sbug-1bit commented Jul 19, 2020

Today I was once again thrown out of Matrix HQ.
I wounder if this could be the same bug still lurking around or a new one?

My synapse log file: https://1bit.se/files/homeserver.log.1
(Loglevel set to INFO)

At 11:05:47 in the logfile I see an error resolving state for the room matrix hq.
After that time I also no longer got any new events in the room in my client. And after clear cache and reload it was revealed to me that I had been thrown out of the room.

I should mention that I use synapse 1.17.0

I made a new bug report here #7913 in case this is another issue

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
z-bug (Deprecated Label)
Projects
None yet
Development

No branches or pull requests

8 participants