Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Stop getting missing prev_events after we already know their signature is invalid #13816

Merged
Show file tree
Hide file tree
Changes from 20 commits
Commits
Show all changes
40 commits
Select commit Hold shift + click to select a range
924ae2b
Track when the pulled event signature fails
MadLittleMods Sep 14, 2022
d240aeb
Add changelog
MadLittleMods Sep 14, 2022
cfb4e88
Fix reference
MadLittleMods Sep 14, 2022
9b390a3
Stop getting missing prev_events after we already know their signatur…
MadLittleMods Sep 15, 2022
3d8507d
Merge branch 'develop' into madlittlemods/13700-track-when-event-sign…
MadLittleMods Sep 23, 2022
88a75cf
Use callback pattern to record signature failures
MadLittleMods Sep 23, 2022
d29ac0b
Add docstring
MadLittleMods Sep 24, 2022
14e39ee
Record failure from get_pdu and add test
MadLittleMods Sep 24, 2022
7898371
Be more selective about which errors to care about
MadLittleMods Sep 30, 2022
43f1d1a
Merge branch 'develop' into madlittlemods/13700-track-when-event-sign…
MadLittleMods Sep 30, 2022
e32b901
Merge branch 'madlittlemods/13700-track-when-event-signature-fails' i…
MadLittleMods Sep 30, 2022
83feb1b
Merge branch 'develop' into madlittlemods/13700-track-when-event-sign…
MadLittleMods Oct 1, 2022
7d102e8
Merge branch 'develop' into madlittlemods/13700-track-when-event-sign…
MadLittleMods Oct 1, 2022
81410b6
Merge branch 'madlittlemods/13700-track-when-event-signature-fails' i…
MadLittleMods Oct 1, 2022
958fd3b
Merge branch 'develop' into madlittlemods/13622-13700-stop-getting-st…
MadLittleMods Oct 3, 2022
e24db41
Weird name and add tests
MadLittleMods Oct 4, 2022
f853e78
Add changelog
MadLittleMods Oct 4, 2022
43fb6b8
Bail early and fix lints
MadLittleMods Oct 4, 2022
03f23b7
Add integration test with backfill
MadLittleMods Oct 4, 2022
99d3e79
Fix lints and better message
MadLittleMods Oct 4, 2022
f11f5b5
Fix test description
MadLittleMods Oct 4, 2022
6878faa
Fix test descriptions
MadLittleMods Oct 4, 2022
f3b443d
Scratch debug changes
MadLittleMods Oct 5, 2022
d135d41
Stop cascading backoff errors
MadLittleMods Oct 5, 2022
4effca9
Remove scratch changes
MadLittleMods Oct 5, 2022
e3cc054
Use custom FederationPullAttemptBackoffError
MadLittleMods Oct 5, 2022
74f9e03
Fix test to reflect no more cascade
MadLittleMods Oct 5, 2022
0b900e1
Better comments from what I've gathered
MadLittleMods Oct 5, 2022
3cb2826
Clarify which error in comments
MadLittleMods Oct 5, 2022
5f313df
Add some clarification to the test comment
MadLittleMods Oct 5, 2022
02e6bdd
Merge branch 'develop' into madlittlemods/13622-13700-stop-getting-st…
MadLittleMods Oct 12, 2022
7f4fdd2
Not a SynapseError
MadLittleMods Oct 12, 2022
89ffbcb
Make sure usages of _compute_event_context_with_maybe_missing_prevs h…
MadLittleMods Oct 12, 2022
354f682
Remove double "recently"
MadLittleMods Oct 12, 2022
e0b0447
Do the calculation in Python because it's more clear when we need res…
MadLittleMods Oct 12, 2022
e72c4e5
Use built-in select many function
MadLittleMods Oct 12, 2022
4e08039
No need for txn
MadLittleMods Oct 12, 2022
b060652
Rename function to reflect functionality
MadLittleMods Oct 12, 2022
b1a0c1b
Fix test description to make it accurate
MadLittleMods Oct 12, 2022
bccd802
Use more standard string interpolation with `logger`
MadLittleMods Oct 15, 2022
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions changelog.d/13816.feature
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
Stop fetching missing `prev_events` after we already know their signature is invalid.
19 changes: 19 additions & 0 deletions synapse/handlers/federation_event.py
Original file line number Diff line number Diff line change
Expand Up @@ -906,6 +906,11 @@ async def _process_pulled_event(

if e.code == 403:
logger.warning("Pulled event %s failed history check.", event_id)
elif e.code == 429:
logger.warning(
"Not attempting to pull event %s that we already tried to pull recently (backing off).",
event_id,
)
Copy link
Contributor Author

@MadLittleMods MadLittleMods Oct 5, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we shouldn't record a failure when it's a 429 when we're backing off fetching one of this events prev_events. Not being able to fetch the prev_events for this eventA, doesn't mean we won't be able to fetch eventA as a prev_event for another downstream event (is this actually true? is there anything special we can do with state/auth events).

Otherwise, this currently causes a cascade of failures through the whole DAG.

First one fails with ERROR 403: We can't get valid state history. then each subsequent event in the DAG records a pull attempt failure as ERROR 429: While computing context for event=$tPGVn9ZpXEPcWznLECrIL0HDPxqlBz1RFB8mZ3YKmGg, not attempting to pull missing prev_event=$Dwi6IsKNmb9Zip89IuoDlZ6Wule6ke7YtavutZQ2o78 because we already tried to pull recently (backing off)., and so on.

Example cascade from logs/server-0/client_reader.log running Sytest tests/48admin.pl "Can backfill purged history"
2022-10-05 08:38:33,218 - synapse.http.matrixfederationclient - 308 - INFO - GET-4 - {GET-O-1} [localhost:8879] Completed request: 200 OK in 0.02 secs, got 7821 bytes - GET matrix://localhost:8879/_matrix/federation/v1/backfill/%21NUgUcBPsJkCbellfTQ%3Alocalhost%3A8839?v=%24Gos2iAJL9b7Nh7z2Uk9QUpsWmelgfIk-HFIU6pOFiXk&limit=100
2022-10-05 08:38:33,230 - synapse.handlers.federation_event - 826 - INFO - GET-4-$Dwi6IsKNmb9Zip89IuoDlZ6Wule6ke7YtavutZQ2o78 - Processing pulled event <FrozenEventV3 event_id=$Dwi6IsKNmb9Zip89IuoDlZ6Wule6ke7YtavutZQ2o78, type=m.room.member, state_key=@anon-20221005_083744-7:localhost:8879, outlier=False>
2022-10-05 08:38:33,234 - synapse.handlers.federation_event - 859 - INFO - GET-4-$Dwi6IsKNmb9Zip89IuoDlZ6Wule6ke7YtavutZQ2o78 - De-outliering event $Dwi6IsKNmb9Zip89IuoDlZ6Wule6ke7YtavutZQ2o78
2022-10-05 08:38:33,243 - synapse.handlers.federation_event - 973 - INFO - GET-4-$Dwi6IsKNmb9Zip89IuoDlZ6Wule6ke7YtavutZQ2o78 - _compute_event_context_with_maybe_missing_prevs(event=$Dwi6IsKNmb9Zip89IuoDlZ6Wule6ke7YtavutZQ2o78) prevs_to_ignore=[]
2022-10-05 08:38:33,243 - synapse.handlers.federation_event - 990 - INFO - GET-4-$Dwi6IsKNmb9Zip89IuoDlZ6Wule6ke7YtavutZQ2o78 - Event $Dwi6IsKNmb9Zip89IuoDlZ6Wule6ke7YtavutZQ2o78 is missing prev_events ['$XCQvvErHXkLvYbzgEgKI-MTDN6HsFbFzbAcD5yX9U44']: calculating state for a backwards extremity
2022-10-05 08:38:33,244 - synapse.handlers.federation_event - 1016 - INFO - GET-4-$Dwi6IsKNmb9Zip89IuoDlZ6Wule6ke7YtavutZQ2o78 - Requesting state after missing prev_event $XCQvvErHXkLvYbzgEgKI-MTDN6HsFbFzbAcD5yX9U44
2022-10-05 08:38:33,257 - synapse.http.matrixfederationclient - 634 - INFO - GET-4-$Dwi6IsKNmb9Zip89IuoDlZ6Wule6ke7YtavutZQ2o78-$XCQvvErHXkLvYbzgEgKI-MTDN6HsFbFzbAcD5yX9U44 - {GET-O-2} [localhost:8879] Got response headers: 404 Not Found
2022-10-05 08:38:33,257 - synapse.http.matrixfederationclient - 715 - WARNING - GET-4-$Dwi6IsKNmb9Zip89IuoDlZ6Wule6ke7YtavutZQ2o78-$XCQvvErHXkLvYbzgEgKI-MTDN6HsFbFzbAcD5yX9U44 - {GET-O-2} [localhost:8879] Request failed: GET matrix://localhost:8879/_matrix/federation/v1/state_ids/%21NUgUcBPsJkCbellfTQ%3Alocalhost%3A8839?event_id=%24XCQvvErHXkLvYbzgEgKI-MTDN6HsFbFzbAcD5yX9U44: HttpResponseException('404: Not Found')
2022-10-05 08:38:33,258 - synapse.handlers.federation_event - 1042 - WARNING - GET-4-$Dwi6IsKNmb9Zip89IuoDlZ6Wule6ke7YtavutZQ2o78 - Error attempting to resolve state at missing prev_events
Traceback (most recent call last):
  File "/synapse/synapse/handlers/federation_event.py", line 1024, in _compute_event_context_with_maybe_missing_prevs
    dest, room_id, p
  File "/synapse/synapse/logging/opentracing.py", line 889, in _wrapper
    return await func(*args, **kwargs)  # type: ignore[misc]
  File "/synapse/synapse/logging/opentracing.py", line 889, in _wrapper
    return await func(*args, **kwargs)  # type: ignore[misc]
  File "/synapse/synapse/handlers/federation_event.py", line 1088, in _get_state_ids_after_missing_prev_event
    destination, room_id, event_id=event_id
  File "/synapse/synapse/logging/opentracing.py", line 889, in _wrapper
    return await func(*args, **kwargs)  # type: ignore[misc]
  File "/synapse/synapse/logging/opentracing.py", line 889, in _wrapper
    return await func(*args, **kwargs)  # type: ignore[misc]
  File "/synapse/synapse/federation/federation_client.py", line 479, in get_room_state_ids
    destination, room_id, event_id=event_id
  File "/synapse/synapse/federation/transport/client.py", line 81, in get_room_state_ids
    try_trailing_slash_on_400=True,
  File "/synapse/synapse/http/matrixfederationclient.py", line 1084, in get_json
    timeout=timeout,
  File "/synapse/synapse/http/matrixfederationclient.py", line 406, in _send_request_with_optional_trailing_slash
    response = await self._send_request(request, **send_request_args)
  File "/synapse/synapse/http/matrixfederationclient.py", line 668, in _send_request
    raise exc
synapse.api.errors.HttpResponseException: 404: Not Found
2022-10-05 08:38:33,259 - synapse.storage.databases.main.event_federation - 1508 - INFO - GET-4-$Dwi6IsKNmb9Zip89IuoDlZ6Wule6ke7YtavutZQ2o78 - record_event_failed_pull_attempt room_id=!NUgUcBPsJkCbellfTQ:localhost:8839, event_id=$Dwi6IsKNmb9Zip89IuoDlZ6Wule6ke7YtavutZQ2o78, cause=ERROR 403: We can't get valid state history.
2022-10-05 08:38:33,262 - synapse.handlers.federation_event - 908 - WARNING - GET-4-$Dwi6IsKNmb9Zip89IuoDlZ6Wule6ke7YtavutZQ2o78 - Pulled event $Dwi6IsKNmb9Zip89IuoDlZ6Wule6ke7YtavutZQ2o78 failed history check.
2022-10-05 08:38:33,262 - synapse.handlers.federation_event - 826 - INFO - GET-4-$tPGVn9ZpXEPcWznLECrIL0HDPxqlBz1RFB8mZ3YKmGg - Processing pulled event <FrozenEventV3 event_id=$tPGVn9ZpXEPcWznLECrIL0HDPxqlBz1RFB8mZ3YKmGg, type=m.room.name, state_key=, outlier=False>
2022-10-05 08:38:33,265 - synapse.handlers.federation_event - 859 - INFO - GET-4-$tPGVn9ZpXEPcWznLECrIL0HDPxqlBz1RFB8mZ3YKmGg - De-outliering event $tPGVn9ZpXEPcWznLECrIL0HDPxqlBz1RFB8mZ3YKmGg
2022-10-05 08:38:33,269 - synapse.handlers.federation_event - 973 - INFO - GET-4-$tPGVn9ZpXEPcWznLECrIL0HDPxqlBz1RFB8mZ3YKmGg - _compute_event_context_with_maybe_missing_prevs(event=$tPGVn9ZpXEPcWznLECrIL0HDPxqlBz1RFB8mZ3YKmGg) prevs_to_ignore=['$Dwi6IsKNmb9Zip89IuoDlZ6Wule6ke7YtavutZQ2o78']
2022-10-05 08:38:33,269 - synapse.storage.databases.main.event_federation - 1508 - INFO - GET-4-$tPGVn9ZpXEPcWznLECrIL0HDPxqlBz1RFB8mZ3YKmGg - record_event_failed_pull_attempt room_id=!NUgUcBPsJkCbellfTQ:localhost:8839, event_id=$tPGVn9ZpXEPcWznLECrIL0HDPxqlBz1RFB8mZ3YKmGg, cause=ERROR 429: While computing context for event=$tPGVn9ZpXEPcWznLECrIL0HDPxqlBz1RFB8mZ3YKmGg, not attempting to pull missing prev_event=$Dwi6IsKNmb9Zip89IuoDlZ6Wule6ke7YtavutZQ2o78 because we already tried to pull recently (backing off).
2022-10-05 08:38:33,271 - synapse.handlers.federation_event - 913 - WARNING - GET-4-$tPGVn9ZpXEPcWznLECrIL0HDPxqlBz1RFB8mZ3YKmGg - Not attempting to pull event=$tPGVn9ZpXEPcWznLECrIL0HDPxqlBz1RFB8mZ3YKmGg because of affected=$Dwi6IsKNmb9Zip89IuoDlZ6Wule6ke7YtavutZQ2o78 that we already tried to pull recently (backing off).
2022-10-05 08:38:33,271 - synapse.handlers.federation_event - 826 - INFO - GET-4-$XghVHINVOPDZ-QyfzjGZECDMlKBFjHvllhIR7kqemT8 - Processing pulled event <FrozenEventV3 event_id=$XghVHINVOPDZ-QyfzjGZECDMlKBFjHvllhIR7kqemT8, type=m.room.message, state_key=None, outlier=False>
2022-10-05 08:38:33,277 - synapse.handlers.federation_event - 973 - INFO - GET-4-$XghVHINVOPDZ-QyfzjGZECDMlKBFjHvllhIR7kqemT8 - _compute_event_context_with_maybe_missing_prevs(event=$XghVHINVOPDZ-QyfzjGZECDMlKBFjHvllhIR7kqemT8) prevs_to_ignore=['$tPGVn9ZpXEPcWznLECrIL0HDPxqlBz1RFB8mZ3YKmGg']
2022-10-05 08:38:33,277 - synapse.storage.databases.main.event_federation - 1508 - INFO - GET-4-$XghVHINVOPDZ-QyfzjGZECDMlKBFjHvllhIR7kqemT8 - record_event_failed_pull_attempt room_id=!NUgUcBPsJkCbellfTQ:localhost:8839, event_id=$XghVHINVOPDZ-QyfzjGZECDMlKBFjHvllhIR7kqemT8, cause=ERROR 429: While computing context for event=$XghVHINVOPDZ-QyfzjGZECDMlKBFjHvllhIR7kqemT8, not attempting to pull missing prev_event=$tPGVn9ZpXEPcWznLECrIL0HDPxqlBz1RFB8mZ3YKmGg because we already tried to pull recently (backing off).
2022-10-05 08:38:33,279 - synapse.handlers.federation_event - 913 - WARNING - GET-4-$XghVHINVOPDZ-QyfzjGZECDMlKBFjHvllhIR7kqemT8 - Not attempting to pull event=$XghVHINVOPDZ-QyfzjGZECDMlKBFjHvllhIR7kqemT8 because of affected=$tPGVn9ZpXEPcWznLECrIL0HDPxqlBz1RFB8mZ3YKmGg that we already tried to pull recently (backing off).
2022-10-05 08:38:33,279 - synapse.handlers.federation_event - 826 - INFO - GET-4-$_uS1OEZB1hMQlDnKQOWLla1tNDoLC4OjlZtpV4ke02U - Processing pulled event <FrozenEventV3 event_id=$_uS1OEZB1hMQlDnKQOWLla1tNDoLC4OjlZtpV4ke02U, type=m.room.message, state_key=None, outlier=False>
2022-10-05 08:38:33,285 - synapse.handlers.federation_event - 973 - INFO - GET-4-$_uS1OEZB1hMQlDnKQOWLla1tNDoLC4OjlZtpV4ke02U - _compute_event_context_with_maybe_missing_prevs(event=$_uS1OEZB1hMQlDnKQOWLla1tNDoLC4OjlZtpV4ke02U) prevs_to_ignore=['$XghVHINVOPDZ-QyfzjGZECDMlKBFjHvllhIR7kqemT8']
2022-10-05 08:38:33,285 - synapse.storage.databases.main.event_federation - 1508 - INFO - GET-4-$_uS1OEZB1hMQlDnKQOWLla1tNDoLC4OjlZtpV4ke02U - record_event_failed_pull_attempt room_id=!NUgUcBPsJkCbellfTQ:localhost:8839, event_id=$_uS1OEZB1hMQlDnKQOWLla1tNDoLC4OjlZtpV4ke02U, cause=ERROR 429: While computing context for event=$_uS1OEZB1hMQlDnKQOWLla1tNDoLC4OjlZtpV4ke02U, not attempting to pull missing prev_event=$XghVHINVOPDZ-QyfzjGZECDMlKBFjHvllhIR7kqemT8 because we already tried to pull recently (backing off).
2022-10-05 08:38:33,287 - synapse.handlers.federation_event - 913 - WARNING - GET-4-$_uS1OEZB1hMQlDnKQOWLla1tNDoLC4OjlZtpV4ke02U - Not attempting to pull event=$_uS1OEZB1hMQlDnKQOWLla1tNDoLC4OjlZtpV4ke02U because of affected=$XghVHINVOPDZ-QyfzjGZECDMlKBFjHvllhIR7kqemT8 that we already tried to pull recently (backing off).
2022-10-05 08:38:33,287 - synapse.handlers.federation_event - 826 - INFO - GET-4-$0xaQzHqGsK0unAXRX6b9U6nkNIKTTq_NXrYi7CCX6Vk - Processing pulled event <FrozenEventV3 event_id=$0xaQzHqGsK0unAXRX6b9U6nkNIKTTq_NXrYi7CCX6Vk, type=m.room.message, state_key=None, outlier=False>
2022-10-05 08:38:33,292 - synapse.handlers.federation_event - 973 - INFO - GET-4-$0xaQzHqGsK0unAXRX6b9U6nkNIKTTq_NXrYi7CCX6Vk - _compute_event_context_with_maybe_missing_prevs(event=$0xaQzHqGsK0unAXRX6b9U6nkNIKTTq_NXrYi7CCX6Vk) prevs_to_ignore=['$_uS1OEZB1hMQlDnKQOWLla1tNDoLC4OjlZtpV4ke02U']
2022-10-05 08:38:33,292 - synapse.storage.databases.main.event_federation - 1508 - INFO - GET-4-$0xaQzHqGsK0unAXRX6b9U6nkNIKTTq_NXrYi7CCX6Vk - record_event_failed_pull_attempt room_id=!NUgUcBPsJkCbellfTQ:localhost:8839, event_id=$0xaQzHqGsK0unAXRX6b9U6nkNIKTTq_NXrYi7CCX6Vk, cause=ERROR 429: While computing context for event=$0xaQzHqGsK0unAXRX6b9U6nkNIKTTq_NXrYi7CCX6Vk, not attempting to pull missing prev_event=$_uS1OEZB1hMQlDnKQOWLla1tNDoLC4OjlZtpV4ke02U because we already tried to pull recently (backing off).
2022-10-05 08:38:33,293 - synapse.handlers.federation_event - 913 - WARNING - GET-4-$0xaQzHqGsK0unAXRX6b9U6nkNIKTTq_NXrYi7CCX6Vk - Not attempting to pull event=$0xaQzHqGsK0unAXRX6b9U6nkNIKTTq_NXrYi7CCX6Vk because of affected=$_uS1OEZB1hMQlDnKQOWLla1tNDoLC4OjlZtpV4ke02U that we already tried to pull recently (backing off).
2022-10-05 08:38:33,294 - synapse.handlers.federation_event - 826 - INFO - GET-4-$MCjcVp__YCCorEi3FjoMa-k9WMIu71NdXqoc8P2862U - Processing pulled event <FrozenEventV3 event_id=$MCjcVp__YCCorEi3FjoMa-k9WMIu71NdXqoc8P2862U, type=m.room.message, state_key=None, outlier=False>
2022-10-05 08:38:33,298 - synapse.handlers.federation_event - 973 - INFO - GET-4-$MCjcVp__YCCorEi3FjoMa-k9WMIu71NdXqoc8P2862U - _compute_event_context_with_maybe_missing_prevs(event=$MCjcVp__YCCorEi3FjoMa-k9WMIu71NdXqoc8P2862U) prevs_to_ignore=['$0xaQzHqGsK0unAXRX6b9U6nkNIKTTq_NXrYi7CCX6Vk']
2022-10-05 08:38:33,298 - synapse.storage.databases.main.event_federation - 1508 - INFO - GET-4-$MCjcVp__YCCorEi3FjoMa-k9WMIu71NdXqoc8P2862U - record_event_failed_pull_attempt room_id=!NUgUcBPsJkCbellfTQ:localhost:8839, event_id=$MCjcVp__YCCorEi3FjoMa-k9WMIu71NdXqoc8P2862U, cause=ERROR 429: While computing context for event=$MCjcVp__YCCorEi3FjoMa-k9WMIu71NdXqoc8P2862U, not attempting to pull missing prev_event=$0xaQzHqGsK0unAXRX6b9U6nkNIKTTq_NXrYi7CCX6Vk because we already tried to pull recently (backing off).
2022-10-05 08:38:33,299 - synapse.handlers.federation_event - 913 - WARNING - GET-4-$MCjcVp__YCCorEi3FjoMa-k9WMIu71NdXqoc8P2862U - Not attempting to pull event=$MCjcVp__YCCorEi3FjoMa-k9WMIu71NdXqoc8P2862U because of affected=$0xaQzHqGsK0unAXRX6b9U6nkNIKTTq_NXrYi7CCX6Vk that we already tried to pull recently (backing off).
2022-10-05 08:38:33,299 - synapse.handlers.federation_event - 826 - INFO - GET-4-$xtZzN_ER45gozEm62m1ChGyNoaiWaNDNVY_r9iMBiqw - Processing pulled event <FrozenEventV3 event_id=$xtZzN_ER45gozEm62m1ChGyNoaiWaNDNVY_r9iMBiqw, type=m.room.message, state_key=None, outlier=False>
2022-10-05 08:38:33,304 - synapse.handlers.federation_event - 973 - INFO - GET-4-$xtZzN_ER45gozEm62m1ChGyNoaiWaNDNVY_r9iMBiqw - _compute_event_context_with_maybe_missing_prevs(event=$xtZzN_ER45gozEm62m1ChGyNoaiWaNDNVY_r9iMBiqw) prevs_to_ignore=['$MCjcVp__YCCorEi3FjoMa-k9WMIu71NdXqoc8P2862U']
2022-10-05 08:38:33,304 - synapse.storage.databases.main.event_federation - 1508 - INFO - GET-4-$xtZzN_ER45gozEm62m1ChGyNoaiWaNDNVY_r9iMBiqw - record_event_failed_pull_attempt room_id=!NUgUcBPsJkCbellfTQ:localhost:8839, event_id=$xtZzN_ER45gozEm62m1ChGyNoaiWaNDNVY_r9iMBiqw, cause=ERROR 429: While computing context for event=$xtZzN_ER45gozEm62m1ChGyNoaiWaNDNVY_r9iMBiqw, not attempting to pull missing prev_event=$MCjcVp__YCCorEi3FjoMa-k9WMIu71NdXqoc8P2862U because we already tried to pull recently (backing off).
2022-10-05 08:38:33,305 - synapse.handlers.federation_event - 913 - WARNING - GET-4-$xtZzN_ER45gozEm62m1ChGyNoaiWaNDNVY_r9iMBiqw - Not attempting to pull event=$xtZzN_ER45gozEm62m1ChGyNoaiWaNDNVY_r9iMBiqw because of affected=$MCjcVp__YCCorEi3FjoMa-k9WMIu71NdXqoc8P2862U that we already tried to pull recently (backing off).
2022-10-05 08:38:33,305 - synapse.handlers.federation_event - 826 - INFO - GET-4-$HU9oV1TTzArEoqIUSl52TR58msts3A0xFxW1p5gBXVc - Processing pulled event <FrozenEventV3 event_id=$HU9oV1TTzArEoqIUSl52TR58msts3A0xFxW1p5gBXVc, type=m.room.message, state_key=None, outlier=False>
2022-10-05 08:38:33,310 - synapse.handlers.federation_event - 973 - INFO - GET-4-$HU9oV1TTzArEoqIUSl52TR58msts3A0xFxW1p5gBXVc - _compute_event_context_with_maybe_missing_prevs(event=$HU9oV1TTzArEoqIUSl52TR58msts3A0xFxW1p5gBXVc) prevs_to_ignore=['$xtZzN_ER45gozEm62m1ChGyNoaiWaNDNVY_r9iMBiqw']
2022-10-05 08:38:33,310 - synapse.storage.databases.main.event_federation - 1508 - INFO - GET-4-$HU9oV1TTzArEoqIUSl52TR58msts3A0xFxW1p5gBXVc - record_event_failed_pull_attempt room_id=!NUgUcBPsJkCbellfTQ:localhost:8839, event_id=$HU9oV1TTzArEoqIUSl52TR58msts3A0xFxW1p5gBXVc, cause=ERROR 429: While computing context for event=$HU9oV1TTzArEoqIUSl52TR58msts3A0xFxW1p5gBXVc, not attempting to pull missing prev_event=$xtZzN_ER45gozEm62m1ChGyNoaiWaNDNVY_r9iMBiqw because we already tried to pull recently (backing off).
2022-10-05 08:38:33,312 - synapse.handlers.federation_event - 913 - WARNING - GET-4-$HU9oV1TTzArEoqIUSl52TR58msts3A0xFxW1p5gBXVc - Not attempting to pull event=$HU9oV1TTzArEoqIUSl52TR58msts3A0xFxW1p5gBXVc because of affected=$xtZzN_ER45gozEm62m1ChGyNoaiWaNDNVY_r9iMBiqw that we already tried to pull recently (backing off).
2022-10-05 08:38:33,312 - synapse.handlers.federation_event - 826 - INFO - GET-4-$guHvD5lD3sZEQZLcu48N1m2qEjJ5VFVxVxH3U66RSJU - Processing pulled event <FrozenEventV3 event_id=$guHvD5lD3sZEQZLcu48N1m2qEjJ5VFVxVxH3U66RSJU, type=m.room.message, state_key=None, outlier=False>
2022-10-05 08:38:33,317 - synapse.handlers.federation_event - 973 - INFO - GET-4-$guHvD5lD3sZEQZLcu48N1m2qEjJ5VFVxVxH3U66RSJU - _compute_event_context_with_maybe_missing_prevs(event=$guHvD5lD3sZEQZLcu48N1m2qEjJ5VFVxVxH3U66RSJU) prevs_to_ignore=['$HU9oV1TTzArEoqIUSl52TR58msts3A0xFxW1p5gBXVc']
2022-10-05 08:38:33,318 - synapse.storage.databases.main.event_federation - 1508 - INFO - GET-4-$guHvD5lD3sZEQZLcu48N1m2qEjJ5VFVxVxH3U66RSJU - record_event_failed_pull_attempt room_id=!NUgUcBPsJkCbellfTQ:localhost:8839, event_id=$guHvD5lD3sZEQZLcu48N1m2qEjJ5VFVxVxH3U66RSJU, cause=ERROR 429: While computing context for event=$guHvD5lD3sZEQZLcu48N1m2qEjJ5VFVxVxH3U66RSJU, not attempting to pull missing prev_event=$HU9oV1TTzArEoqIUSl52TR58msts3A0xFxW1p5gBXVc because we already tried to pull recently (backing off).
2022-10-05 08:38:33,319 - synapse.handlers.federation_event - 913 - WARNING - GET-4-$guHvD5lD3sZEQZLcu48N1m2qEjJ5VFVxVxH3U66RSJU - Not attempting to pull event=$guHvD5lD3sZEQZLcu48N1m2qEjJ5VFVxVxH3U66RSJU because of affected=$HU9oV1TTzArEoqIUSl52TR58msts3A0xFxW1p5gBXVc that we already tried to pull recently (backing off).
2022-10-05 08:38:33,320 - synapse.handlers.federation_event - 826 - INFO - GET-4-$v3sNtiMJyAqNg2c3eCyic7RklAhVQkyqYi_eHQvP1RM - Processing pulled event <FrozenEventV3 event_id=$v3sNtiMJyAqNg2c3eCyic7RklAhVQkyqYi_eHQvP1RM, type=m.room.message, state_key=None, outlier=False>
2022-10-05 08:38:33,325 - synapse.handlers.federation_event - 973 - INFO - GET-4-$v3sNtiMJyAqNg2c3eCyic7RklAhVQkyqYi_eHQvP1RM - _compute_event_context_with_maybe_missing_prevs(event=$v3sNtiMJyAqNg2c3eCyic7RklAhVQkyqYi_eHQvP1RM) prevs_to_ignore=['$guHvD5lD3sZEQZLcu48N1m2qEjJ5VFVxVxH3U66RSJU']
2022-10-05 08:38:33,325 - synapse.storage.databases.main.event_federation - 1508 - INFO - GET-4-$v3sNtiMJyAqNg2c3eCyic7RklAhVQkyqYi_eHQvP1RM - record_event_failed_pull_attempt room_id=!NUgUcBPsJkCbellfTQ:localhost:8839, event_id=$v3sNtiMJyAqNg2c3eCyic7RklAhVQkyqYi_eHQvP1RM, cause=ERROR 429: While computing context for event=$v3sNtiMJyAqNg2c3eCyic7RklAhVQkyqYi_eHQvP1RM, not attempting to pull missing prev_event=$guHvD5lD3sZEQZLcu48N1m2qEjJ5VFVxVxH3U66RSJU because we already tried to pull recently (backing off).
2022-10-05 08:38:33,326 - synapse.handlers.federation_event - 913 - WARNING - GET-4-$v3sNtiMJyAqNg2c3eCyic7RklAhVQkyqYi_eHQvP1RM - Not attempting to pull event=$v3sNtiMJyAqNg2c3eCyic7RklAhVQkyqYi_eHQvP1RM because of affected=$guHvD5lD3sZEQZLcu48N1m2qEjJ5VFVxVxH3U66RSJU that we already tried to pull recently (backing off).
2022-10-05 08:38:33,327 - synapse.handlers.federation_event - 826 - INFO - GET-4-$Gos2iAJL9b7Nh7z2Uk9QUpsWmelgfIk-HFIU6pOFiXk - Processing pulled event <FrozenEventV3 event_id=$Gos2iAJL9b7Nh7z2Uk9QUpsWmelgfIk-HFIU6pOFiXk, type=m.room.message, state_key=None, outlier=False>
2022-10-05 08:38:33,332 - synapse.handlers.federation_event - 973 - INFO - GET-4-$Gos2iAJL9b7Nh7z2Uk9QUpsWmelgfIk-HFIU6pOFiXk - _compute_event_context_with_maybe_missing_prevs(event=$Gos2iAJL9b7Nh7z2Uk9QUpsWmelgfIk-HFIU6pOFiXk) prevs_to_ignore=['$v3sNtiMJyAqNg2c3eCyic7RklAhVQkyqYi_eHQvP1RM']
2022-10-05 08:38:33,332 - synapse.storage.databases.main.event_federation - 1508 - INFO - GET-4-$Gos2iAJL9b7Nh7z2Uk9QUpsWmelgfIk-HFIU6pOFiXk - record_event_failed_pull_attempt room_id=!NUgUcBPsJkCbellfTQ:localhost:8839, event_id=$Gos2iAJL9b7Nh7z2Uk9QUpsWmelgfIk-HFIU6pOFiXk, cause=ERROR 429: While computing context for event=$Gos2iAJL9b7Nh7z2Uk9QUpsWmelgfIk-HFIU6pOFiXk, not attempting to pull missing prev_event=$v3sNtiMJyAqNg2c3eCyic7RklAhVQkyqYi_eHQvP1RM because we already tried to pull recently (backing off).
2022-10-05 08:38:33,333 - synapse.handlers.federation_event - 913 - WARNING - GET-4-$Gos2iAJL9b7Nh7z2Uk9QUpsWmelgfIk-HFIU6pOFiXk - Not attempting to pull event=$Gos2iAJL9b7Nh7z2Uk9QUpsWmelgfIk-HFIU6pOFiXk because of affected=$v3sNtiMJyAqNg2c3eCyic7RklAhVQkyqYi_eHQvP1RM that we already tried to pull recently (backing off).

Copy link
Contributor Author

@MadLittleMods MadLittleMods Oct 5, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR has been updated not to cascade backoff errors ⏩


Not being able to fetch the prev_events for this eventA, doesn't mean we won't be able to fetch eventA as a prev_event for another downstream event (is this actually true? is there anything special we can do with state/auth events).

Reviewer: Please give this an extra set of eyes 👀

Basically, if we can't pull a prev_event for an event, we can't de-outlier it (see outlier definition).

But we can still use an outlier in the state/auth chain for another event. So we shouldn't stop a downstream event from trying to pull it.

For example with these two scenarios, if we process the bolded event and are unable to pull the event before it, can we still use the bolded event when processing the event after?

  1. invite <- join <- message
  2. message1 <- message2 <- message3

else:
raise

Expand Down Expand Up @@ -955,6 +960,20 @@ async def _compute_event_context_with_maybe_missing_prevs(
seen = await self._store.have_events_in_timeline(prevs)
missing_prevs = prevs - seen

# If we've already recently attempted to pull this missing event recently, don't
# try it again so soon. Since we have to fetch all of the prev_events, we can
# bail early here if we find any to ignore.
prevs_to_ignore = await self._store.filter_event_ids_with_pull_attempt_backoff(
room_id, missing_prevs
)
erikjohnston marked this conversation as resolved.
Show resolved Hide resolved
if len(prevs_to_ignore) > 0:
raise FederationError(
"ERROR",
429,
"Not attempting to pull event that we already tried to pull recently (backing off).",
affected=prevs_to_ignore[0],
)
MadLittleMods marked this conversation as resolved.
Show resolved Hide resolved
MadLittleMods marked this conversation as resolved.
Show resolved Hide resolved

if not missing_prevs:
return await self._state_handler.compute_event_context(event)

Expand Down
72 changes: 72 additions & 0 deletions synapse/storage/databases/main/event_federation.py
Original file line number Diff line number Diff line change
Expand Up @@ -1530,6 +1530,78 @@ def _record_event_failed_pull_attempt_upsert_txn(

txn.execute(sql, (room_id, event_id, 1, self._clock.time_msec(), cause))

@trace
async def filter_event_ids_with_pull_attempt_backoff(
MadLittleMods marked this conversation as resolved.
Show resolved Hide resolved
MadLittleMods marked this conversation as resolved.
Show resolved Hide resolved
self,
room_id: str,
event_ids: Collection[str],
) -> List[str]:
"""
Filter down the events to ones that we've failed to pull before recently. Uses
exponential backoff.

Args:
room_id: The room that the events belong to
event_ids: A list of events to filter down

Returns:
List of event_ids that should not be attempted to be pulled
"""

def _filter_event_ids_with_pull_attempt_backoff_txn(
txn: LoggingTransaction,
) -> List[str]:
where_event_ids_match_clause, values = make_in_list_sql_clause(
txn.database_engine, "event_id", event_ids
)

if isinstance(self.database_engine, PostgresEngine):
least_function = "least"
elif isinstance(self.database_engine, Sqlite3Engine):
least_function = "min"
else:
raise RuntimeError("Unknown database engine")

sql = f"""
SELECT event_id FROM event_failed_pull_attempts
WHERE
room_id = ?
AND {where_event_ids_match_clause}
/**
* Exponential back-off (up to the upper bound) so we don't try to
* pull the same event over and over. ex. 2hr, 4hr, 8hr, 16hr, etc.
*
* We use `1 << n` as a power of 2 equivalent for compatibility
* with older SQLites. The left shift equivalent only works with
* powers of 2 because left shift is a binary operation (base-2).
* Otherwise, we would use `power(2, n)` or the power operator, `2^n`.
*/
AND ? /* current_time */ < last_attempt_ts + (
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It occurs to me that these inline SQL comments get sent over the wire to postgres, which is probably not totally ideal. Might be worth moving the big ones out?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, TBH I'm wondering whether for this it would be clearer to just fetch event_id, last_attempt_ts, num_attempts and do the checks in python?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, TBH I'm wondering whether for this it would be clearer to just fetch event_id, last_attempt_ts, num_attempts and do the checks in python?

I think this makes a lot of sense here since we have to get a response out for each event_ids anyway 👍



It occurs to me that these inline SQL comments get sent over the wire to postgres, which is probably not totally ideal. Might be worth moving the big ones out?

In terms of other places to worry about this: we already strip out the newlines which is why I use /* */ instead of inline --. Is there a way we can strip the comments out so we don't lose the nice readability? The performance of find/replacing this every time doesn't sound great but our _make_sql_one_line doesn't look particularly optimized by eye either 🤔. I'm not sure why we started doing that (introduced in #1939), maybe to get rid of all of the extra indentation spaces in multi-line strings.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I was thinking about that too, though we'd want to make sure that we do it safely (I'm not sure if just stripping /* ... */ is safe by itself).

Note that we don't actually send the one-line SQL to the database, we use it for logging.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Created #14192 to continue this comments in SQL conversation ⏩

Note that we don't actually send the one-line SQL to the database, we use it for logging.

huh, I swore this was a problem before. Good to know!

(1 << {least_function}(num_attempts, ? /* max doubling steps */))
* ? /* step */
)
"""

txn.execute(
sql,
(
room_id,
*values,
self._clock.time_msec(),
BACKFILL_EVENT_EXPONENTIAL_BACKOFF_MAXIMUM_DOUBLING_STEPS,
BACKFILL_EVENT_EXPONENTIAL_BACKOFF_STEP_MILLISECONDS,
),
)

event_ids_to_ignore_result = cast(List[Tuple[str]], txn.fetchall())

return [event_id for event_id, in event_ids_to_ignore_result]

return await self.db_pool.runInteraction(
"filter_event_ids_with_pull_attempt_backoff_txn",
_filter_event_ids_with_pull_attempt_backoff_txn,
)

async def get_missing_events(
self,
room_id: str,
Expand Down
195 changes: 194 additions & 1 deletion tests/handlers/test_federation_event.py
Original file line number Diff line number Diff line change
Expand Up @@ -43,7 +43,7 @@ class FederationEventHandlerTests(unittest.FederatingHomeserverTestCase):
def make_homeserver(self, reactor, clock):
# mock out the federation transport client
self.mock_federation_transport_client = mock.Mock(
spec=["get_room_state_ids", "get_room_state", "get_event"]
spec=["get_room_state_ids", "get_room_state", "get_event", "backfill"]
)
return super().setup_test_homeserver(
federation_transport_client=self.mock_federation_transport_client
Expand Down Expand Up @@ -459,6 +459,199 @@ def test_process_pulled_event_clears_backfill_attempts_after_being_successfully_
)
self.assertIsNotNone(persisted, "pulled event was not persisted at all")

def test_backfill_signature_failure_does_not_fetch_same_prev_event_later(
self,
) -> None:
"""
Test to make sure we backoff and don't try to fetch a missing prev_event when we
already know it has a invalid signature from checking the signatures of all of
the events in the backfill response.
"""
OTHER_USER = f"@user:{self.OTHER_SERVER_NAME}"
main_store = self.hs.get_datastores().main

# Create the room
user_id = self.register_user("kermit", "test")
tok = self.login("kermit", "test")
room_id = self.helper.create_room_as(room_creator=user_id, tok=tok)
room_version = self.get_success(main_store.get_room_version(room_id))

# Allow the remote user to send state events
self.helper.send_state(
room_id,
"m.room.power_levels",
{"events_default": 0, "state_default": 0},
tok=tok,
)

# Add the remote user to the room
member_event = self.get_success(
event_injection.inject_member_event(self.hs, room_id, OTHER_USER, "join")
)

initial_state_map = self.get_success(
main_store.get_partial_current_state_ids(room_id)
)

auth_event_ids = [
initial_state_map[("m.room.create", "")],
initial_state_map[("m.room.power_levels", "")],
member_event.event_id,
]

# We purposely don't run `add_hashes_and_signatures_from_other_server`
# over this because we want the signature check to fail.
pulled_event_without_signatures = make_event_from_dict(
{
"type": "test_regular_type",
"room_id": room_id,
"sender": OTHER_USER,
"prev_events": [member_event.event_id],
"auth_events": auth_event_ids,
"origin_server_ts": 1,
"depth": 12,
"content": {"body": "pulled_event_without_signatures"},
},
room_version,
)

# Create a regular event that should pass except for the
# `pulled_event_without_signatures` in the `prev_event`.
pulled_event = make_event_from_dict(
self.add_hashes_and_signatures_from_other_server(
{
"type": "test_regular_type",
"room_id": room_id,
"sender": OTHER_USER,
"prev_events": [
member_event.event_id,
pulled_event_without_signatures.event_id,
],
"auth_events": auth_event_ids,
"origin_server_ts": 1,
"depth": 12,
"content": {"body": "pulled_event"},
}
),
room_version,
)

# We expect an outbound request to /backfill, so stub that out
self.mock_federation_transport_client.backfill.return_value = make_awaitable(
{
"origin": self.OTHER_SERVER_NAME,
"origin_server_ts": 123,
"pdus": [
# This is one of the important aspects of this test: we include
# `pulled_event_without_signatures` so it fails the signature check
# when we filter down the backfill response down to events which
# have valid signatures in
# `_check_sigs_and_hash_for_pulled_events_and_fetch`
pulled_event_without_signatures.get_pdu_json(),
# Then later when we process this valid signature event, when we
# fetch the missing `prev_event`s, we want to make sure that we
# backoff and don't try and fetch `pulled_event_without_signatures`
# again since we know it just had an invalid signature.
pulled_event.get_pdu_json(),
],
}
)

# Keep track of the count and make sure we don't make any of these requests
event_endpoint_requested_count = 0
room_state_ids_endpoint_requested_count = 0
room_state_endpoint_requested_count = 0

async def get_event(
destination: str, event_id: str, timeout: Optional[int] = None
) -> None:
nonlocal event_endpoint_requested_count
event_endpoint_requested_count += 1

async def get_room_state_ids(
destination: str, room_id: str, event_id: str
) -> None:
nonlocal room_state_ids_endpoint_requested_count
room_state_ids_endpoint_requested_count += 1

async def get_room_state(
room_version: RoomVersion, destination: str, room_id: str, event_id: str
) -> None:
nonlocal room_state_endpoint_requested_count
room_state_endpoint_requested_count += 1

# We don't expect an outbound request to `/event`, `/state_ids`, or `/state` in
# the happy path but if the logic is sneaking around what we expect, stub that
# out so we can detect that failure
self.mock_federation_transport_client.get_event.side_effect = get_event
self.mock_federation_transport_client.get_room_state_ids.side_effect = (
get_room_state_ids
)
self.mock_federation_transport_client.get_room_state.side_effect = (
get_room_state
)

# The function under test: try to backfill and process the pulled event
with LoggingContext("test"):
self.get_success(
self.hs.get_federation_event_handler().backfill(
self.OTHER_SERVER_NAME,
room_id,
limit=1,
extremities=["$some_extremity"],
)
)

if event_endpoint_requested_count > 0:
self.fail(
"We don't expect an outbound request to /event in the happy path but if "
"the logic is sneaking around what we expect, make sure to fail the test. "
"We don't expect it because the signature failure should cause us to backoff "
"and not asking about pulled_event_without_signatures="
f"{pulled_event_without_signatures.event_id} again"
)

if room_state_ids_endpoint_requested_count > 0:
self.fail(
"We don't expect an outbound request to /state_ids in the happy path but if "
"the logic is sneaking around what we expect, make sure to fail the test. "
"We don't expect it because the signature failure should cause us to backoff "
"and not asking about pulled_event_without_signatures="
f"{pulled_event_without_signatures.event_id} again"
)

if room_state_endpoint_requested_count > 0:
self.fail(
"We don't expect an outbound request to /state in the happy path but if "
"the logic is sneaking around what we expect, make sure to fail the test. "
"We don't expect it because the signature failure should cause us to backoff "
"and not asking about pulled_event_without_signatures="
f"{pulled_event_without_signatures.event_id} again"
)

# Make sure we only recorded a single failure which corresponds to the signature
# failure initially in `_check_sigs_and_hash_for_pulled_events_and_fetch` before
# we process all of the pulled events.
backfill_num_attempts_for_event_without_signatures = self.get_success(
main_store.db_pool.simple_select_one_onecol(
table="event_failed_pull_attempts",
keyvalues={"event_id": pulled_event_without_signatures.event_id},
retcol="num_attempts",
)
)
self.assertEqual(backfill_num_attempts_for_event_without_signatures, 1)

# And we recorded a failure for the event that has the missing prev_event
# since we weren't able to go and fetch it.
backfill_num_attempts = self.get_success(
main_store.db_pool.simple_select_one_onecol(
table="event_failed_pull_attempts",
keyvalues={"event_id": pulled_event.event_id},
retcol="num_attempts",
)
)
self.assertEqual(backfill_num_attempts, 1)

def test_process_pulled_event_with_rejected_missing_state(self) -> None:
"""Ensure that we correctly handle pulled events with missing state containing a
rejected state event
Expand Down
Loading