This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Federation readers regularly get stuck at 100% CPU #10698

Closed
turt2live opened this issue Aug 25, 2021 · 5 comments · Fixed by #10703
Labels
X-Needs-Info This issue is blocked awaiting information from the reporter

Comments

@turt2live (Member)

Description

Federation reader workers regularly get stuck at 100% CPU for no discernible reason (the graphs don't reveal why they are spinning). Investigation is needed into what they are doing and why this happens.

The readers being down generally doesn't affect the local server itself, but it does affect every other server's ability to interact with it: joining rooms, looking up profiles, inviting members, and so on. The failure ends up making the other servers look bad when the cause isn't their fault :(

Steps to reproduce

Unclear

Version information

  • Homeserver: matrix.org
@erikjohnston (Member)

It seems completely wedged merging dicts. Python stack trace from GDB:

Traceback (most recent call first):
  File "/home/synapse/src/synapse/storage/databases/main/events_worker.py", line 583, in _get_events_from_cache_or_db
    event_entry_map.update(result)
  File "/home/synapse/src/synapse/storage/databases/main/events_worker.py", line 383, in get_events_as_list
    set(event_ids), allow_rejected=allow_rejected
  File "/home/synapse/src/synapse/storage/databases/main/event_federation.py", line 93, in get_auth_chain
    return await self.get_events_as_list(event_ids)
  File "/home/synapse/src/synapse/federation/federation_server.py", line 604, in on_send_join_request
    auth_chain = await self.store.get_auth_chain(room_id, state_ids)
  File "/home/synapse/src/synapse/federation/transport/server/federation.py", line 358, in on_PUT
    result = await self.handler.on_send_join_request(origin, content, room_id)

C stack trace:

#0  dk_get_index (i=<optimized out>, keys=<optimized out>) at Objects/dictobject.c:842
#1  lookdict_unicode_nodummy (value_addr=0x7fff0ad2f408, hash=7255247804242703278, key=0x7fc776e99220, mp=0x7fc7d428d048) at Objects/dictobject.c:845
#2  lookdict_unicode_nodummy (mp=<optimized out>, key=<optimized out>, hash=<optimized out>, value_addr=<optimized out>, mp=<optimized out>, key=<optimized out>, hash=<optimized out>, 
    value_addr=<optimized out>) at Objects/dictobject.c:826
#3  0x000055c499a283ce in insertdict (mp=0x7fc7d428d048, key=0x7fc776e99220, hash=7255247804242703278, value=0x7fc2e2e13e98) at Objects/dictobject.c:1015
#4  0x000055c499af308f in dict_merge (a=0x7fc7d428d048, b=0x7fc3a57bf898, override=1) at Objects/dictobject.c:2428
#5  0x000055c499af327b in PyDict_Merge (override=1, b=<optimized out>, a=0x7fc7d428d048) at Objects/dictobject.c:2513
#6  dict_update_common (methname=0x55c499bcfb10 "update", kwds=0x0, args=<optimized out>, self=0x7fc7d428d048) at Objects/dictobject.c:2250
#7  dict_update (self=0x7fc7d428d048, args=<optimized out>, kwds=0x0) at Objects/dictobject.c:2272

@erikjohnston (Member)

erikjohnston commented Aug 25, 2021

Py-spy:

[Screenshot: py-spy output, 2021-08-25 21-55-37]

@erikjohnston (Member)

erikjohnston commented Aug 25, 2021

Nothing was logged after it wedged, which suggests it was making zero forward progress.
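As a general aid for situations like this (a sketch, not what Synapse itself does), the stdlib `faulthandler` module can produce a stack dump like the GDB one without attaching a debugger: its signal handler is written in C and uses only async-signal-safe calls, so it can dump every thread's Python stack even while the interpreter is wedged inside a single C-level call.

```python
import faulthandler
import signal
import sys
import tempfile

# Register SIGUSR2 so that `kill -USR2 <pid>` dumps every thread's
# Python stack to stderr.  faulthandler writes from the C signal
# handler without taking the GIL, so this works even while the
# process is stuck inside a C loop such as dict_merge.
faulthandler.register(signal.SIGUSR2, file=sys.stderr, all_threads=True)

# The same dump, on demand, written to a file we can inspect:
with tempfile.TemporaryFile(mode="w+") as f:
    faulthandler.dump_traceback(file=f, all_threads=True)
    f.seek(0)
    dump = f.read()
print(dump.splitlines()[0])
```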

@anoadragon453 (Member)

matrix.org is on commit fe3466a.

Room was !xPRJVyOhRymgpkarYZ:matrix.org, aka #freenode_#freenode:matrix.org. The operation was part of getting the auth events on a /send_join from archlinux.org (Synapse 1.40.0).

@reivilibre reivilibre added T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. X-Needs-Info This issue is blocked awaiting information from the reporter and removed T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. labels Aug 26, 2021
@anoadragon453 (Member)

Federation readers 1 & 2 both experienced the same issue at roughly 10am UTC today. The latest logs from the processes:

Federation reader 1

2021-08-26 10:00:39,369 - synapse.replication.tcp.handler - 562 - INFO - process-replication-data-91522393 - Caught up with stream 'events' to 2261887448
2021-08-26 10:00:39,404 - synapse.access.http.8084 - 410 - INFO - GET-13210389 - 178.254.52.123 - 8084 - {asra.gr} Processed request: 0.010sec/-0.000sec (0.001sec, 0.000sec) (0.001sec/0.002sec/2) 654B 200 "GET /_matrix/federation/v1/user/devices/%40matterbridge1%3Amatrix.org HTTP/1.1" "Synapse/1.41.0" [0 dbevts]
2021-08-26 10:00:39,406 - synapse.replication.tcp.handler - 509 - INFO - replication_command_handler-140529943617816 - Handling 'POSITION events event_persister-1 2261887448 2261887449'
2021-08-26 10:00:39,406 - synapse.replication.tcp.handler - 562 - INFO - process-replication-data-91522399 - Caught up with stream 'events' to 2261887449
2021-08-26 10:00:39,408 - synapse.replication.tcp.handler - 509 - INFO - replication_command_handler-140529943617816 - Handling 'POSITION events event_persister-3 2261887448 2261887449'
2021-08-26 10:00:39,408 - synapse.replication.tcp.handler - 562 - INFO - process-replication-data-91522400 - Caught up with stream 'events' to 2261887449
2021-08-26 10:00:39,429 - synapse.access.http.8084 - 410 - INFO - GET-13210390 - 2a02:c207:2014:8709::1 - 8084 - {matrix.virtualpudding.org} Processed request: 0.010sec/-0.000sec (0.001sec, 0.000sec) (0.001sec/0.003sec/2) 2723B 200 "GET /_matrix/federation/v1/user/devices/%40Kobberholm%3Amatrix.org HTTP/1.1" "Synapse/1.31.0" [0 dbevts]
2021-08-26 10:00:39,432 - synapse.access.http.8084 - 410 - INFO - GET-13210376 - 2003:a:b4c:7301::4:130 - 8084 - {matrix02.m31.zone} Processed request: 0.507sec/-0.000sec (0.003sec, 0.000sec) (0.000sec/0.001sec/1) 1005B 200 "GET /_matrix/federation/v1/event/$Nr9pSLjHPMLgwboRXBENaubm-cXZcMJswYMMZfIYiE8 HTTP/1.1" "Dendrite/0.4.1+125ea75b" [0 dbevts]
2021-08-26 10:00:39,442 - synapse.access.http.8084 - 410 - INFO - GET-13210391 - 165.227.224.215 - 8084 - {project-nex.net} Processed request: 0.009sec/-0.000sec (0.003sec, 0.000sec) (0.001sec/0.003sec/3) 4812B 200 "GET /_matrix/federation/v1/user/devices/%40dinosaurdynasty%3Amatrix.org HTTP/1.1" "Synapse/1.37.1" [0 dbevts]
2021-08-26 10:00:39,470 - synapse.access.http.8084 - 410 - INFO - GET-13210377 - 2003:a:b4c:7301::4:130 - 8084 - {matrix02.m31.zone} Processed request: 0.521sec/-0.000sec (0.003sec, 0.001sec) (0.001sec/0.009sec/3) 963B 200 "GET /_matrix/federation/v1/event/$bWnGzbOBCt-4dzGLDf7m1DkM419ns3f3v0f3wASTPm8 HTTP/1.1" "Dendrite/0.4.1+125ea75b" [1 dbevts]
2021-08-26 14:52:36,505 - root - 336 - WARNING - main - ***** STARTING SERVER *****

Federation reader 2

2021-08-26 10:01:41,918 - synapse.replication.tcp.handler - 509 - INFO - replication_command_handler-140180607468312 - Handling 'POSITION events event_persister-2 2261889087 2261889088'
2021-08-26 10:01:41,919 - synapse.replication.tcp.handler - 562 - INFO - process-replication-data-90625949 - Caught up with stream 'events' to 2261889088
2021-08-26 10:01:41,931 - synapse.replication.tcp.handler - 509 - INFO - replication_command_handler-140180607468312 - Handling 'POSITION events event_persister-2 2261889088 2261889090'
2021-08-26 10:01:41,932 - synapse.replication.tcp.handler - 562 - INFO - process-replication-data-90625953 - Caught up with stream 'events' to 2261889090
2021-08-26 10:01:41,946 - synapse.9533_debug - 435 - DEBUG - process-replication-data-90625956 - to-device messages stream id 315805302, awaking streams for ['@nelsond:matrix.org']
2021-08-26 10:01:41,956 - synapse.replication.tcp.handler - 509 - INFO - replication_command_handler-140180607468312 - Handling 'POSITION events event_persister-2 2261889090 2261889091'
2021-08-26 10:01:41,956 - synapse.replication.tcp.handler - 562 - INFO - process-replication-data-90625957 - Caught up with stream 'events' to 2261889091
2021-08-26 10:01:42,004 - synapse.replication.tcp.handler - 509 - INFO - replication_command_handler-140180607468312 - Handling 'POSITION events event_persister-1 2261889092 2261889093'
2021-08-26 10:01:42,005 - synapse.replication.tcp.handler - 562 - INFO - process-replication-data-90625965 - Caught up with stream 'events' to 2261889093
2021-08-26 14:52:36,506 - root - 336 - WARNING - main - ***** STARTING SERVER *****

We restarted each worker, which got them going again.
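Restarting clears the wedge but throws away the evidence. One low-cost way to capture a stack trace automatically the next time a worker wedges (again a stdlib sketch, not Synapse's actual instrumentation) is `faulthandler.dump_traceback_later`: its watchdog runs as a C-level thread that does not need the GIL, so it fires even when every Python thread is frozen behind a stuck C call.

```python
import faulthandler
import tempfile
import time

# Arm a watchdog: if cancel_dump_traceback_later() is not called within
# the timeout, every thread's stack is dumped to `log`.  A healthy loop
# would re-arm (or cancel) the timer each tick; a wedged one lets it fire.
log = tempfile.NamedTemporaryFile(mode="w+")
faulthandler.dump_traceback_later(timeout=0.2, file=log)

time.sleep(0.5)  # simulate a wedge longer than the timeout
faulthandler.cancel_dump_traceback_later()

log.seek(0)
dump = log.read()
print(dump[:40])
```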
