Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Add missing worker types to Complement worker-mode runs #14871

Closed
wants to merge 4 commits into from

Conversation

anoadragon453
Copy link
Member

@anoadragon453 anoadragon453 commented Jan 19, 2023

We were missing some possible worker types from the list that is used in CI and complement.sh. This will likely cause CI to be slower, but comes with the benefit of more complete integration test coverage. Surprisingly, it looks to have not made it much slower. The action in this PR took 38m 13s, whereas the equivalent action in #14869 took 37m 18s. That was tested when workers were failing to start up. The actual deploy times look to have gone up significantly with this change.

'account_data' being missing from this list caused CI to not catch #14869. As such, CI is expected to fail until that PR is merged. merged.

We were missing some possible worker types from the list that is used
in CI and complement.sh. This will likely cause CI to be slower, but
comes with the benefit of more complete test coverage.

'account_data' being missing from this list caused CI to not catch
#14869.
@realtyem
Copy link
Contributor

realtyem commented Jan 19, 2023

Looks like this is bringing up TestJumpToDateEndpoint and TestSearch as additional failures.

May need to bump the Postgres database connection limit to avoid false negatives.

Adding RUN echo "ALTER SYSTEM SET max_connections = '500'" | gosu postgres postgres --single into the dockerfile for Complement should be enough.

# We also initialize the database at build time, rather than runtime, so that it's faster to spin up the image.
RUN gosu postgres initdb --locale=C --encoding=UTF-8 --auth-host password

I put mine right under here so it is inherited into the synapse database that gets created.
Manifests as:

A bunch of these
synchrotron1 | 2023-01-19 17:18:36,766 - synapse.http.client - 460 - INFO - GET-10 - Received response to POST http://localhost:18024/_synapse/replication/presence_set_state/%40alice%3Ahs1: 200
synchrotron1 | 2023-01-19 17:18:36,805 - synapse.http.server - 124 - ERROR - GET-10 - Failed handle request via 'SyncRestServlet': <SynapseRequest at 0x7f88b65d2340 method='GET' uri='/_matrix/client/v3/sync?timeout=1000' clientproto='HTTP/1.0' site='18019'>
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/twisted/internet/defer.py", line 1693, in _inlineCallbacks
    result = context.run(
  File "/usr/local/lib/python3.9/site-packages/twisted/python/failure.py", line 518, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/usr/local/lib/python3.9/site-packages/synapse/util/caches/response_cache.py", line 252, in cb
    return await callback(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/synapse/handlers/sync.py", line 371, in _wait_for_sync_for_user
    result: SyncResult = await self.current_sync_for_user(
  File "/usr/local/lib/python3.9/site-packages/synapse/handlers/sync.py", line 422, in current_sync_for_user
    sync_result = await self.generate_sync_result(
  File "/usr/local/lib/python3.9/site-packages/synapse/handlers/sync.py", line 1416, in generate_sync_result
    ) = await self._generate_sync_entry_for_rooms(
  File "/usr/local/lib/python3.9/site-packages/synapse/handlers/sync.py", line 1870, in _generate_sync_entry_for_rooms
    await concurrently_execute(handle_room_entries, room_entries, 10)
  File "/usr/local/lib/python3.9/site-packages/synapse/util/async_helpers.py", line 247, in concurrently_execute
    await yieldable_gather_results(
  File "/usr/local/lib/python3.9/site-packages/synapse/util/async_helpers.py", line 304, in yieldable_gather_results
    raise dfe.subFailure.value from None
  File "/usr/local/lib/python3.9/site-packages/twisted/internet/defer.py", line 1693, in _inlineCallbacks
    result = context.run(
  File "/usr/local/lib/python3.9/site-packages/twisted/python/failure.py", line 518, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/usr/local/lib/python3.9/site-packages/synapse/util/async_helpers.py", line 233, in _concurrently_execute_inner
    await maybe_awaitable(func(value))
  File "/usr/local/lib/python3.9/site-packages/synapse/handlers/sync.py", line 1859, in handle_room_entries
    await self._generate_room_entry(
  File "/usr/local/lib/python3.9/site-packages/synapse/handlers/sync.py", line 2304, in _generate_room_entry
    batch = await self._load_filtered_recents(
  File "/usr/local/lib/python3.9/site-packages/synapse/handlers/sync.py", line 597, in _load_filtered_recents
    events, end_key = await self.store.get_recent_events_for_room(
  File "/usr/local/lib/python3.9/site-packages/synapse/storage/databases/main/stream.py", line 713, in get_recent_events_for_room
    rows, token = await self.get_recent_event_ids_for_room(
  File "/usr/local/lib/python3.9/site-packages/synapse/storage/databases/main/stream.py", line 743, in get_recent_event_ids_for_room
    rows, token = await self.db_pool.runInteraction(
  File "/usr/local/lib/python3.9/site-packages/synapse/storage/database.py", line 882, in runInteraction
    return await delay_cancellation(_runInteraction())
  File "/usr/local/lib/python3.9/site-packages/twisted/internet/defer.py", line 1693, in _inlineCallbacks
    result = context.run(
  File "/usr/local/lib/python3.9/site-packages/twisted/python/failure.py", line 518, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/usr/local/lib/python3.9/site-packages/synapse/storage/database.py", line 849, in _runInteraction
    result = await self.runWithConnection(
  File "/usr/local/lib/python3.9/site-packages/synapse/storage/database.py", line 977, in runWithConnection
    return await make_deferred_yieldable(
  File "/usr/local/lib/python3.9/site-packages/twisted/python/threadpool.py", line 244, in inContext
    result = inContext.theWork()  # type: ignore[attr-defined]
  File "/usr/local/lib/python3.9/site-packages/twisted/python/threadpool.py", line 260, in <lambda>
    inContext.theWork = lambda: context.call(  # type: ignore[attr-defined]
  File "/usr/local/lib/python3.9/site-packages/twisted/python/context.py", line 117, in callWithContext
    return self.currentContext().callWithContext(ctx, func, *args, **kw)
  File "/usr/local/lib/python3.9/site-packages/twisted/python/context.py", line 82, in callWithContext
    return func(*args, **kw)
  File "/usr/local/lib/python3.9/site-packages/twisted/enterprise/adbapi.py", line 280, in _runWithConnection
    conn = self.connectionFactory(self)
  File "/usr/local/lib/python3.9/site-packages/twisted/enterprise/adbapi.py", line 34, in __init__
    self.reconnect()
  File "/usr/local/lib/python3.9/site-packages/twisted/enterprise/adbapi.py", line 71, in reconnect
    self._connection = self._pool.connect()
  File "/usr/local/lib/python3.9/site-packages/twisted/enterprise/adbapi.py", line 410, in connect
    conn = self.dbapi.connect(*self.connargs, **self.connkw)
  File "/usr/local/lib/python3.9/site-packages/psycopg2/__init__.py", line 122, in connect
    conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
psycopg2.OperationalError: FATAL:  sorry, too many clients already

Followed by:

One of these
synchrotron1 | 2023-01-19 17:18:36,811 - synapse.access.http.18019 - 460 - INFO - GET-10 - ::ffff:127.0.0.1 - 18019 - {@alice:hs1} Processed request: 0.060sec/0.004sec (0.015sec, 0.001sec) (0.045sec/0.008sec/9) 55B 500 "GET /_matrix/client/v3/sync?timeout=1000 HTTP/1.0" "Go-http-client/1.1" [0 dbevts]
synchrotron1 | 2023-01-19 17:18:36,854 - synapse.metrics.background_process_metrics - 244 - ERROR - fetch_events-7 - Background process 'fetch_events' threw an exception
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/synapse/metrics/background_process_metrics.py", line 242, in run
    return await func(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/synapse/storage/databases/main/events_worker.py", line 1060, in _fetch_thread
    await self.db_pool.runWithConnection(self._fetch_loop)
  File "/usr/local/lib/python3.9/site-packages/synapse/storage/database.py", line 977, in runWithConnection
    return await make_deferred_yieldable(
  File "/usr/local/lib/python3.9/site-packages/twisted/python/threadpool.py", line 244, in inContext
    result = inContext.theWork()  # type: ignore[attr-defined]
  File "/usr/local/lib/python3.9/site-packages/twisted/python/threadpool.py", line 260, in <lambda>
    inContext.theWork = lambda: context.call(  # type: ignore[attr-defined]
  File "/usr/local/lib/python3.9/site-packages/twisted/python/context.py", line 117, in callWithContext
    return self.currentContext().callWithContext(ctx, func, *args, **kw)
  File "/usr/local/lib/python3.9/site-packages/twisted/python/context.py", line 82, in callWithContext
    return func(*args, **kw)
  File "/usr/local/lib/python3.9/site-packages/twisted/enterprise/adbapi.py", line 280, in _runWithConnection
    conn = self.connectionFactory(self)
  File "/usr/local/lib/python3.9/site-packages/twisted/enterprise/adbapi.py", line 34, in __init__
    self.reconnect()
  File "/usr/local/lib/python3.9/site-packages/twisted/enterprise/adbapi.py", line 71, in reconnect
    self._connection = self._pool.connect()
  File "/usr/local/lib/python3.9/site-packages/twisted/enterprise/adbapi.py", line 410, in connect
    conn = self.dbapi.connect(*self.connargs, **self.connkw)
  File "/usr/local/lib/python3.9/site-packages/psycopg2/__init__.py", line 122, in connect
    conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
psycopg2.OperationalError: FATAL:  sorry, too many clients already

Default on Postgres is 100 connections.

  • Was 14 workers(plus master), which at a minimum of 5 connections opened is 75. At maximum of 10 connections that was 150 and was (mostly) never hit.
  • Now there is 19 workers(plus master), so the minimum would be 100 and max would be 200. This will be hit a lot more often(especially on synchrotron).

Then there is the reserved connections Postgres likes to keep to(3 I think?).

@anoadragon453
Copy link
Member Author

Thanks @realtyem, your suggestion of upping the allowed connected count has removed any sorry, too many clients already errors from the logs.

Current CI failures are due to #14869 not being merged yet.

Copy link
Contributor

@DMRobertson DMRobertson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this is good to go. Any reason for keeping this in draft?

@clokep
Copy link
Member

clokep commented Feb 16, 2023

Looks like this is good to go. Any reason for keeping this in draft?

I think the CI failure aren't taken care of still: #14871 (comment)

@DMRobertson
Copy link
Contributor

Oh, I see. I thought that referred to "just" a flake, but I guess it's atest failure that's been exposed by this change(?)

@clokep
Copy link
Member

clokep commented Feb 16, 2023

Oh, I see. I thought that referred to "just" a flake, but I guess it's atest failure that's been exposed by this change(?)

Yeah, I don't think it is a flake, but a failure!

@anoadragon453
Copy link
Member Author

anoadragon453 commented Mar 3, 2023

CI is passing, now that #14869 has merged. However it looks like the time it takes to run Complement worker tests nearly doubles with this change.

The time it typically takes for the Complement Synapse container to be deployed when starting a new Complement test run is ~12s: https://github.com/matrix-org/synapse/actions/runs/4322697303/jobs/7545484048#step:6:7831

After this change, we're seeing more like ~22-23s: https://github.com/matrix-org/synapse/actions/runs/4322703753/jobs/7545484103#step:6:7990

This is significant, as this bump is applied to every test. So while typical runtimes for a Complement workers test run is ~36m. With this change, the run time becomes ~1h.

Still, I think this is worth it for correctness - though it would be nice to pair it with some performance work to maintain (relatively) low running times for Complement worker runs in CI.

@anoadragon453 anoadragon453 marked this pull request as ready for review March 3, 2023 12:05
@anoadragon453 anoadragon453 requested a review from a team as a code owner March 3, 2023 12:05
@anoadragon453
Copy link
Member Author

Elegantly superseded by #14921 🎉

@anoadragon453 anoadragon453 deleted the anoa/complement_worker_add_types branch March 17, 2023 10:42
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants