Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement Postgres locking #1651

Merged
merged 16 commits into from
Sep 5, 2024
Merged

Implement Postgres locking #1651

merged 16 commits into from
Sep 5, 2024

Conversation

r4victor
Copy link
Collaborator

@r4victor r4victor commented Sep 2, 2024

Closes #1632

This PR implements database-level locking for handling concurrent processing in multi-replica server with Postgres setting. More specifically, it:

  • Implements resource locking via select for update.
  • Refactors code to be suitable for select for update locking. This includes replacing joinedload's, refetching attributes, ensuring the locks held for the entire transaction duration, splitting submitted jobs processing into two steps (assign instance or no instance/provision instance).
  • Refactors in-memory locking to replace custom per-table locks (*_LOCK) with a unified locker interface.
  • Fixes bugs with in-memory locking.
  • Adds contributing guide on locking implementation.

Tested on SQLite:

  • Launch runs on new instances
  • Launch runs on existing instances
  • Stop runs
  • Delete runs
  • Create/delete cloud fleets
  • Create/delete ssh fleets
  • Create/attach/detach/delete volumes
  • Create/delete gateways
  • Run multi-replica services

Tested on Postgres:

  • Launch runs on new instances (multi-replica server)
  • Launch runs on existing instances (multi-replica server)
  • Launch many concurrent runs (multi-replica server)
  • Stop runs (multi-replica server)
  • Create/delete cloud fleets (multi-replica server)
  • Create/attach/detach/delete volumes (multi-replica server)
  • Create/delete gateways (single replica server)
  • Run multi-replica services (single replica server)

@r4victor r4victor marked this pull request as ready for review September 3, 2024 11:22
Copy link
Collaborator

@jvstme jvstme left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@r4victor, thanks for this work. I found 3 unresolved TODOs/FIXMEs apart from compatibility TODOs. Are they all fine to merge?

contributing/LOCKING.md Outdated Show resolved Hide resolved
)
instances = res.all()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why discard processing multiple instances at once? Isn't it much faster than waiting for the next scheduled process_instances task in 4-6 seconds?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before the PR, instances were processed concurrently with asyncio.as_completed which cannot be done when the processing happens in the scope of one transaction (it's not safe to share an sqlalchemy session between coroutines). Processing multiple instances both concurrently and sequentially is also problematic because instance processing can take time (e.g. remote instance waiting) which would lead to blocking processing. The current approach tolerates some instances being slow to process. It's consistent with other processing tasks.

If the processing throughput/latency becomes a problem, we could just decrease processing interval / increase max_instances.

@@ -0,0 +1,68 @@
"""Add InstanceModel.last_processed_at
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(nit) And jobs.instance_assigned

contributing/LOCKING.md Outdated Show resolved Hide resolved
Co-authored-by: jvstme <36324149+jvstme@users.noreply.github.com>
@r4victor
Copy link
Collaborator Author

r4victor commented Sep 5, 2024

3 unresolved TODOs/FIXMEs

Those left were present before the changes and are not critical.

@r4victor r4victor merged commit e366836 into master Sep 5, 2024
18 checks passed
@r4victor r4victor deleted the issue_1632_multireplica_server branch September 5, 2024 05:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support multi-replica server
3 participants