Rebalance swarm when necessary #34

Merged: 12 commits into main from extract-module-container on Oct 12, 2022

borzunov (Collaborator) commented Jul 22, 2022

Status: I've tested it on simple cases, it works.

Future work:

  • Sometimes the server fails to restart due to shmem errors. Rebalancing is disabled by default until we solve that.
  • The Server should follow the ModuleContainer status and restart it if it crashes.
  • I'd like to add some docs on properties of the greedy algorithm in block_selection.py.
  • Add unit tests for algorithms from block_selection.py.
  • (maybe) Add functional tests for real server rebalancing.
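
For readers unfamiliar with the idea, here is a minimal sketch of what a greedy block-selection rule could look like. The contents of block_selection.py are not shown in this conversation, so the function name `choose_weakest_span`, its arguments, and the rule itself (pick the contiguous span of blocks with the lowest total throughput) are assumptions for illustration, not the PR's code.

```python
from typing import List


def choose_weakest_span(throughputs: List[float], num_blocks: int) -> List[int]:
    """Hypothetical greedy rule: return the indices of the contiguous span of
    `num_blocks` blocks whose total throughput is lowest (ties go to the
    earliest span)."""
    if not 0 < num_blocks <= len(throughputs):
        raise ValueError("num_blocks must be between 1 and the number of blocks")

    # Sliding-window sum over all contiguous spans of length num_blocks.
    window = sum(throughputs[:num_blocks])
    best_start, best_sum = 0, window
    for start in range(1, len(throughputs) - num_blocks + 1):
        window += throughputs[start + num_blocks - 1] - throughputs[start - 1]
        if window < best_sum:
            best_start, best_sum = start, window
    return list(range(best_start, best_start + num_blocks))


# Per-block throughput currently available in the swarm (made-up numbers).
swarm_throughputs = [3.0, 3.0, 1.0, 0.0, 2.0, 5.0]
print(choose_weakest_span(swarm_throughputs, num_blocks=2))  # [2, 3]
```

A function of this shape is also easy to cover with the unit tests mentioned above, since it takes plain lists and has no side effects.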

borzunov changed the title from "Extract ModuleContainer class from Server" to "Load other blocks if the swarm is imbalanced" on Oct 11, 2022
borzunov force-pushed the extract-module-container branch 3 times, most recently from 5b2406d to 2308245, on October 11, 2022 10:34
borzunov changed the title from "Load other blocks if the swarm is imbalanced" to "Rebalance swarm when necessary" on Oct 11, 2022
borzunov marked this pull request as ready for review on October 11, 2022 10:45
@@ -29,76 +30,13 @@


 class Server(threading.Thread):
-    """Serves one or more bloom layers for inference, forward and backward; announces oneself to the DHT"""
+    """
+    Runs ModuleContainer, periodically checks that the network is balanced,

Collaborator

[nit] ModuleContainer sounds somewhat too generic for a class that announces modules to DHT, accepts requests from p2pd, etc

Maybe:

  • Server -> LoadBalancedServer
  • ModuleContainer -> Server?

borzunov (Collaborator, Author), Oct 12, 2022

I agree that ModuleContainer is not a perfect name, but I didn't come up with better options. If we find one, we can rename this class later.
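
To make the division of responsibilities in this thread concrete, here is a minimal sketch of an outer loop that owns one container thread and replaces it when a check says so. The names `Container`, `run_server`, `choose_blocks`, and `should_rebalance` are hypothetical stand-ins, and the real Server and ModuleContainer do much more (DHT announcements, request handling); this only illustrates the restart pattern under those assumptions.

```python
import threading
import time


class Container(threading.Thread):
    """Hypothetical stand-in for ModuleContainer: serves a fixed set of blocks."""

    def __init__(self, block_indices):
        super().__init__(daemon=True)
        self.block_indices = list(block_indices)
        self._stop_event = threading.Event()

    def run(self):
        # A real container would serve requests here; the sketch only waits for shutdown.
        self._stop_event.wait()

    def shutdown(self):
        self._stop_event.set()
        self.join()


def run_server(choose_blocks, should_rebalance, max_checks, check_period=0.01):
    """Hypothetical stand-in for Server: start a container, then periodically
    replace it if it has died or if the swarm looks imbalanced.
    Returns the list of block sets that were served, in order."""
    container = Container(choose_blocks())
    container.start()
    history = [container.block_indices]
    for _ in range(max_checks):
        time.sleep(check_period)
        if not container.is_alive() or should_rebalance():
            container.shutdown()
            container = Container(choose_blocks())
            container.start()
            history.append(container.block_indices)
    container.shutdown()
    return history


# Usage: the second check reports an imbalance, so the container is restarted once.
choices = iter([[0, 1], [4, 5]])
flags = iter([False, True, False])
print(run_server(lambda: next(choices), lambda: next(flags), max_checks=3))
# [[0, 1], [4, 5]]
```

The `is_alive()` check corresponds to the "restart it if it crashes" item listed under future work; whether the merged code already does this is not stated here.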



class ModuleContainer(threading.Thread):
"""Serves a set of specific Bloom layers for inference, forward, and backward. Announces itself over the DHT."""

Collaborator

ServedLayers?

max_block_selection_delay: float = 1,
mean_block_selection_delay: float = 0.5,
mean_balance_check_period: float = 150,
min_balance_quality: float = 0.8,

Collaborator Author

Suggested change:
- min_balance_quality: float = 0.8,
+ min_balance_quality: float = 0.0,

TODO: Disable rebalancing by default until we solve the issues with shmem
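
A short sketch of why a threshold of 0.0 would switch rebalancing off, assuming balance quality is defined as the ratio of the swarm's current bottleneck throughput to the bottleneck throughput achievable if this server moved. That definition, the exponentially distributed check delay, and the helper names `needs_rebalance` and `next_check_delay` are assumptions for illustration; the PR's actual formulas are not shown in this conversation.

```python
import random


def needs_rebalance(current_min: float, achievable_min: float, min_balance_quality: float) -> bool:
    """Hypothetical check: rebalance only if the current bottleneck throughput
    is too small a fraction of what moving this server could achieve."""
    if achievable_min <= 0:
        return False  # moving cannot improve a swarm with no achievable throughput
    quality = current_min / achievable_min  # 1.0 means nothing to gain by moving
    return quality < min_balance_quality


def next_check_delay(mean_balance_check_period: float, rng: random.Random) -> float:
    """Hypothetical randomized delay with the given mean, so that servers
    do not all check the balance at the same moment."""
    return rng.expovariate(1.0 / mean_balance_check_period)


# With the threshold at 0.8, a swarm at half its achievable throughput triggers a move.
print(needs_rebalance(1.0, 2.0, min_balance_quality=0.8))  # True
# With the threshold at 0.0, quality (which is never negative) cannot fall below it.
print(needs_rebalance(0.0, 2.0, min_balance_quality=0.0))  # False

rng = random.Random(0)
print(next_check_delay(150, rng) > 0)  # True
```

Under this assumed definition, quality is never negative, so a default of 0.0 keeps the check in place but guarantees it never fires.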

borzunov merged commit 149f433 into main on Oct 12, 2022