
Implement shortest-path routing for inference #362

Merged (13 commits) on Jul 18, 2023
Update comment in _has_cache_for()
borzunov committed Jul 18, 2023
commit 4bd3c1741622c4a7d6538a3f15572075bf526de6
src/petals/client/routing/sequence_manager.py (6 changes: 4 additions & 2 deletions)

```diff
@@ -284,8 +284,10 @@ def _has_cache_for(span: RemoteSpanInfo, cache_tokens_needed: Optional[int] = None
     if cache_tokens_needed is None or span.server_info.cache_tokens_left is None:
         return True
 
-    # This is a pessimistic estimate that assumes that we'll use all blocks hosted by this server,
-    # which is not always true. This is okay since false positives are more costly than false negatives here.
+    # Here, `span` contains all blocks hosted by a server - but we won't necessarily run all of them through
+    # this particular server in our path. It is difficult to estimate how many blocks we'll use at this stage,
+    # so we assume that we'll use all of them (the worst case for the cache size) and get a pessimistic estimate.
+    # This is okay since false positives are more costly than false negatives here.
     return cache_tokens_needed * 2 * span.length <= span.server_info.cache_tokens_left
```
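For readers skimming the thread, here is a minimal, self-contained sketch of the heuristic in this hunk. The dataclasses are simplified stand-ins for the real Petals types, and the reading of the factor of 2 as keys-plus-values is an assumption, not something stated in the diff:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class _ServerInfo:
    # Free attention-cache capacity advertised via the DHT, in token * layer
    # units (the review thread below confirms servers report it this way).
    cache_tokens_left: Optional[int]


@dataclass
class _RemoteSpanInfo:
    start: int  # first block index hosted by this server
    end: int    # one past the last block index hosted by this server
    server_info: _ServerInfo

    @property
    def length(self) -> int:
        return self.end - self.start


def _has_cache_for(span: _RemoteSpanInfo, cache_tokens_needed: Optional[int] = None) -> bool:
    if cache_tokens_needed is None or span.server_info.cache_tokens_left is None:
        return True
    # Pessimistic estimate: assume the chosen path runs all `span.length`
    # blocks on this server (factor of 2 presumably covers keys and values).
    return cache_tokens_needed * 2 * span.length <= span.server_info.cache_tokens_left


# Filtering candidate spans before building a route:
spans = [
    _RemoteSpanInfo(0, 8, _ServerInfo(cache_tokens_left=50_000)),
    _RemoteSpanInfo(0, 8, _ServerInfo(cache_tokens_left=1_000)),
]
usable = [s for s in spans if _has_cache_for(s, cache_tokens_needed=2048)]
assert len(usable) == 1  # 2048 * 2 * 8 = 32768 fits in 50_000 but not in 1_000
```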
@justheuristic (Collaborator) commented on Jul 17, 2023:
Perhaps the servers should report cache tokens * layers to the DHT, so we can avoid estimates? In other words, multiply whatever is reported by the number of blocks hosted.

IIRC, this is a relatively new feature that can be modified without much pain.

@borzunov (Collaborator, Author) replied:

It's already reported this way. I've updated the comment to improve clarity.
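To make the unit bookkeeping concrete: since the advertised `cache_tokens_left` is already in token * layer units, the client-side check lines up without any per-layer estimate. The numbers below are made up purely for illustration:

```python
cache_tokens_needed = 1024   # tokens of attention cache the client needs
span_length = 10             # blocks hosted by this server
cache_tokens_left = 40_960   # advertised by the server, in token * layer units

# 1024 tokens * 2 (keys and values, per the check in the hunk) * 10 blocks
# = 20480 token*layer units, which fits into the 40960 budget, so this span
# passes the filter without the client estimating anything per layer.
assert cache_tokens_needed * 2 * span_length <= cache_tokens_left
```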


Unchanged context below the hunk:

```python
    def _make_sequence_with_max_throughput(self, start_index: int, end_index: int) -> List[RemoteSpanInfo]:
```