CB improvements #769
base: master
Conversation
auto running_sequences = sequence_group->get_running_sequences();
// TODO: ilavrenov - why beam search case is not handled here?
// in case of beam search 'blocks_num' are not equally distributed between sequences
// because some of them share the same blocks
CC @popovaan
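For context, a minimal sketch of the concern (illustrative names, not the actual BlockManager API): with beam search, sequences in a group can share KV-cache blocks, so summing `blocks_num` per sequence over-counts; counting unique block ids per group avoids that.

```cpp
#include <cstddef>
#include <set>
#include <vector>

// Illustrative stand-in for a KV-cache block; not the real type.
struct Block { std::size_t id; };

// Count the blocks a whole group occupies: blocks shared between beam-search
// sequences are counted once, unlike a naive per-sequence sum.
std::size_t unique_blocks_in_group(const std::vector<std::vector<Block>>& per_sequence_block_tables) {
    std::set<std::size_t> unique_ids;
    for (const auto& table : per_sequence_block_tables)
        for (const auto& block : table)
            unique_ids.insert(block.id);
    return unique_ids.size();
}
```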
// preempt prompt fully to not leave partially generated prompt
-preempted_tokens = processed_tokens;
+preempted_tokens = context_len;
// TODO: ilavrenov - what is we have multiple sequences within a group? we need to drop them all
CC @popovaan
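A hedged sketch of what "drop them all" could look like, using minimal stand-in types rather than the real SequenceGroup interface:

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Minimal stand-ins for illustration only; not the real openvino.genai types.
struct Sequence {
    std::size_t context_len = 0;       // prompt + generated tokens processed so far
    std::size_t preempted_tokens = 0;  // tokens rolled back on preemption
};
struct SequenceGroup {
    std::vector<std::shared_ptr<Sequence>> running;
    std::vector<std::shared_ptr<Sequence>>& get_running_sequences() { return running; }
};

// Preempt the prompt fully for every running sequence in the group, so no
// sequence is left with a partially processed prompt.
void preempt_group_fully(SequenceGroup& group) {
    for (auto& sequence : group.get_running_sequences())
        sequence->preempted_tokens = sequence->context_len;
}
```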
// token by token. This structure represents a vector of already generated tokens so far
// for a given prompt.
That's not exactly true. I would rephrase it as: "... a vector of tokens generated since the last read ...".
To be more specific, it's always a vector of one element if N == 1 and beam search is not used.
For N > 1 and/or when beam search is used, the vector contains all generated tokens.
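To make the proposed wording concrete, a simplified stand-in for the type (the field names are assumptions; only the `GenerationOutputs` alias matches the diff below):

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// Simplified stand-in: generated_ids holds the tokens produced since the last
// read() -- a single token per read for N == 1 without beam search, or all
// generated tokens when N > 1 and/or beam search is used.
struct GenerationOutput {
    std::vector<int64_t> generated_ids;
    float score = 0.0f;
};
using GenerationOutputs = std::unordered_map<uint64_t, GenerationOutput>;
```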
using GenerationOutputs = std::unordered_map<uint64_t, GenerationOutput>;

class GenerationStream;

-class OPENVINO_GENAI_EXPORTS GenerationHandleImpl {
+class OPENVINO_GENAI_EXPORTS GenerationHandle {
Why do we need this change?
std::shared_ptr<GenerationStream> m_generation_stream;
ov::genai::GenerationConfig m_sampling_params;

bool is_dropped();
// whether client ha dropped session with pipeline
Suggested change:
// whether client has dropped session with pipeline

bool can_read();
// whether new tokens are available
Suggested change:
// whether read() is possible (new tokens are available and handle has not been dropped)
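A hedged usage sketch of how the two predicates interact on the client side; the exact `read()` signature and the busy-poll are simplifications for illustration, not the verified API:

```cpp
#include <cstdint>
#include <vector>

void consume(const std::vector<int64_t>& token_ids);  // placeholder sink, defined elsewhere

// Client-side loop over the handle interface quoted above; a real client
// would wait on the stream instead of busy-polling, and exit once the
// generation status reports completion.
template <typename Handle>
void drain_outputs(Handle& handle) {
    while (!handle.is_dropped()) {   // stop once the client drops the session
        if (handle.can_read()) {     // new tokens are available to read
            for (const auto& [sequence_id, output] : handle.read())
                consume(output.generated_ids);
        }
    }
}
```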
// whether to split prompt / generate to different scheduling phases
// - Dynamic split fuse schdules requests in generation phase first, then
// schdules requests in prompt phase. If request cannot be fully fit into
// remaining space of 'max_num_batched_tokens' group, it's scheduled only partially
// and other tokens can be scheduled only next iterations
// - vLLM mode priorities requests in prompt phase over requests on generation phase
Suggested change:
// whether to split prompt / generate to different scheduling phases
// - Dynamic split fuse schedules requests in generation phase first, then
// schedules requests in prompt phase. If request cannot be fully fit into
// remaining space of 'max_num_batched_tokens' group, it's scheduled only partially
// and other tokens can be scheduled in next iterations
// - vLLM mode prioritizes requests in prompt phase over requests in generation phase
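A sketch of the two modes as described in the comment; the Scheduler stand-in and its method names are illustrative, with only `dynamic_split_fuse` and `max_num_batched_tokens` taken from the text above:

```cpp
#include <cstddef>

// Stand-in scheduler; method names are illustrative, not the real Scheduler API.
struct Scheduler {
    void schedule_generate_phase() { /* fill the batch with running sequences */ }
    void schedule_prompt_phase(bool allow_partial) { (void)allow_partial; /* admit waiting prompts */ }
};

struct SchedulerConfig {
    bool dynamic_split_fuse = true;            // the flag documented above
    std::size_t max_num_batched_tokens = 256;  // per-iteration token budget
};

// One scheduling iteration under each mode, following the comment's description.
void schedule_iteration(Scheduler& scheduler, const SchedulerConfig& config) {
    if (config.dynamic_split_fuse) {
        // generation phase first; prompt requests fill the remaining budget,
        // possibly only partially (leftover tokens wait for later iterations)
        scheduler.schedule_generate_phase();
        scheduler.schedule_prompt_phase(/*allow_partial=*/true);
    } else {
        // vLLM-like mode: prompt-phase requests take priority over generation
        scheduler.schedule_prompt_phase(/*allow_partial=*/false);
        scheduler.schedule_generate_phase();
    }
}
```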