CB improvements #769
base: master
Conversation
auto running_sequences = sequence_group->get_running_sequences();
// TODO: ilavrenov - why beam search case is not handled here?
// in case of beam search 'blocks_num' are not equally distributed between sequences
// because some of them share the same blocks
CC @popovaan
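For context, a minimal sketch of the concern (illustrative names, not the actual BlockManager API): with beam search, sequences in a group can share KV-cache blocks, so summing `blocks_num` per sequence over-counts; counting unique block ids per group avoids that.

```cpp
#include <cstddef>
#include <set>
#include <vector>

// Illustrative stand-in for a KV-cache block; not the real type.
struct Block { std::size_t id; };

// Count the blocks a whole group occupies: blocks shared between beam-search
// sequences are counted once, unlike a naive per-sequence sum.
std::size_t unique_blocks_in_group(const std::vector<std::vector<Block>>& per_sequence_block_tables) {
    std::set<std::size_t> unique_ids;
    for (const auto& table : per_sequence_block_tables)
        for (const auto& block : table)
            unique_ids.insert(block.id);
    return unique_ids.size();
}
```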
// preempt prompt fully to not leave partially generated prompt
-preempted_tokens = processed_tokens;
+preempted_tokens = context_len;
// TODO: ilavrenov - what is we have multiple sequences within a group? we need to drop them all
CC @popovaan
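A hedged sketch of what "drop them all" could look like, using minimal stand-in types rather than the real SequenceGroup interface:

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Minimal stand-ins for illustration only; not the real openvino.genai types.
struct Sequence {
    std::size_t context_len = 0;       // prompt + generated tokens processed so far
    std::size_t preempted_tokens = 0;  // tokens rolled back on preemption
};
struct SequenceGroup {
    std::vector<std::shared_ptr<Sequence>> running;
    std::vector<std::shared_ptr<Sequence>>& get_running_sequences() { return running; }
};

// Preempt the prompt fully for every running sequence in the group, so no
// sequence is left with a partially processed prompt.
void preempt_group_fully(SequenceGroup& group) {
    for (auto& sequence : group.get_running_sequences())
        sequence->preempted_tokens = sequence->context_len;
}
```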
// token by token. This structure represents a vector of already generated tokens so far
// for a given prompt.
That's not exactly true. I would rephrase it as: "... a vector of tokens generated since the last read ...".
To be more specific, it's always a vector of one element if N == 1 and beam search is not used.
For N > 1 and/or when beam search is used, the vector contains all generated tokens.
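To make the proposed wording concrete, a simplified stand-in for the type (the field names are assumptions; only the `GenerationOutputs` alias matches the diff below):

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// Simplified stand-in: generated_ids holds the tokens produced since the last
// read() -- a single token per read for N == 1 without beam search, or all
// generated tokens when N > 1 and/or beam search is used.
struct GenerationOutput {
    std::vector<int64_t> generated_ids;
    float score = 0.0f;
};
using GenerationOutputs = std::unordered_map<uint64_t, GenerationOutput>;
```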
using GenerationOutputs = std::unordered_map<uint64_t, GenerationOutput>;

class GenerationStream;

-class OPENVINO_GENAI_EXPORTS GenerationHandleImpl {
+class OPENVINO_GENAI_EXPORTS GenerationHandle {
Why do we need this change?
std::shared_ptr<GenerationStream> m_generation_stream;
ov::genai::GenerationConfig m_sampling_params;

bool is_dropped();
// whether client ha dropped session with pipeline
Suggested change:
// whether client has dropped session with pipeline

bool can_read();
// whether new tokens are available
Suggested change:
// whether read() is possible (new tokens are available and handle has not been dropped)
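A hedged usage sketch of how the two predicates interact on the client side; the exact `read()` signature and the busy-poll are simplifications for illustration, not the verified API:

```cpp
#include <cstdint>
#include <vector>

void consume(const std::vector<int64_t>& token_ids);  // placeholder sink, defined elsewhere

// Client-side loop over the handle interface quoted above; a real client
// would wait on the stream instead of busy-polling, and exit once the
// generation status reports completion.
template <typename Handle>
void drain_outputs(Handle& handle) {
    while (!handle.is_dropped()) {   // stop once the client drops the session
        if (handle.can_read()) {     // new tokens are available to read
            for (const auto& [sequence_id, output] : handle.read())
                consume(output.generated_ids);
        }
    }
}
```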
// whether to split prompt / generate to different scheduling phases
// - Dynamic split fuse schdules requests in generation phase first, then
// schdules requests in prompt phase. If request cannot be fully fit into
// remaining space of 'max_num_batched_tokens' group, it's scheduled only partially
// and other tokens can be scheduled only next iterations
// - vLLM mode priorities requests in prompt phase over requests on generation phase
Suggested change:
// whether to split prompt / generate to different scheduling phases
// - Dynamic split fuse schedules requests in generation phase first, then
// schedules requests in prompt phase. If request cannot be fully fit into
// remaining space of 'max_num_batched_tokens' group, it's scheduled only partially
// and other tokens can be scheduled in next iterations
// - vLLM mode prioritizes requests in prompt phase over requests in generation phase
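A sketch of the two modes as described in the comment; the Scheduler stand-in and its method names are illustrative, with only `dynamic_split_fuse` and `max_num_batched_tokens` taken from the text above:

```cpp
#include <cstddef>

// Stand-in scheduler; method names are illustrative, not the real Scheduler API.
struct Scheduler {
    void schedule_generate_phase() { /* fill the batch with running sequences */ }
    void schedule_prompt_phase(bool allow_partial) { (void)allow_partial; /* admit waiting prompts */ }
};

struct SchedulerConfig {
    bool dynamic_split_fuse = true;            // the flag documented above
    std::size_t max_num_batched_tokens = 256;  // per-iteration token budget
};

// One scheduling iteration under each mode, following the comment's description.
void schedule_iteration(Scheduler& scheduler, const SchedulerConfig& config) {
    if (config.dynamic_split_fuse) {
        // generation phase first; prompt requests fill the remaining budget,
        // possibly only partially (leftover tokens wait for later iterations)
        scheduler.schedule_generate_phase();
        scheduler.schedule_prompt_phase(/*allow_partial=*/true);
    } else {
        // vLLM-like mode: prompt-phase requests take priority over generation
        scheduler.schedule_prompt_phase(/*allow_partial=*/false);
        scheduler.schedule_generate_phase();
    }
}
```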