
Commit

Merge pull request openvinotoolkit#201 from Wovchena/merge-releases/2023/3-into-master

Merge releases/2023/3 into master
Wovchena committed Jan 30, 2024
2 parents 705bf42 + 00a4a80 commit 59edbc1
Showing 5 changed files with 23 additions and 1 deletion.
19 changes: 19 additions & 0 deletions text_generation/causal_lm/cpp/README.md
@@ -4,6 +4,25 @@ These examples showcase inference of text-generation Large Language Models (LLMs

## How it works

### Stateful LLM

A common LLM inference optimization is the introduction of a past KV (key/value) cache. This cache is represented by the corresponding inputs and outputs of a model originally implemented in a DL framework (e.g. PyTorch models from HuggingFace). To optimize it further and simplify usage, the model is transformed to a stateful form. This transformation improves inference performance and decreases the amount of runtime memory allocated in long-running text generation scenarios. It is achieved by hiding the inputs and outputs of the model that represent past KV-cache tensors and handling them inside the model in a more efficient way. The cache is still accessible through the state API. This contrasts with the stateless model approach, which requires manipulating these inputs and outputs explicitly.
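
As an illustration (a minimal sketch, not taken from the sample code), the hidden cache can still be reached and cleared through the state API, assuming `lm` is an `ov::InferRequest` created from such a stateful model:

```cpp
#include <iostream>
#include <openvino/openvino.hpp>

// Sketch only: list the model's hidden variables (the KV-cache tensors)
// and reset the accumulated context before an unrelated text sequence.
void inspect_and_reset(ov::InferRequest& lm) {
    for (ov::VariableState& state : lm.query_state()) {
        // Each variable state exposes the current contents of one cache tensor
        std::cout << state.get_name() << ": " << state.get_state().get_shape() << '\n';
    }
    lm.reset_state();  // drop the accumulated KV-cache
}
```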

Hiding the KV-cache introduces a peculiarity for the beam search algorithm. Beam search relies on batched inference of multiple beams. The design described so far would result in generating multiple independent sequences of tokens. The beam search algorithm, on the other hand, requires removing some of the ongoing beams and splitting other beams into multiple branches. Beam removal requires deleting the corresponding KV-cache entry, and beam splitting requires copying the corresponding KV-cache values.

To make beam search possible without accessing the model's internal state, a stateful LLM converted with `optimum-intel` or [llm_bench](../../../llm_bench/python/) introduces an additional one-dimensional `beam_idx` input. `beam_idx` must contain the indices of the batch elements that are selected to continue evolving during the next beam search iteration. Suppose there are two running beams. To continue generating both beams at the next iteration, `beam_idx` must be `[0, 1]`, pointing to batch elements `0` and `1`. To drop the last beam and split the other beam in two, `beam_idx` must be set to `[0, 0]`; this utilizes only the part of the KV-cache corresponding to the zeroth element in the batch. The process of selecting the appropriate cache entries is called Cache Reorder.
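
A possible application-side sketch of Cache Reorder (hedged, not taken verbatim from the samples), where `lm` is the `ov::InferRequest` of the stateful model and `next_beams` is a hypothetical vector holding, for each beam to continue, the index of its parent in the previous batch:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>
#include <openvino/openvino.hpp>

// Tell the model which batch element's KV-cache each new beam inherits.
// [0, 1] keeps both beams as they are; [0, 0] drops beam 1 and forks beam 0.
void reorder_cache(ov::InferRequest& lm, const std::vector<int32_t>& next_beams) {
    ov::Tensor beam_idx{ov::element::i32, {next_beams.size()}};
    std::copy(next_beams.begin(), next_beams.end(), beam_idx.data<int32_t>());
    lm.set_tensor("beam_idx", beam_idx);
    // The cache itself is reordered inside the model on the next lm.infer() call.
}
```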

The images below represent stateless and stateful LLM pipelines. The model has 4 inputs:
1. `input_ids` contains the next selected token
2. `attention_mask` is filled with `1`
3. `position_ids` encodes the position of the currently generated token in the sequence
4. `beam_idx` selects beams

The model has one output, `logits`, which describes the predicted distribution over the next tokens. In addition, the hidden KV-cache state is maintained inside the model.

![](stateless.jpg)
![](stateful.jpg)
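
As an illustration, one generation step could populate these inputs as sketched below (assuming `lm` is the model's `ov::InferRequest`, and `next_token` and `position` are values tracked by the application):

```cpp
#include <algorithm>
#include <cstdint>
#include <openvino/openvino.hpp>

// Sketch of feeding the four inputs for a single generation step (single beam).
void run_step(ov::InferRequest& lm, int64_t next_token, size_t position) {
    constexpr size_t BATCH_SIZE = 1;
    // input_ids carries only the newly selected token; the past is in the KV-cache
    lm.get_tensor("input_ids").set_shape({BATCH_SIZE, 1});
    lm.get_tensor("input_ids").data<int64_t>()[0] = next_token;
    // attention_mask covers every position generated so far plus the new token
    lm.get_tensor("attention_mask").set_shape({BATCH_SIZE, position + 1});
    std::fill_n(lm.get_tensor("attention_mask").data<int64_t>(), position + 1, int64_t{1});
    // position_ids holds the absolute position of the new token in the sequence
    lm.get_tensor("position_ids").set_shape({BATCH_SIZE, 1});
    lm.get_tensor("position_ids").data<int64_t>()[0] = int64_t(position);
    // A single running beam: keep the KV-cache of batch element 0
    lm.get_tensor("beam_idx").set_shape({BATCH_SIZE});
    lm.get_tensor("beam_idx").data<int32_t>()[0] = 0;
    lm.infer();
}
```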

### greedy_causal_lm

The program loads a tokenizer, a detokenizer and a model (`.xml` and `.bin`) into OpenVINO. A prompt is tokenized and passed to the model. The model greedily generates tokens one by one until the special end-of-sequence (EOS) token is produced. The predicted tokens are converted to characters and printed in a streaming fashion.
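
A condensed sketch (not the full sample) of the greedy loop, assuming the prompt has already been run through `lm`, `run_step` is the hypothetical helper from the sketch above (or equivalent code), and `SPECIAL_EOS_TOKEN` holds the tokenizer's end-of-sequence id:

```cpp
#include <algorithm>
#include <cstdint>
#include <openvino/openvino.hpp>

// Hypothetical helper from the earlier sketch: feeds one token and runs inference.
void run_step(ov::InferRequest& lm, int64_t next_token, size_t position);

// Pick the arg-max token after each inference and stop at EOS.
void greedy_loop(ov::InferRequest& lm, size_t vocab_size, size_t prompt_len, int64_t SPECIAL_EOS_TOKEN) {
    size_t position = prompt_len;
    while (true) {
        // logits shape is [BATCH_SIZE, seq_len, vocab_size]; take the last position
        ov::Tensor logits = lm.get_tensor("logits");
        size_t seq_len = logits.get_shape().at(1);
        const float* last = logits.data<float>() + (seq_len - 1) * vocab_size;
        int64_t token = std::max_element(last, last + vocab_size) - last;
        if (token == SPECIAL_EOS_TOKEN) {
            break;  // end-of-sequence reached
        }
        // A real sample would detokenize and stream `token` here
        run_step(lm, token, position++);
    }
}
```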
2 changes: 1 addition & 1 deletion text_generation/causal_lm/cpp/beam_search_causal_lm.cpp
@@ -81,7 +81,7 @@ int main(int argc, char* argv[]) try {
// Model is stateful which means that context (kv-cache) which belongs to a particular
// text sequence is accumulated inside the model during the generation loop above.
// This context should be reset before processing the next text sequence.
// While it is not required to reset context in this sample as only one sequence is processed,
// While it is not required to reset context in this sample as only one batch of sequences is processed,
// it is called for education purposes:
lm.reset_state();
} catch (const std::exception& error) {
3 changes: 3 additions & 0 deletions text_generation/causal_lm/cpp/greedy_causal_lm.cpp
@@ -35,6 +35,7 @@ struct TextStreamer {
std::cout << std::string_view{text.data() + print_len, text.size() - print_len};
token_cache.clear();
print_len = 0;
return;
}
if (text.size() >= 3 && text.compare(text.size() - 3, 3, "�") == 0) {
// Don't print incomplete text
@@ -76,6 +77,8 @@ int main(int argc, char* argv[]) try {
position_ids.set_shape(input_ids.get_shape());
std::iota(position_ids.data<int64_t>(), position_ids.data<int64_t>() + position_ids.get_size(), 0);
constexpr size_t BATCH_SIZE = 1;
// Input values are persistent between inference calls.
// That allows setting values that are not going to change only once
lm.get_tensor("beam_idx").set_shape({BATCH_SIZE});
lm.get_tensor("beam_idx").data<int32_t>()[0] = 0;
lm.infer();
Binary file added text_generation/causal_lm/cpp/stateful.jpg
Binary file added text_generation/causal_lm/cpp/stateless.jpg
