
Commit

Merge pull request openvinotoolkit#201 from Wovchena/merge-releases/2023/3-into-master

Merge releases/2023/3 into master
Wovchena committed Jan 30, 2024
2 parents 705bf42 + 00a4a80 commit 59edbc1
Showing 5 changed files with 23 additions and 1 deletion.
19 changes: 19 additions & 0 deletions text_generation/causal_lm/cpp/README.md
@@ -4,6 +4,25 @@ These examples showcase inference of text-generation Large Language Models (LLMs

## How it works

### Stateful LLM

A common LLM inference optimization is the introduction of a past KV (key/value) cache. This cache is represented by the corresponding inputs and outputs of a model originally implemented in a DL framework (e.g. PyTorch models from HuggingFace). To optimize it further and simplify usage, the model is transformed to a stateful form. This transformation improves inference performance and decreases the amount of runtime memory allocated in long-running text generation scenarios. It is achieved by hiding the inputs and outputs of the model that represent past KV-cache tensors and handling them inside the model in a more efficient way. The cache is still accessible through the state API. This contrasts with the stateless model approach, which requires manipulating these inputs and outputs explicitly.
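
As an illustration (a minimal sketch, not taken from the sample code), the hidden cache can still be reached and cleared through the state API, assuming `lm` is an `ov::InferRequest` created from such a stateful model:

```cpp
#include <iostream>
#include <openvino/openvino.hpp>

// Sketch only: list the model's hidden variables (the KV-cache tensors)
// and reset the accumulated context before an unrelated text sequence.
void inspect_and_reset(ov::InferRequest& lm) {
    for (ov::VariableState& state : lm.query_state()) {
        // Each variable state exposes the current contents of one cache tensor
        std::cout << state.get_name() << ": " << state.get_state().get_shape() << '\n';
    }
    lm.reset_state();  // drop the accumulated KV-cache
}
```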

Hiding the KV-cache introduces a peculiarity for the beam search algorithm. Beam search relies on batched inference of multiple beams. The design described so far would result in generating multiple independent sequences of tokens. The beam search algorithm, on the other hand, requires removing some of the ongoing beams and splitting other beams into multiple branches. Beam removal requires deleting the corresponding KV-cache entry, and beam splitting requires copying the corresponding KV-cache values.

To make beam search possible without accessing the model's internal state, a stateful LLM converted with `optimum-intel` or [llm_bench](../../../llm_bench/python/) introduces an additional one-dimensional `beam_idx` input. `beam_idx` must contain the indices of the batch elements that are selected to continue evolving during the next beam search iteration. Suppose there are two running beams. To continue generating both beams at the next iteration, `beam_idx` must be `[0, 1]`, pointing to batch elements `0` and `1`. To drop the last beam and split the other beam in two, `beam_idx` must be set to `[0, 0]`; this utilizes only the part of the KV-cache corresponding to the zeroth element in the batch. The process of selecting the appropriate cache entries is called Cache Reorder.
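
A possible application-side sketch of Cache Reorder (hedged, not taken verbatim from the samples), where `lm` is the `ov::InferRequest` of the stateful model and `next_beams` is a hypothetical vector holding, for each beam to continue, the index of its parent in the previous batch:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>
#include <openvino/openvino.hpp>

// Tell the model which batch element's KV-cache each new beam inherits.
// [0, 1] keeps both beams as they are; [0, 0] drops beam 1 and forks beam 0.
void reorder_cache(ov::InferRequest& lm, const std::vector<int32_t>& next_beams) {
    ov::Tensor beam_idx{ov::element::i32, {next_beams.size()}};
    std::copy(next_beams.begin(), next_beams.end(), beam_idx.data<int32_t>());
    lm.set_tensor("beam_idx", beam_idx);
    // The cache itself is reordered inside the model on the next lm.infer() call.
}
```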

The images below represent stateless and stateful LLM pipelines. The model has 4 inputs:
1. `input_ids` contains the next selected token
2. `attention_mask` is filled with `1`
3. `position_ids` encodes the position of the currently generated token in the sequence
4. `beam_idx` selects beams

The model has one output, `logits`, which describes the predicted distribution over the next tokens. In addition, the hidden KV-cache state is maintained inside the model.

![](stateless.jpg)
![](stateful.jpg)
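
As an illustration, one generation step could populate these inputs as sketched below (assuming `lm` is the model's `ov::InferRequest`, and `next_token` and `position` are values tracked by the application):

```cpp
#include <algorithm>
#include <cstdint>
#include <openvino/openvino.hpp>

// Sketch of feeding the four inputs for a single generation step (single beam).
void run_step(ov::InferRequest& lm, int64_t next_token, size_t position) {
    constexpr size_t BATCH_SIZE = 1;
    // input_ids carries only the newly selected token; the past is in the KV-cache
    lm.get_tensor("input_ids").set_shape({BATCH_SIZE, 1});
    lm.get_tensor("input_ids").data<int64_t>()[0] = next_token;
    // attention_mask covers every position generated so far plus the new token
    lm.get_tensor("attention_mask").set_shape({BATCH_SIZE, position + 1});
    std::fill_n(lm.get_tensor("attention_mask").data<int64_t>(), position + 1, int64_t{1});
    // position_ids holds the absolute position of the new token in the sequence
    lm.get_tensor("position_ids").set_shape({BATCH_SIZE, 1});
    lm.get_tensor("position_ids").data<int64_t>()[0] = int64_t(position);
    // A single running beam: keep the KV-cache of batch element 0
    lm.get_tensor("beam_idx").set_shape({BATCH_SIZE});
    lm.get_tensor("beam_idx").data<int32_t>()[0] = 0;
    lm.infer();
}
```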

### greedy_causal_lm

The program loads a tokenizer, a detokenizer and a model (`.xml` and `.bin`) into OpenVINO. A prompt is tokenized and passed to the model. The model greedily generates tokens one by one until the special end-of-sequence (EOS) token is produced. The predicted tokens are converted to characters and printed in a streaming fashion.
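
A condensed sketch (not the full sample) of the greedy loop, assuming the prompt has already been run through `lm`, `run_step` is the hypothetical helper from the sketch above (or equivalent code), and `SPECIAL_EOS_TOKEN` holds the tokenizer's end-of-sequence id:

```cpp
#include <algorithm>
#include <cstdint>
#include <openvino/openvino.hpp>

// Hypothetical helper from the earlier sketch: feeds one token and runs inference.
void run_step(ov::InferRequest& lm, int64_t next_token, size_t position);

// Pick the arg-max token after each inference and stop at EOS.
void greedy_loop(ov::InferRequest& lm, size_t vocab_size, size_t prompt_len, int64_t SPECIAL_EOS_TOKEN) {
    size_t position = prompt_len;
    while (true) {
        // logits shape is [BATCH_SIZE, seq_len, vocab_size]; take the last position
        ov::Tensor logits = lm.get_tensor("logits");
        size_t seq_len = logits.get_shape().at(1);
        const float* last = logits.data<float>() + (seq_len - 1) * vocab_size;
        int64_t token = std::max_element(last, last + vocab_size) - last;
        if (token == SPECIAL_EOS_TOKEN) {
            break;  // end-of-sequence reached
        }
        // A real sample would detokenize and stream `token` here
        run_step(lm, token, position++);
    }
}
```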
2 changes: 1 addition & 1 deletion text_generation/causal_lm/cpp/beam_search_causal_lm.cpp
@@ -81,7 +81,7 @@ int main(int argc, char* argv[]) try {
// Model is stateful which means that context (kv-cache) which belongs to a particular
// text sequence is accumulated inside the model during the generation loop above.
// This context should be reset before processing the next text sequence.
// While it is not required to reset context in this sample as only one sequence is processed,
// While it is not required to reset context in this sample as only one batch of sequences is processed,
// it is called for education purposes:
lm.reset_state();
} catch (const std::exception& error) {
3 changes: 3 additions & 0 deletions text_generation/causal_lm/cpp/greedy_causal_lm.cpp
@@ -35,6 +35,7 @@ struct TextStreamer {
std::cout << std::string_view{text.data() + print_len, text.size() - print_len};
token_cache.clear();
print_len = 0;
return;
}
if (text.size() >= 3 && text.compare(text.size() - 3, 3, "�") == 0) {
// Don't print incomplete text
@@ -76,6 +77,8 @@ int main(int argc, char* argv[]) try {
position_ids.set_shape(input_ids.get_shape());
std::iota(position_ids.data<int64_t>(), position_ids.data<int64_t>() + position_ids.get_size(), 0);
constexpr size_t BATCH_SIZE = 1;
// Input values are persistent between inference calls.
// That allows setting values that are not going to change only once
lm.get_tensor("beam_idx").set_shape({BATCH_SIZE});
lm.get_tensor("beam_idx").data<int32_t>()[0] = 0;
lm.infer();
Binary file added text_generation/causal_lm/cpp/stateful.jpg
Binary file added text_generation/causal_lm/cpp/stateless.jpg
