
Reduce LLaMA memory usage #18181

Conversation

kunal-vaishnavi (Contributor)

Description

This PR reduces the memory usage when exporting and benchmarking LLaMA.

Motivation and Context

  • Exporting: the PyTorch model is now deleted from memory as soon as the export succeeds, rather than only after the ONNX model has also been converted to the desired precision, so the PyTorch and ONNX copies of the weights are not resident at the same time.
  • Benchmarking: in the ONNX model with GroupQueryAttention, the KV cache inputs reuse the same GPU memory for both the prompt and token-generation benchmarks instead of allocating separate buffers. A sketch of both ideas follows this list.
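
A minimal sketch of both ideas, assuming a CUDA device and placeholder model, shapes, and opset (the PR's actual export and benchmark scripts differ):

```python
import gc
import torch

# Stand-in model and inputs; the real script exports LLaMA.
model = torch.nn.Linear(16, 16).eval()
dummy_inputs = torch.randn(1, 16)
onnx_path = "model.onnx"

torch.onnx.export(model, (dummy_inputs,), onnx_path, opset_version=17)

# Free the PyTorch model right after a successful export, before the ONNX
# model is converted to the desired precision, so both copies of the weights
# are never resident at the same time.
del model
gc.collect()
if torch.cuda.is_available():
    torch.cuda.empty_cache()

# Allocate the GroupQueryAttention KV-cache buffers once and reuse the same
# GPU memory for both the prompt and token-generation benchmarks
# (shapes here are placeholders, not LLaMA's real configuration).
batch, num_kv_heads, max_seq_len, head_size, num_layers = 1, 8, 2048, 64, 2
kv_cache = [
    (
        torch.zeros(batch, num_kv_heads, max_seq_len, head_size, dtype=torch.float16, device="cuda"),
        torch.zeros(batch, num_kv_heads, max_seq_len, head_size, dtype=torch.float16, device="cuda"),
    )
    for _ in range(num_layers)
]
```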

frank-dong-ms previously approved these changes on Oct 31, 2023

@tianleiwu left a comment

LGTM.

BTW, there is no need to convert torch.Tensor -> numpy for io_binding; you can use a torch tensor directly in IO binding. See example in
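
For reference, a minimal sketch of that pattern, assuming a CUDA build of ONNX Runtime and placeholder model path, input/output names, and shapes:

```python
import numpy as np
import torch
import onnxruntime as ort

# Bind CUDA torch tensors directly via IOBinding, with no torch -> numpy
# round trip. Names and shapes below are illustrative only.
session = ort.InferenceSession("llama.onnx", providers=["CUDAExecutionProvider"])
binding = session.io_binding()

input_ids = torch.ones((1, 8), dtype=torch.int64, device="cuda")
logits = torch.empty((1, 8, 32000), dtype=torch.float16, device="cuda")

binding.bind_input(
    name="input_ids",
    device_type="cuda",
    device_id=0,
    element_type=np.int64,
    shape=tuple(input_ids.shape),
    buffer_ptr=input_ids.data_ptr(),
)
binding.bind_output(
    name="logits",
    device_type="cuda",
    device_id=0,
    element_type=np.float16,
    shape=tuple(logits.shape),
    buffer_ptr=logits.data_ptr(),
)
session.run_with_iobinding(binding)
```

Binding the tensor's data_ptr() keeps the data on the GPU and avoids the device-to-host copy that a numpy conversion would force.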

@kunal-vaishnavi kunal-vaishnavi merged commit d1b85f5 into microsoft:main Nov 1, 2023
81 of 85 checks passed
tianleiwu pushed a commit that referenced this pull request Nov 1, 2023
kleiti pushed a commit to kleiti/onnxruntime that referenced this pull request Mar 22, 2024