
Reduce LLaMA memory usage #18181

Conversation

kunal-vaishnavi (Contributor)

Description

This PR reduces the memory usage when exporting and benchmarking LLaMA.

Motivation and Context

  • Exporting: the PyTorch model is now deleted from memory as soon as the export succeeds, rather than only after the ONNX model has also been converted to the desired precision, so the PyTorch and ONNX copies of the weights are not resident at the same time.
  • Benchmarking: in the ONNX model with GroupQueryAttention, the KV cache inputs reuse the same GPU memory for both the prompt and token-generation benchmarks instead of allocating separate buffers. A sketch of both ideas follows this list.
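
A minimal sketch of both ideas, assuming a CUDA device and placeholder model, shapes, and opset (the PR's actual export and benchmark scripts differ):

```python
import gc
import torch

# Stand-in model and inputs; the real script exports LLaMA.
model = torch.nn.Linear(16, 16).eval()
dummy_inputs = torch.randn(1, 16)
onnx_path = "model.onnx"

torch.onnx.export(model, (dummy_inputs,), onnx_path, opset_version=17)

# Free the PyTorch model right after a successful export, before the ONNX
# model is converted to the desired precision, so both copies of the weights
# are never resident at the same time.
del model
gc.collect()
if torch.cuda.is_available():
    torch.cuda.empty_cache()

# Allocate the GroupQueryAttention KV-cache buffers once and reuse the same
# GPU memory for both the prompt and token-generation benchmarks
# (shapes here are placeholders, not LLaMA's real configuration).
batch, num_kv_heads, max_seq_len, head_size, num_layers = 1, 8, 2048, 64, 2
kv_cache = [
    (
        torch.zeros(batch, num_kv_heads, max_seq_len, head_size, dtype=torch.float16, device="cuda"),
        torch.zeros(batch, num_kv_heads, max_seq_len, head_size, dtype=torch.float16, device="cuda"),
    )
    for _ in range(num_layers)
]
```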

frank-dong-ms previously approved these changes on Oct 31, 2023

@tianleiwu left a comment

LGTM.

BTW, there is no need to convert torch.Tensor -> numpy for io_binding; you can use a torch tensor directly in IO binding. See example in
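
For reference, a minimal sketch of that pattern, assuming a CUDA build of ONNX Runtime and placeholder model path, input/output names, and shapes:

```python
import numpy as np
import torch
import onnxruntime as ort

# Bind CUDA torch tensors directly via IOBinding, with no torch -> numpy
# round trip. Names and shapes below are illustrative only.
session = ort.InferenceSession("llama.onnx", providers=["CUDAExecutionProvider"])
binding = session.io_binding()

input_ids = torch.ones((1, 8), dtype=torch.int64, device="cuda")
logits = torch.empty((1, 8, 32000), dtype=torch.float16, device="cuda")

binding.bind_input(
    name="input_ids",
    device_type="cuda",
    device_id=0,
    element_type=np.int64,
    shape=tuple(input_ids.shape),
    buffer_ptr=input_ids.data_ptr(),
)
binding.bind_output(
    name="logits",
    device_type="cuda",
    device_id=0,
    element_type=np.float16,
    shape=tuple(logits.shape),
    buffer_ptr=logits.data_ptr(),
)
session.run_with_iobinding(binding)
```

Binding the tensor's data_ptr() keeps the data on the GPU and avoids the device-to-host copy that a numpy conversion would force.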

@kunal-vaishnavi kunal-vaishnavi merged commit d1b85f5 into microsoft:main Nov 1, 2023
81 of 85 checks passed
tianleiwu pushed a commit that referenced this pull request Nov 1, 2023
kleiti pushed a commit to kleiti/onnxruntime that referenced this pull request Mar 22, 2024