
[Feature] support Nemotron-4 340B #1784

Open
zhyncs opened this issue Jun 15, 2024 · 3 comments

Comments

zhyncs (Collaborator) commented Jun 15, 2024

zhyncs (Collaborator, Author) commented Jun 15, 2024

The model weights on Hugging Face are a little weird.

zhyncs (Collaborator, Author) commented Jul 3, 2024

According to the leaderboard at https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard, Nemotron 4 340B Instruct currently ranks behind Llama 3 70B Instruct. Because of its huge parameter count, a single machine with 8 A100s cannot hold the weights without quantization; the official approach is a single machine with 8 H100s combined with fp8. vLLM also has support plans, expected to go through fp8 quantization or pipeline parallelism. Both of these methods currently have relatively long implementation cycles in LMDeploy, and the priority for supporting this model is not high, so it may be worth holding off on this for now.
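For context, a rough back-of-envelope memory estimate (a sketch only: 340B parameters, 80 GB per GPU, and weights-only memory are assumptions here, ignoring KV cache and activations) illustrates why bf16 weights cannot fit on an 8x A100-80GB machine while fp8 weights can fit on an 8x H100-80GB machine:

```python
# Rough weights-only memory estimate for a 340B-parameter model.
# Assumptions (not from the issue): 80 GB per GPU, no KV cache/activations.

PARAMS = 340e9  # Nemotron-4 340B parameter count

def weight_mem_gb(num_params: float, bytes_per_param: int) -> float:
    """Approximate memory for model weights alone, in GB."""
    return num_params * bytes_per_param / 1e9

bf16_gb = weight_mem_gb(PARAMS, 2)  # bf16: 2 bytes per parameter -> 680 GB
fp8_gb = weight_mem_gb(PARAMS, 1)   # fp8:  1 byte per parameter  -> 340 GB

node_gb = 8 * 80  # one machine with 8x 80GB GPUs -> 640 GB total

print(f"bf16 weights: {bf16_gb:.0f} GB, fits on one 8x80GB node? {bf16_gb < node_gb}")
print(f"fp8  weights: {fp8_gb:.0f} GB, fits on one 8x80GB node? {fp8_gb < node_gb}")
```

So bf16 weights alone (~680 GB) already exceed the ~640 GB of an 8-GPU 80 GB node, while fp8 (~340 GB) leaves headroom for KV cache, which matches the "8x H100 + fp8" official setup described above.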

zhyncs (Collaborator, Author) commented Jul 21, 2024

Multi-node inference or quantization is needed.
