According to the leaderboard at https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard, Nemotron 4 340B Instruct currently ranks behind Llama 3 70B Instruct. Because of its huge number of weights, a single machine with 8 A100s cannot hold it without quantization; the official approach is a single machine with 8 H100s combined with FP8. vLLM also plans to support it, likely through FP8 quantization or pipeline parallelism. Both of these approaches would take relatively long to implement in LMDeploy, and the priority for supporting this model is not high, so we may consider holding off on it for now.
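For context, here is a back-of-the-envelope weight-memory estimate showing why the FP16 weights do not fit on 8 GPUs while FP8 does. This is a rough sketch, not an official sizing guide: it assumes 80 GB A100/H100 variants and counts raw weights only, ignoring KV cache, activations, and runtime overhead.

```python
# Rough weight-memory estimate for Nemotron-4-340B.
# Assumptions (not from this issue): 80 GB per GPU, weights only;
# KV cache, activations, and framework overhead are ignored.

PARAMS = 340e9  # parameter count


def weights_gb(bytes_per_param: float) -> float:
    """Memory needed to hold the raw weights, in GB."""
    return PARAMS * bytes_per_param / 1e9


GPU_BUDGET_GB = 8 * 80  # 8 GPUs x 80 GB each

for name, bytes_per_param in [("bf16/fp16", 2), ("fp8", 1)]:
    need = weights_gb(bytes_per_param)
    verdict = "fits" if need < GPU_BUDGET_GB else "does not fit"
    print(f"{name}: {need:.0f} GB needed vs {GPU_BUDGET_GB} GB available -> {verdict}")

# bf16/fp16: 680 GB vs 640 GB -> does not fit
# fp8:       340 GB vs 640 GB -> fits (leaving ~300 GB for KV cache)
```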
Motivation
As titled. @lvhan028 @lzhangzz @grimoire
blog: https://research.nvidia.com/publication/2024-06_nemotron-4-340b
tech report: https://d1qx31qr3h6wln.cloudfront.net/publications/Nemotron_4_340B_8T_0.pdf
hf:
https://huggingface.co/nvidia/Nemotron-4-340B-Base
https://huggingface.co/nvidia/Nemotron-4-340B-Instruct
https://huggingface.co/nvidia/Nemotron-4-340B-Reward
Related resources
No response
Additional context
No response