Add unit test for Mixtral MoE layer #2677
Conversation
```
@@ -141,8 +142,9 @@ def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
            selected_experts,
            inplace=True)

    final_hidden_states = tensor_model_parallel_all_reduce(
        final_hidden_states)
    if self.tp_size > 1:
```
I'm curious what the thoughts on this are. On the one hand, there is a ton of value in being able to run all the layers in a single process without having to stand up the distributed environment; on the other hand, this is redundant with the check in tensor_model_parallel_all_reduce that tests for get_tensor_model_parallel_world_size() == 1 (but we can't use that here, since get_tensor_model_parallel_world_size already needs the distributed environment). I considered monkey patching that function, but that doesn't seem great either.
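For context, a minimal sketch of the guard being discussed; the class is illustrative (not the actual vLLM layer), and the import path for tensor_model_parallel_all_reduce is an assumption that may differ between vLLM versions:

```python
import torch
import torch.nn as nn


class MoELayerSketch(nn.Module):
    """Illustrative only: shows why the explicit tp_size guard helps single-process tests."""

    def __init__(self, tp_size: int):
        super().__init__()
        self.tp_size = tp_size

    def forward(self, final_hidden_states: torch.Tensor) -> torch.Tensor:
        # tensor_model_parallel_all_reduce already returns its input unchanged
        # when get_tensor_model_parallel_world_size() == 1, but that check
        # requires an initialized distributed environment. Guarding on
        # self.tp_size lets a plain single-process unit test skip the
        # collective entirely.
        if self.tp_size > 1:
            # Import path is an assumption; it has moved between vLLM versions.
            from vllm.model_executor.parallel_utils.communication_op import (
                tensor_model_parallel_all_reduce)
            final_hidden_states = tensor_model_parallel_all_reduce(
                final_hidden_states)
        return final_hidden_states
```

With tp_size=1 the layer never touches torch.distributed, which is what makes running it in a single-process unit test possible.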
https://buildkite.com/vllm/ci/builds/744#018d5d59-27fa-4a6c-bc07-f15bd8d544fc/51-61
I understand the problem; it comes from using PyTorch with CUDA and pytest-forked. I think the best solution (short of using pytest-xdist instead of pytest-forked, which is a big change) is to isolate the layer-wise tests from the model test and just use pytest without --forked for the layer-wise tests (since these should be fast).
I’m fine with completely not using forked here as well.
OK, sounds good, let's do that then. If it gets too slow going forward, we can switch to pytest-xdist :)
I also added the end-to-end test for Mixtral. EDIT: It doesn't fit into the GPU memory; could have seen that coming.
I had to shift it to
I’m just curious: is there any way this implementation can run FP16 and match the original implementation without a large difference in logits? If not, do we understand why running both in FP16 gives such a large difference?
Even in bfloat16 the difference is not that large. The best precision is in float32 without tensor cores (see https://pytorch.org/docs/stable/notes/cuda.html#tf32-on-ampere), and that's the best setting to check the correctness of the algorithm. But if we want the highest performance, we have to sacrifice some accuracy -- in the future we will probably offer the option to do this arithmetic in fp8, which will be even less accurate, and then we will probably have to do some scaling to get good results.

I don't think this algorithm is fundamentally less numerically stable than what we had before; it is all just a big blocked matrix multiplication. The only difference is that this one is more cache efficient (and does less work, because we only do the work for the experts that are actually used).
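For concreteness, a minimal sketch of how such a "float32 without tensor cores" reference can be set up, using the standard PyTorch TF32 switches from the linked docs (illustrative, not necessarily the exact code in the test):

```python
import torch

# Disable TF32 so float32 matmuls run at full precision on Ampere GPUs
# (see the PyTorch note on TF32 linked above).
torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.allow_tf32 = False
```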
Here are the numerical differences for different dtypes (the maximum absolute difference of the states after the MoE layer); also note these numbers are a little random. And with float32 without tensor cores it is:
Btw, one more point of comparison: even if you take just the HuggingFace implementation, look at the difference between evaluating it in float16 vs. float32:

```python
import torch
from transformers import MixtralConfig
from transformers.models.mixtral.modeling_mixtral import MixtralSparseMoeBlock

config = MixtralConfig()
hf_moe = MixtralSparseMoeBlock(config).to("cuda")
inputs = torch.randn((1, 16, config.hidden_size)).to("cuda")
hf_states1, _ = hf_moe.forward(inputs)
hf_states2, _ = hf_moe.to(torch.float16).forward(inputs.to(torch.float16))
print("diff", torch.max(abs(hf_states1 - hf_states2.to(torch.float32))))
```

```
diff tensor(0.0003, device='cuda:0')
```

And this is almost the best possible case -- if you compare float16 weights on float16 inputs with bfloat16 weights on bfloat16 inputs, you get much worse (namely 0.0032 error). So we can actually be pretty happy about the accuracy we are getting. There is a decent amount of inherent inaccuracy in the problem itself :)
@casper-hansen I have also added tests for the other dtypes now :)
This PR adds a unit test for the Mixtral MoE layer to vLLM.
It is based on @casper-hansen's test in https://github.com/casper-hansen/AutoAWQ/blob/mixtral_fused/tests/test_fused_moe.py
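Roughly, the test follows the same structure as the linked AutoAWQ test: build the HuggingFace MixtralSparseMoeBlock as a float32 reference, copy its weights into the vLLM MoE layer, run both on the same random input, and compare the outputs within a small tolerance. A hedged sketch of that structure (the vLLM-side construction and weight-copy details below are assumptions and depend on the vLLM version):

```python
import torch
from transformers import MixtralConfig
from transformers.models.mixtral.modeling_mixtral import MixtralSparseMoeBlock

torch.manual_seed(0)
config = MixtralConfig(num_local_experts=8, num_experts_per_tok=2)

# Reference: the HuggingFace sparse MoE block in float32.
hf_moe = MixtralSparseMoeBlock(config).to("cuda", dtype=torch.float32)
inputs = torch.randn((1, 16, config.hidden_size), device="cuda", dtype=torch.float32)
ref_states, _ = hf_moe(inputs)

# Layer under test: the vLLM Mixtral MoE layer with weights copied from hf_moe,
# so both compute the same function. Construction and weight-copy details are
# assumptions here, e.g. roughly:
#   vllm_moe = MixtralMoE(...)                      # hypothetical constructor
#   copy the gate / w1 / w2 / w3 weights over from hf_moe
#   out_states = vllm_moe(inputs.view(-1, config.hidden_size)).view_as(ref_states)

# Finally, the outputs are compared within a dtype-dependent tolerance:
#   torch.testing.assert_close(out_states, ref_states, rtol=..., atol=...)
```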