
The instantiation of Multi-head PA and the design choice of MAM adapter. #18

Open
JacobYuan7 opened this issue Jul 3, 2022 · 3 comments

Comments

JacobYuan7 commented Jul 3, 2022

Thanks for your great work!
I have read your paper, but I am a bit confused about two things.

(1) The instantiation of Multi-head PA. How can Multi-head PA (r=30) be instantiated so that it has the same number of tuned parameters as PA (attn, r=30), as reported in Table 4 of the main paper? My initial thought was that Multi-head PA's tuned parameters would be N_h times those of PA.

(2) The design choice of the MAM adapter. As I understand it, MH PA (attn, r=30) is slightly better than prefix tuning (l=30) according to Table 4 (35.3 > 35.2), and previous papers such as LoRA report that prefix tuning is unstable to optimize. Nevertheless, MAM adopts prefix tuning. Is there a specific reason for this?

Would you mind giving me any clues about these two questions?

JacobYuan7 changed the title from "The instantiation of Multi-head PA and its parameters." to "The instantiation of Multi-head PA and the design choice of MAM adapter." on Jul 3, 2022
jxhe (Owner) commented Jul 7, 2022

Thanks for your interest! For your questions:

  1. MH PA and PA use the same number of parameters when their r is the same: in transformers the attention output of each head has dimension d/N_h, while the full attention output has dimension d.

  2. Optimizing MH PA (attn) is about as difficult as prefix tuning (while PA is much more stable), and 35.3 vs. 35.2 is not really a meaningful difference. So there is no specific reason to adopt prefix tuning in MAM; I actually recall that using MH PA in MAM gave similar performance.
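
To make the count in point 1 concrete, here is a minimal sketch (my own illustration, not code from this repo) that assumes PA uses a single d -> r -> d bottleneck while MH PA places a (d/N_h) -> r -> (d/N_h) bottleneck on each head; the sizes are purely illustrative:

```python
# Parameter-count sketch (illustration only, not code from this repo).
# PA: one bottleneck d -> r -> d.
# MH PA: one bottleneck (d/N_h) -> r -> (d/N_h) per head, N_h heads in total.

def pa_params(d: int, r: int) -> int:
    # down-projection d x r plus up-projection r x d (biases ignored)
    return d * r + r * d

def mh_pa_params(d: int, r: int, n_heads: int) -> int:
    # each head's adapter lives in the d/N_h-dimensional head space
    d_head = d // n_heads
    return n_heads * (d_head * r + r * d_head)

if __name__ == "__main__":
    d, r, n_heads = 1024, 30, 16            # illustrative sizes
    print(pa_params(d, r))                  # 61440 = 2*d*r
    print(mh_pa_params(d, r, n_heads))      # 61440 as well
```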

JacobYuan7 (Author) commented Jul 9, 2022


Thanks for your reply! But I am still a bit confused about question 1.

For PA, the adapter has parameters of size 2*d*r.
For MH PA, since the design is parallel to the attention module, each head's adapter takes x of dimension d as input. Then the parameters are of size (d*r + r*d/N_h)*N_h, which is not equal to the term above.
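
A quick numeric check of this count (illustrative sizes only, under my assumption that each per-head adapter takes the full d-dimensional x as input):

```python
# Quick check (illustration only): compare the two counts above.
d, r, n_heads = 1024, 30, 16                         # illustrative sizes
pa = 2 * d * r                                       # 61440
mh_pa_full_x = n_heads * (d * r + r * d // n_heads)  # 522240
print(pa, mh_pa_full_x)                              # the two counts clearly differ
```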

Correct me if I am wrong in the calculation. Thanks!

jxhe (Owner) commented Jul 28, 2022

Hi, sorry for getting back to you so late! (I have kind of been in post-graduation vacation mode recently...)

Back to your question: in our implementation, the input x to MH PA is effectively xW_q on each head, to be exactly comparable to prefix tuning. Since W_q maps d to d/N_h on each head, each head's adapter is a (d/N_h) -> r -> (d/N_h) bottleneck, i.e. 2*r*d/N_h parameters per head, so the total number of parameters is 2*d*r.
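
Roughly, the per-head adapter then looks like the sketch below (my own illustration of this reading, not the actual code in this repo; the ReLU and the sizes are just assumptions):

```python
# Sketch of a multi-head parallel adapter whose per-head input is x W_q
# (illustration only, not this repo's implementation).
import torch
import torch.nn as nn

class MultiHeadParallelAdapter(nn.Module):
    def __init__(self, d_model: int, n_heads: int, r: int):
        super().__init__()
        self.n_heads = n_heads
        d_head = d_model // n_heads
        # one (d/N_h) -> r -> (d/N_h) bottleneck per head, no biases
        self.down = nn.ModuleList(nn.Linear(d_head, r, bias=False) for _ in range(n_heads))
        self.up = nn.ModuleList(nn.Linear(r, d_head, bias=False) for _ in range(n_heads))

    def forward(self, q_heads: torch.Tensor) -> torch.Tensor:
        # q_heads: (batch, seq, n_heads, d_head), i.e. x W_q split per head
        outs = [self.up[i](torch.relu(self.down[i](q_heads[..., i, :])))  # ReLU is an assumption
                for i in range(self.n_heads)]
        return torch.stack(outs, dim=-2)  # (batch, seq, n_heads, d_head)

d, n_heads, r = 1024, 16, 30                                    # illustrative sizes
adapter = MultiHeadParallelAdapter(d, n_heads, r)
print(sum(p.numel() for p in adapter.parameters()))             # 61440 = 2*d*r
print(adapter(torch.randn(2, 5, n_heads, d // n_heads)).shape)  # (2, 5, 16, 64)
```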
