
Use FusedMatMul When Transpose is Between First Dim and Contiguous Batch Dims #9734

Merged
merged 6 commits on Dec 27, 2021

Conversation

centwang (Contributor)

Currently FusedMatMul supports Transpose only on the last 2 dims. When the 2-D matrices for MatMul are formed from the 1st and last dims, and the batch dims are contiguous in the original tensor, we can also use GemmStridedBatched to compute the result without performing the Transpose. The perm pattern of such a Transpose is [1,2,0,3] or [1,2,3,0]. This PR supports these cases using FusedMatMul.

For a perf comparison using a module with Add+EinSum("ks,ksm->sm")+MSELoss, with K = 16, S = 7840, M = 2048: before the changes each step takes ~7ms; after the changes it takes ~4.5ms, which is similar perf to PyTorch.

Using ULR-XL (16 layers) for a perf test: before the changes, the execution graph has 195 Transpose nodes, 16 MatMul nodes and 306 FusedMatMul nodes; after the changes the numbers are 131 Transpose nodes and 322 FusedMatMul nodes. From nvvp profiling, per-step execution time drops from ~913ms to ~882ms, a ~4% improvement. The gain comes from the reduced Transpose compute, and the new fused FusedMatMul nodes use GemmStridedBatched, which has comparable perf to the original MatMul nodes.
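A minimal CPU sketch of why the Transpose can be skipped (illustrative shapes and names; this code is not part of the PR): when A has shape [M, B1, B2, K] and is transposed with perm=[1,2,0,3] before MatMul, each batch matrix is already a strided view of the original buffer, with a constant batch stride of K and a constant row stride of B1*B2*K, which is the kind of strideA / leading-dimension information GemmStridedBatched consumes.

// stride_trick_demo.cc -- editorial sketch, not part of the PR diff.
#include <cassert>
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
  const int M = 2, B1 = 3, B2 = 4, K = 5, N = 6;
  // A: [M, B1, B2, K], B: [B1, B2, K, N], both contiguous row-major.
  std::vector<float> A(M * B1 * B2 * K), B(B1 * B2 * K * N);
  for (size_t i = 0; i < A.size(); ++i) A[i] = 0.01f * static_cast<float>(i);
  for (size_t i = 0; i < B.size(); ++i) B[i] = 0.02f * static_cast<float>(i);

  // Reference: materialize At = Transpose(A, perm=[1,2,0,3]) with shape [B1, B2, M, K],
  // then run an ordinary batched MatMul over the B1*B2 batches.
  std::vector<float> At(A.size()), C_ref(B1 * B2 * M * N, 0.f), C_fused(B1 * B2 * M * N, 0.f);
  for (int m = 0; m < M; ++m)
    for (int b1 = 0; b1 < B1; ++b1)
      for (int b2 = 0; b2 < B2; ++b2)
        for (int k = 0; k < K; ++k)
          At[((b1 * B2 + b2) * M + m) * K + k] = A[((m * B1 + b1) * B2 + b2) * K + k];
  for (int b = 0; b < B1 * B2; ++b)
    for (int m = 0; m < M; ++m)
      for (int n = 0; n < N; ++n)
        for (int k = 0; k < K; ++k)
          C_ref[(b * M + m) * N + n] += At[(b * M + m) * K + k] * B[(b * K + k) * N + n];

  // Fused view: no Transpose. Batch b of the transposed A starts at offset b*K in the
  // original buffer, and consecutive rows of that batch matrix are B1*B2*K apart.
  // A constant batch stride plus a constant row stride is exactly what
  // GemmStridedBatched needs.
  const int batch_stride = K, row_stride = B1 * B2 * K;
  for (int b = 0; b < B1 * B2; ++b)
    for (int m = 0; m < M; ++m)
      for (int n = 0; n < N; ++n)
        for (int k = 0; k < K; ++k)
          C_fused[(b * M + m) * N + n] +=
              A[b * batch_stride + m * row_stride + k] * B[(b * K + k) * N + n];

  for (size_t i = 0; i < C_ref.size(); ++i) assert(std::fabs(C_ref[i] - C_fused[i]) < 1e-4f);
  std::printf("strided view of A matches explicit Transpose + MatMul\n");
  return 0;
}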

centwang added the training label (issues related to ONNX Runtime training; typically submitted using template) on Nov 11, 2021
pengwa (Contributor) commented Dec 13, 2021

This is a nice change!! Some side notes here FYI: APEX and other libs I investigated last week also use this trick; it applies to models whose self-attention input has shape [seq, batch, num_head, head_dim]. We would remove at least two Transposes plus a scaling multiplication (sqrt(num_head)) for the BERT-large case. @iK1D @SherlockNoMad
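A concrete shape trace of where these perms show up (an illustrative sketch of the usual scaled-dot-product attention; not something stated in this PR):

  Q, K: [seq, batch, num_head, head_dim]
  Q --Transpose(perm=[1,2,0,3])--> [batch, num_head, seq, head_dim]
  K --Transpose(perm=[1,2,3,0])--> [batch, num_head, head_dim, seq]
  scores = MatMul(Q', K') * scale  -> [batch, num_head, seq, seq]

With this PR both Transposes can fold into a single FusedMatMul over batch*num_head strided batches, and the scaling can presumably ride along as FusedMatMul's alpha attribute, which matches the "two transposes + a scaling" saving mentioned above.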

static Node* GetTransposeNodeFromOutput(Graph& graph, NodeArg& node_arg) {
// is_trans is whether to transpose the 2 dims used to MatMul.
// is_trans_batch is whether to transpose 1st dim and batch dims (dim-1 to dim-rank-2).
// For example:
Contributor

It would be nice if we could give more descriptive comments covering exactly which cases we target to fuse.

An example FYI

/* Here we check input and mask dimensions are as expected:

Contributor

Also, we need a definition of 'batch' here.

Contributor

I think it is better to use a different word than "batch" because that term is used with respect to the training batch. Maybe something like "range" would be okay.

satyajandhyala (Contributor) Dec 17, 2021

may be "circular permutation" is more clear.
1->0, 2->1, ..,r->r-1, 0->r.

centwang (Contributor, Author)

CUDA's APIs (GemmBatched, GemmStridedBatched) use the same name, and our MatMul code also calls them batches. I think we should still call it batch here, but add more comments to explain.
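For reference, this is the cuBLAS entry point in question (single-precision variant, shown only to illustrate where the batch/stride terminology comes from):

cublasStatus_t cublasSgemmStridedBatched(cublasHandle_t handle,
                                         cublasOperation_t transa, cublasOperation_t transb,
                                         int m, int n, int k,
                                         const float* alpha,
                                         const float* A, int lda, long long int strideA,
                                         const float* B, int ldb, long long int strideB,
                                         const float* beta,
                                         float* C, int ldc, long long int strideC,
                                         int batchCount);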

}

if (!is_trans_on_last_two_dims) {
return nullptr;
// Transpose node can be fused to MatMul when the batch dimensions have same order before and after transpose.
Contributor

nit: change to "the batch dims keep same relative orders before and after transpose"?

satyajandhyala (Contributor) Dec 16, 2021

Introducing the notion of "circular permutation" is really helpful for understanding the code here.

// is_trans is whether to transpose the 2 dims used to MatMul.
// is_trans_batch is whether to transpose 1st dim and batch dims (dim-1 to dim-rank-2).
// For example:
// is_trans=False, is_trans_batch=False: [0,1,2,3]
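The excerpt above is cut off after the first example line; reconstructed from the PR description (so the lines below are an assumption, not a copy of the diff), the remaining flag combinations for a rank-4 input are presumably:

// is_trans=True,  is_trans_batch=False: [0,1,3,2]
// is_trans=False, is_trans_batch=True:  [1,2,0,3]
// is_trans=True,  is_trans_batch=True:  [1,2,3,0]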
Contributor

We should not do the fusion for the [0,1,2,3,...] case, right?

left_ld_factor_ = right_ld_factor_ = 1;

if (trans_batch_a || trans_batch_b) {
ORT_ENFORCE(left_num_dims > 2 && left_num_dims == right_num_dims, "Two input should have same rank and rank >= 3 if transBatchA or transBatchB is true");
Contributor

Suggested change
ORT_ENFORCE(left_num_dims > 2 && left_num_dims == right_num_dims, "Two input should have same rank and rank >= 3 if transBatchA or transBatchB is true");
ORT_ENFORCE(left_num_dims > 2 && left_num_dims == right_num_dims, "Two inputs should have same rank and rank >= 3 if transBatchA or transBatchB is true");

pengwa (Contributor) commented Dec 17, 2021

The change looks great overall! There are a few things I need your help to confirm:

  1. In your measured case, how many Transposes get eliminated per layer? From 195 to 131 for 16 layers, so is it 4 Transposes per layer?
  2. Have we covered the backward pass for the fused MatMul?

centwang (Contributor, Author)

I didn't check the big graph carefully, but from the numbers, yes, it's 4 per layer. From the code, the fusion is added to both the training and inference transformer lists, so ideally the backward pass is also covered. But we build the gradient graph after the training transformers run, and we use FusedMatMul instead of MatMul in the backward graph, so I think it's rare to have such a case in the backward pass that we can fuse.

satyajandhyala (Contributor) commented Dec 17, 2021

This is good.
Going by the definition at https://mathworld.wolfram.com/CyclicPermutation.html, this is applying a cyclic permutation to the left by 1. Is it possible to extend the idea further and generalize to any shift r less than the number of dimensions?

centwang (Contributor, Author)

I don't quite get the idea. Could you please give an example? I.e., what would the 'perm' attribute of the Transpose nodes be?

satyajandhyala (Contributor) commented Dec 20, 2021

This change supports [1,2,0,3] or [1,2,3,0]. In the future, not in this PR, could we also consider permutations like [2,0,1,3] or [2,3,0,1]?

centwang (Contributor, Author)

I have the comments below in the code to explain which cases we can fuse. For [2,0,1,3] or [2,3,0,1], it's not possible to derive strideA, strideB, lda, ldb for the GemmStridedBatched parameters, so we cannot fuse such cases.

// Transpose node can be fused to MatMul when the batch dims keep same relative orders before and after transpose.
// But if they are not contiguous, after the fusion, we can only use GemmBatched instead of GemmStridedBatched,
// which may have perf issue. To keep it simple, we will fuse only when batch dimensions are contiguous.
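A worked offset check (an editorial sketch, not taken from the PR) makes the contiguity requirement concrete. For an input of shape [d0,d1,d2,d3] stored row-major, element (i0,i1,i2,i3) lives at offset i0*d1*d2*d3 + i1*d2*d3 + i2*d3 + i3.

// perm = [1,2,0,3]: batch dims are (d1,d2), matrix dims are (d0,d3).
//   The start of batch b = i1*d2 + i2 is i1*d2*d3 + i2*d3 = b*d3, a single constant
//   stride, and matrix rows are d1*d2*d3 apart, so GemmStridedBatched applies.
// perm = [2,0,1,3]: batch dims are (d2,d0), matrix dims are (d1,d3).
//   The start of batch b = i2*d0 + i0 is i0*d1*d2*d3 + i2*d3, which is not b times any
//   single constant because d0 and d2 are not adjacent in the original layout; only a
//   per-batch pointer array (GemmBatched) would work, hence no fusion.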

pengwa (Contributor) left a comment

Sorry for the late response! LGTM!! :)
