
Fixes selection of cuDNN algorithm #15881

Closed
wants to merge 7 commits

Conversation

syed-ahmed
Collaborator

@syed-ahmed syed-ahmed commented Jan 9, 2019

Summary:

This PR updates the logic for using cudnnGet* and cudnnFind*. The current (v7) versions of cudnnFind and cudnnGet return a pair: the best algorithm and the convDesc mathType. While we were using the returned algorithm, we weren't updating the mathType. As a result, we ended up with a slow combination of algorithm and math type. Without this patch, we are seeing a 10x regression in group convolutions.

Changelist:

  • Changed the template arguments to be `perf_t` instead of `algo_t` to unify cudnnFind and cudnnGet. Both serve the same purpose, so it made sense to unify them and get rid of `getAlgorithm`.
  • Used cudnnGet*_v7 everywhere cudnnGet* was being used.
  • Removed all cudnn6 paths (this PR depends on #15851, "Remove support for CUDNN 6").
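To illustrate the pattern the changelist describes (a hypothetical Python mock, not the actual C++ in native/cudnn/Conv.cpp — the names `Perf`, `choose_algorithm`, and `ConvDescriptor` are stand-ins): both cudnnFind and cudnnGet*_v7 return perf entries carrying an algorithm *and* a math type, and both fields must be applied to the convolution descriptor.

```python
from collections import namedtuple

# Stand-in for cudnnConvolutionFwdAlgoPerf_t: cudnnFind*/cudnnGet*_v7
# both return a list of these, which lets the two paths share one codepath.
Perf = namedtuple("Perf", ["algo", "math_type", "time"])

def choose_algorithm(perf_results):
    """Return the fastest perf entry; the caller must apply BOTH fields."""
    return min(perf_results, key=lambda p: p.time)

class ConvDescriptor:
    """Stand-in for the cuDNN convolution descriptor."""
    def __init__(self):
        self.math_type = "CUDNN_DEFAULT_MATH"

    def set_math_type(self, math_type):
        # Mirrors cudnnSetConvolutionMathType.
        self.math_type = math_type

# Made-up search results: the same algo can appear under two math types
# with different timings.
results = [
    Perf(algo=1, math_type="CUDNN_TENSOR_OP_MATH", time=0.006),
    Perf(algo=1, math_type="CUDNN_DEFAULT_MATH", time=0.004),
]

best = choose_algorithm(results)
cdesc = ConvDescriptor()
# The fix: propagate the returned math type onto the descriptor, since
# cuDNN keys its kernel choice on the (algo, mathType) pair.
cdesc.set_math_type(best.math_type)
```

Applying only `best.algo` and leaving the descriptor's math type untouched is exactly the bug this PR fixes.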

CC: @ngimel @csarofeen for review

Nvprof:

  • Before: on the left is the timeline for cudnnFind, and on the right the one for cudnnGet. In the cudnnFind timeline, the first few kernel calls are the find step. Note that the chosen algorithm is not the fastest one, because the chosen algo1 is run with mathType 1. [nvprof screenshot: before]
  • After: compared to the timeline above, when the mathType is updated, the fastest algo is indeed chosen. (In cuDNN 7.4, algo1 + mathType 0 gives better perf.) [nvprof screenshot: after]

Timing:

  • Current PyTorch nightly container: 0.006510331630706787 s/iter
  • With this patch: 0.004045917987823487 s/iter
  • Speedup: ~1.6x

Code used in analysis:

import time

import torch
import torch.nn as nn

print("cudnn version", torch.backends.cudnn.version())

conv = nn.Conv2d(16, 256, (3, 3), dilation=(2, 2), padding=(1, 1)).cuda().half()
torch.backends.cudnn.benchmark = True

input = torch.randn(64, 16, 56, 56, device="cuda").half()
out = conv(input)
gO = torch.rand_like(out)
torch.cuda.synchronize()

# Time 100 forward+backward iterations.
s = time.time()
for i in range(100):
    out = conv(input.detach())
    out.backward(gO)
torch.cuda.synchronize()
e = time.time()
print((e - s) / 100)


// update convDesc mathType since cudnn now requires both algo + mathType to figure out
// whether to use Tensor cores or not
AT_CUDNN_CHECK(cudnnSetConvolutionMathType(args.cdesc.mut_desc(), fwdAlgPerf.mathType));
Contributor

I'm a little confused here: why don't you modify the convolution descriptor before calling chooseAlgorithm?

Collaborator Author

That is because calls to cudnnSetConvolutionMathType before chooseAlgorithm don't matter. I only needed to set the convolution math type to the one returned by chooseAlgorithm, so I deleted the calls to cudnnSetConvolutionMathType in the descriptor's setter and instead call it directly after chooseAlgorithm at all the subsequent call sites. It turns out the mathType returned by chooseAlgorithm can differ from what we set beforehand in the setter, so we have to explicitly update it after chooseAlgorithm has found the best algorithm+mathType pair. Otherwise, even though we'll call cudnnConvolutionForward with the fastest algorithm, under the hood cuDNN will run a slower kernel, since it sees the fastest algorithm paired with a suboptimal mathType.
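To put that in other words (a hypothetical mock with made-up timings, not real cuDNN behavior measurements): cuDNN dispatches on the (algo, mathType) pair, so the fastest algorithm under a stale math type can still map to a slow kernel.

```python
# Hypothetical mock: cuDNN picks the actual kernel from BOTH the algorithm
# and the descriptor's current math type. Timings below are made up.
KERNEL_TIME = {
    (1, "CUDNN_DEFAULT_MATH"): 0.004,
    (1, "CUDNN_TENSOR_OP_MATH"): 0.006,
}

def run_convolution(algo, descriptor_math_type):
    # Stand-in for cudnnConvolutionForward: the kernel actually dispatched
    # depends on the math type currently set on the descriptor.
    return KERNEL_TIME[(algo, descriptor_math_type)]

# Pair returned by cudnnFind/cudnnGet*_v7 in this scenario:
chosen_algo, chosen_math = 1, "CUDNN_DEFAULT_MATH"

slow = run_convolution(chosen_algo, "CUDNN_TENSOR_OP_MATH")  # mathType never updated
fast = run_convolution(chosen_algo, chosen_math)             # mathType updated (the fix)
```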

Contributor

OK, this is a very good comment to put in a Note [chooseAlgorithm doesn't respect mathType] and reference from all of these sites :)

Collaborator Author

Added :)

AT_CUDNN_CHECK(cudnnSetConvolutionGroupCount(mut_desc(), groups));
AT_CUDNN_CHECK(cudnnSetConvolutionMathType(mut_desc(), CUDNN_DEFAULT_MATH));
Contributor

I guess, what I expected, was to see mathType added as an argument to this method, and then a set to cudnnSetConvolutionMathType here (instead of twiddling it in multiple locations further down this diff, where you could easily forget to put one site in.)

Contributor

Yeah, so I finally actually reviewed the patch, and I understand why I am confused.

The invariant that I want to see out of this code, is that the default convolution descriptor (when I call set() on it) has "good" settings. By deleting these lines, we end up with a descriptor that is default everything, even though we know that in the absence of benchmarking, we do subsequently toggle CUDNN_TENSOR_OP_MATH on for half inputs. So, to me, it seems like it would be better to set this here, and then delete the code that does the equivalent modification at

algoPerf->algo = search::DEFAULT_ALGO;
It also saves you from having to insert this code in multiple code paths.

What do you think?

Collaborator Author

That's totally doable. I was debating whether to call cudnnSetConvolutionMathType multiple times or to keep those extra lines setting the default mathType and algo. But since cudnnSetConvolutionMathType is cheap, let's bring those lines back and get rid of the multiple code paths touching mathType.

Collaborator Author

Please see the latest changes (014b7b9). I think it addresses both points of confusion?

Contributor

No, I am still confused. In the latest patch, you are still not setting cudnnSetConvolutionMathType inside ConvolutionDescriptor, instead, you're returning cudnnMathType_t via pointer argument, which is weird and pretty out of line with how the rest of the descriptors work.

Also, I don't understand why it is necessary to pre-fill fwdAlgPerf.mathType with a math type, in your code. The invariant I would expect is chooseAlgorithm always leaves fwdAlgPerf in a valid state. Is the problem this code?

  if (args.params.deterministic && !benchmark) {
    *algo = search::DEFAULT_ALGO;
    return;
  }

I don't know, but I could imagine that the DEFAULT_ALGO here isn't half-aware, so you get the wrong math type in that case? Sure, but then I would expect the fixup for the half case to occur here.

@ezyang (Contributor) commented Jan 22, 2019

"...cannot call cudnnSetConvolutionMathType in the descriptor's setter, because the cudnnSetConvolutionMathType called after chooseAlgorithm would overwrite the effect of the one in the descriptor's setter, and it would overwrite it with garbage for the default path if I don't have the following in chooseAlgorithm."

No, I don't agree with this. Look at the code here:

  args.cdesc.set(dataType, input.dim() - 2, args.params.padding, args.params.stride, args.params.dilation, args.params.groups);

  // TODO: when we do legacy group convolution support, we'll repeatedly
  // reinitialize the workspace for each convolution we do.  This is
  // wasteful; we'd rather reuse the workspace.  OTOH, legacy group
  // convolution support is already pretty slow, so this might not
  // matter.  (This applies to raw_cudnn_convolution_backward_input as well.)
  cudnnConvolutionFwdAlgo_t fwdAlg;
  Workspace workspace = chooseAlgorithm(args, benchmark, &fwdAlg);

The set method on descriptor is clearly called before chooseAlgorithm. I additionally audited every other occurrence of set( in native/cudnn/Conv.cpp to convince myself that ConvolutionDescriptor::set is always called before chooseAlgorithm in all codepaths.

I certainly agree with you that in the event that benchmarking occurs, we may need to subsequently call cudnnSetConvolutionMathType after benchmarking to update it (thus leading to an extra, technically unnecessary, setting of the math type). In the interest of code clarity, I am recommending that you do this extra work.

I didn't respond to your paragraphs after this claim. Let me know if I should.

@syed-ahmed (Collaborator Author) commented Jan 23, 2019

Ok. I prepared a diff for you based on your most recent reply (syed-ahmed@0499daa) but I'm embarrassed to say that I've discovered a cudnn bug! :'(

The bug is: if I call cudnnSetConvolutionMathType in the descriptor's setter (which sets the mathType to CUDNN_TENSOR_OP_MATH for fp16), cudnnGet*_v7 returns algo1 with the CUDNN_TENSOR_OP_MATH math type, instead of ignoring what was set by cudnnSetConvolutionMathType before it (and returning algo1 with CUDNN_DEFAULT_MATH, which is more performant)!

Any suggestions on how can I move forward? I have filed a bug internally.

Contributor

Interesting. Sorry about sending you on this wild goose chase!

In that case, I think something similar to an earlier version of your patch would be preferred (before you started changing it due to my comments). Here's the new justification:

  1. The job of ConvolutionDescriptor::set is to setup a convolution descriptor so that algorithm search works for it. Since setting math type to CUDNN_TENSOR_OP prevents cuDNN from considering non-tensor-op algorithms, that means it's correct for it to unconditionally set CUDNN_DEFAULT_MATH
  2. The job of findAlgorithm is to update ConvolutionDescriptor with correct, performant parameters, even if we never actually call the cudnn benchmarking function. And as the code is structured right now, you'll have to write the code to adjust the math setting in multiple places. I guess it's not worth fixing now.

So... maybe just reverting to e5038f9 and adding some comments from our investigation is sufficient? Once again, sorry about the wild goose chase!

Collaborator Author

Oh no no. It's all good. Your points are legit. I'll put the comments in for now and hopefully we can address this again in the future.

Collaborator Author

Done. :)

@ezyang
Contributor

ezyang commented Jan 14, 2019

Yes please!

syed-ahmed added a commit to syed-ahmed/pytorch that referenced this pull request Jan 18, 2019
soumith pushed a commit that referenced this pull request Jan 18, 2019
@facebook-github-bot (Contributor) left a comment

@ezyang is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

facebook-github-bot pushed a commit that referenced this pull request Jan 23, 2019
Summary: (same as the PR description above)

Differential Revision: D13787601

Pulled By: ezyang

fbshipit-source-id: 81fe86727673d021306fe1c99c3e528b7c9ad17f
@bhack
Contributor

bhack commented Jan 23, 2019

Did this have any impact on float() algo selection, beyond the half() example?

@syed-ahmed
Collaborator Author

@bhack There wasn't any impact on float(), because we were already calling cudnnSetConvolutionMathType with CUDNN_DEFAULT_MATH in the float() case, and cudnnGet/Find was returning algo1. As a result, even though we weren't calling cudnnSetConvolutionMathType to update the cdesc mathType after cudnnGet/Find, we were still using the performant algo+mathType combo. Below is a before-and-after nvprof for float() for reference.

cudnnFind path: [nvprof screenshot]
cudnnGet path: [nvprof screenshot]

@syed-ahmed syed-ahmed closed this Jan 26, 2019
@syed-ahmed syed-ahmed deleted the cudnn-bug-fix branch January 26, 2019 00:08
syed-ahmed added a commit to syed-ahmed/pytorch that referenced this pull request Jan 29, 2019
@syed-ahmed syed-ahmed restored the cudnn-bug-fix branch January 29, 2019 03:58
@syed-ahmed syed-ahmed reopened this Jan 29, 2019
syed-ahmed added a commit to syed-ahmed/pytorch that referenced this pull request Jan 29, 2019
facebook-github-bot pushed a commit that referenced this pull request Feb 1, 2019
Summary:
There is a regression in cudnnGet*_v7 that causes slowdown in resnet50 training. I am opening a bug with cuDNN team about this. This reverts commit 3837446.

ezyang 😿
Pull Request resolved: #16484

Differential Revision: D13924755

Pulled By: soumith

fbshipit-source-id: 8c719345fc443f1289539bfae630eea9224ba4a5
// update convDesc mathType since cudnn 7.4+ now requires both algo + mathType to figure out
// whether to use Tensor core kernels or not
// See Note [behavior of cudnnFind and cudnnGet]
AT_CUDNN_CHECK(cudnnSetConvolutionMathType(args.cdesc.mut_desc(), perfResults[best_algo_idx].mathType));
Collaborator

This won't work. This will set the correct math type only the first time, when you are actually running find. All subsequent times, cdesc will be created with the default math type (default for fp32, tensor cores for fp16), and it won't ever be changed.
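This point can be sketched with a hypothetical mock of the benchmark cache (the names `algo_cache`, `find`, and `run` are made up for illustration): the found algorithm is cached, but a freshly created descriptor starts from the default math type, so the math type from the original find is lost on every subsequent call.

```python
# Hypothetical mock of the caching behavior described above: only the algo
# is cached, not the math type that find paired with it.
algo_cache = {}

def find(key):
    # Made-up find result: the performant pair is algo1 + default math.
    return {"algo": 1, "math": "CUDNN_DEFAULT_MATH"}

def run(key, fp16):
    # A new descriptor is created on every call, with the default math type
    # (default for fp32, tensor cores for fp16).
    desc_math = "CUDNN_TENSOR_OP_MATH" if fp16 else "CUDNN_DEFAULT_MATH"
    if key not in algo_cache:
        perf = find(key)
        algo_cache[key] = perf["algo"]
        desc_math = perf["math"]  # math type fixed up only on the FIRST call
    return algo_cache[key], desc_math

first = run("conv1", fp16=True)   # find runs; math type corrected
second = run("conv1", fp16=True)  # cache hit; stale default math type
```

The second call silently reverts to the fp16 default math type even though find determined a different one, which is the bug being pointed out.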

Collaborator Author

Gotcha, I see what you are saying. I was trying to address @ezyang's request to remove the following code blocks; this can either stay as is, or I could make the setter take perf_t and do this there once.

if (dataType == CUDNN_DATA_HALF) {
    algoPerf->mathType = CUDNN_TENSOR_OP_MATH;
} else {
    algoPerf->mathType = CUDNN_DEFAULT_MATH;
}

Co-authored-by: Natalia Gimelshein <ngimelshein@nvidia.com>
// When cudnnSetConvolutionMathType is called before cudnnGet/cudnnFind, it informs
// cudnnGet/cudnnFind to iterate/take into account both tensor core and non-tensor-core algos.
// If you don't call cudnnSetConvolutionMathType before calling cudnnGet/cudnnFind,
// cudnnGet/cudnnFind may not pick tensor core algos.
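A hypothetical sketch of the search behavior this note describes (the candidate list and `find_candidates` helper are made up, not the real cuDNN heuristics): the math type set on the descriptor before the search determines which candidates are even considered.

```python
# Made-up candidate pool: the same algo may exist in both a tensor-core
# and a non-tensor-core variant.
CANDIDATES = [
    {"algo": 0, "math": "CUDNN_DEFAULT_MATH"},
    {"algo": 1, "math": "CUDNN_DEFAULT_MATH"},
    {"algo": 1, "math": "CUDNN_TENSOR_OP_MATH"},
]

def find_candidates(descriptor_math):
    """Stand-in for cudnnGet/cudnnFind candidate enumeration.

    Setting CUDNN_TENSOR_OP_MATH on the descriptor beforehand tells the
    search to consider BOTH tensor-core and non-tensor-core algos; without
    it, tensor-core candidates may be skipped entirely.
    """
    if descriptor_math == "CUDNN_TENSOR_OP_MATH":
        return CANDIDATES
    return [c for c in CANDIDATES if c["math"] == "CUDNN_DEFAULT_MATH"]

full_search = find_candidates("CUDNN_TENSOR_OP_MATH")
narrow_search = find_candidates("CUDNN_DEFAULT_MATH")
```

This is why the math type passed in before the search matters even though the search also *returns* a math type: it widens the candidate set rather than fixing the outcome.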
Contributor

I'm planning to accept this patch as is, but one thing I don't quite get from the description here is whether or not the math type you pass in matters.

Collaborator Author

It does matter, as it turns out. Should I word this differently (I don't think one can decipher this from the cuDNN docs either)? For instance, before this patch, since we didn't pass CUDNN_TENSOR_OP_MATH, cudnnFind/cudnnGet was not picking the algorithms that enable tensor cores.

@syed-ahmed
Collaborator Author

Here are the resnet50 numbers with top of tree. Will update with MaskRCNN numbers as soon as I get them:

Resnet-50, FP16, 8xV100-16GB, batch-size: 256
---------------------------------------------
5689 images/second without PR
5912 images/second with PR

@ezyang
Contributor

ezyang commented Feb 4, 2019

I attempted to read the algo selection code a few times this weekend, and just gave up lol. I'm going to just go ahead and merge this and trust you guys ;)

@ngimel
Collaborator

ngimel commented Feb 4, 2019

@ezyang but you wrote it, or at least moved it from the previous incarnation of cudnn bindings :-)

@ezyang
Contributor

ezyang commented Feb 5, 2019

I plead copy paste!

@facebook-github-bot (Contributor) left a comment

@ezyang is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

facebook-github-bot pushed a commit that referenced this pull request Feb 5, 2019
Summary: (same as the PR description above)

Differential Revision: D13957944

Pulled By: ezyang

fbshipit-source-id: a88c39d80ae37f2d686665622302b62b50fab404
@syed-ahmed syed-ahmed closed this Feb 5, 2019