
Fixes selection of cuDNN algorithm #15881

Closed
wants to merge 7 commits

Conversation

syed-ahmed
Collaborator

@syed-ahmed syed-ahmed commented Jan 9, 2019

Summary:

This PR updates the logic for using cudnnGet* and cudnnFind*. The current (v7) versions of cudnnFind and cudnnGet return a pair: the best algorithm and the convDesc mathType. While we were using the returned algorithm, we weren't updating the mathType. As a result, we ended up with a slow combination of algorithm and math type. Without this patch, we are seeing a 10x regression in group convolutions.

Changelist:

  • Changed the template arguments to be `perf_t` instead of `algo_t` to unify cudnnFind and cudnnGet. Both serve the same purpose, so it made sense to unify them and get rid of `getAlgorithm`.
  • Used cudnnGet*_v7 everywhere cudnnGet* was being used.
  • Removed all cudnn6 paths (this PR depends on #15851, "Remove support for CUDNN 6").
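To illustrate the pattern the changelist describes (a hypothetical Python mock, not the actual C++ in native/cudnn/Conv.cpp — the names `Perf`, `choose_algorithm`, and `ConvDescriptor` are stand-ins): both cudnnFind and cudnnGet*_v7 return perf entries carrying an algorithm *and* a math type, and both fields must be applied to the convolution descriptor.

```python
from collections import namedtuple

# Stand-in for cudnnConvolutionFwdAlgoPerf_t: cudnnFind*/cudnnGet*_v7
# both return a list of these, which lets the two paths share one codepath.
Perf = namedtuple("Perf", ["algo", "math_type", "time"])

def choose_algorithm(perf_results):
    """Return the fastest perf entry; the caller must apply BOTH fields."""
    return min(perf_results, key=lambda p: p.time)

class ConvDescriptor:
    """Stand-in for the cuDNN convolution descriptor."""
    def __init__(self):
        self.math_type = "CUDNN_DEFAULT_MATH"

    def set_math_type(self, math_type):
        # Mirrors cudnnSetConvolutionMathType.
        self.math_type = math_type

# Made-up search results: the same algo can appear under two math types
# with different timings.
results = [
    Perf(algo=1, math_type="CUDNN_TENSOR_OP_MATH", time=0.006),
    Perf(algo=1, math_type="CUDNN_DEFAULT_MATH", time=0.004),
]

best = choose_algorithm(results)
cdesc = ConvDescriptor()
# The fix: propagate the returned math type onto the descriptor, since
# cuDNN keys its kernel choice on the (algo, mathType) pair.
cdesc.set_math_type(best.math_type)
```

Applying only `best.algo` and leaving the descriptor's math type untouched is exactly the bug this PR fixes.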

CC: @ngimel @csarofeen for review

Nvprof:

  • Before: on the left is the timeline for cudnnFind, and on the right the one for cudnnGet. In the cudnnFind timeline, the first few kernel calls are the find step. Note that the chosen algorithm is not the fastest one, because the chosen algo1 is run with mathType 1. [nvprof screenshot: before]
  • After: compared to the timeline above, when the mathType is updated, the fastest algo is indeed chosen. (In cuDNN 7.4, algo1 + mathType 0 gives better perf.) [nvprof screenshot: after]

Timing:

  • Current PyTorch nightly container: 0.006510331630706787 s/iter
  • With this patch: 0.004045917987823487 s/iter
  • Speedup: ~1.6x

Code used in analysis:

import time

import torch
import torch.nn as nn

print("cudnn version", torch.backends.cudnn.version())

conv = nn.Conv2d(16, 256, (3, 3), dilation=(2, 2), padding=(1, 1)).cuda().half()
torch.backends.cudnn.benchmark = True

input = torch.randn(64, 16, 56, 56, device="cuda").half()
out = conv(input)
gO = torch.rand_like(out)
torch.cuda.synchronize()

# Time 100 forward+backward iterations.
s = time.time()
for i in range(100):
    out = conv(input.detach())
    out.backward(gO)
torch.cuda.synchronize()
e = time.time()
print((e - s) / 100)


// update convDesc mathType since cudnn now requires both algo + mathType to figure out
// whether to use Tensor cores or not
AT_CUDNN_CHECK(cudnnSetConvolutionMathType(args.cdesc.mut_desc(), fwdAlgPerf.mathType));
Contributor

I'm a little confused here: why don't you modify the convolution descriptor before calling chooseAlgorithm?

Collaborator Author

That is because calls to cudnnSetConvolutionMathType before chooseAlgorithm don't matter. I only needed to set the convolution math type to the one returned by chooseAlgorithm, so I deleted the calls to cudnnSetConvolutionMathType in the descriptor's setter and instead call it directly after chooseAlgorithm at all the subsequent call sites. It turns out the mathType returned by chooseAlgorithm can differ from what we set beforehand in the setter, so we have to explicitly update it after chooseAlgorithm has found the best algorithm+mathType pair. Otherwise, even though we'll call cudnnConvolutionForward with the fastest algorithm, under the hood cuDNN will run a slower kernel, since it sees the fastest algorithm paired with a suboptimal mathType.
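To put that in other words (a hypothetical mock with made-up timings, not real cuDNN behavior measurements): cuDNN dispatches on the (algo, mathType) pair, so the fastest algorithm under a stale math type can still map to a slow kernel.

```python
# Hypothetical mock: cuDNN picks the actual kernel from BOTH the algorithm
# and the descriptor's current math type. Timings below are made up.
KERNEL_TIME = {
    (1, "CUDNN_DEFAULT_MATH"): 0.004,
    (1, "CUDNN_TENSOR_OP_MATH"): 0.006,
}

def run_convolution(algo, descriptor_math_type):
    # Stand-in for cudnnConvolutionForward: the kernel actually dispatched
    # depends on the math type currently set on the descriptor.
    return KERNEL_TIME[(algo, descriptor_math_type)]

# Pair returned by cudnnFind/cudnnGet*_v7 in this scenario:
chosen_algo, chosen_math = 1, "CUDNN_DEFAULT_MATH"

slow = run_convolution(chosen_algo, "CUDNN_TENSOR_OP_MATH")  # mathType never updated
fast = run_convolution(chosen_algo, chosen_math)             # mathType updated (the fix)
```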

Contributor

OK, this is a very good comment to put in a Note [chooseAlgorithm doesn't respect mathType] and reference from all of these sites :)

Collaborator Author

Added :)

AT_CUDNN_CHECK(cudnnSetConvolutionGroupCount(mut_desc(), groups));
AT_CUDNN_CHECK(cudnnSetConvolutionMathType(mut_desc(), CUDNN_DEFAULT_MATH));
Contributor

I guess, what I expected, was to see mathType added as an argument to this method, and then a set to cudnnSetConvolutionMathType here (instead of twiddling it in multiple locations further down this diff, where you could easily forget to put one site in.)

Contributor

Yeah, so I finally actually reviewed the patch, and I understand why I am confused.

The invariant that I want to see out of this code, is that the default convolution descriptor (when I call set() on it) has "good" settings. By deleting these lines, we end up with a descriptor that is default everything, even though we know that in the absence of benchmarking, we do subsequently toggle CUDNN_TENSOR_OP_MATH on for half inputs. So, to me, it seems like it would be better to set this here, and then delete the code that does the equivalent modification at

algoPerf->algo = search::DEFAULT_ALGO;
It also saves you from having to insert this code in multiple code paths.

What do you think?

Collaborator Author

That's totally doable. I was debating whether to call cudnnSetConvolutionMathType multiple times or to keep those extra lines setting the default mathType and algo. But since cudnnSetConvolutionMathType is cheap, let's bring those lines back and get rid of the multiple code paths touching mathType.

Collaborator Author

Please see the latest changes (014b7b9). I think it addresses both points of confusion?

Contributor

No, I am still confused. In the latest patch, you are still not setting cudnnSetConvolutionMathType inside ConvolutionDescriptor, instead, you're returning cudnnMathType_t via pointer argument, which is weird and pretty out of line with how the rest of the descriptors work.

Also, I don't understand why it is necessary to pre-fill fwdAlgPerf.mathType with a math type, in your code. The invariant I would expect is chooseAlgorithm always leaves fwdAlgPerf in a valid state. Is the problem this code?

  if (args.params.deterministic && !benchmark) {
    *algo = search::DEFAULT_ALGO;
    return;
  }

I don't know, but I could imagine that the DEFAULT_ALGO here isn't half-aware, so you get the wrong math type in that case? Sure, but then I would expect the fixup for the half case to occur here.

@ezyang (Contributor) commented Jan 22, 2019

"...cannot call cudnnSetConvolutionMathType in the descriptor's setter, because the cudnnSetConvolutionMathType called after chooseAlgorithm would overwrite the effect of the one in the descriptor's setter, and it would overwrite it with garbage for the default path if I don't have the following in chooseAlgorithm."

No, I don't agree with this. Look at the code here:

  args.cdesc.set(dataType, input.dim() - 2, args.params.padding, args.params.stride, args.params.dilation, args.params.groups);

  // TODO: when we do legacy group convolution support, we'll repeatedly
  // reinitialize the workspace for each convolution we do.  This is
  // wasteful; we'd rather reuse the workspace.  OTOH, legacy group
  // convolution support is already pretty slow, so this might not
  // matter.  (This applies to raw_cudnn_convolution_backward_input as well.)
  cudnnConvolutionFwdAlgo_t fwdAlg;
  Workspace workspace = chooseAlgorithm(args, benchmark, &fwdAlg);

The set method on descriptor is clearly called before chooseAlgorithm. I additionally audited every other occurrence of set( in native/cudnn/Conv.cpp to convince myself that ConvolutionDescriptor::set is always called before chooseAlgorithm in all codepaths.

I certainly agree with you that in the event that benchmarking occurs, we may need to subsequently call cudnnSetConvolutionMathType after benchmarking to update it (thus leading to an extra, technically unnecessary, setting of the math type). In the interest of code clarity, I am recommending that you do this extra work.

I didn't respond to your paragraphs after this claim. Let me know if I should.

@syed-ahmed (Collaborator Author) commented Jan 23, 2019

Ok. I prepared a diff for you based on your most recent reply (syed-ahmed@0499daa) but I'm embarrassed to say that I've discovered a cudnn bug! :'(

The bug is: if I call cudnnSetConvolutionMathType in the descriptor's setter (which sets the mathType to CUDNN_TENSOR_OP_MATH for fp16), cudnnGet*_v7 returns algo1 with the CUDNN_TENSOR_OP_MATH math type, instead of ignoring what was set by cudnnSetConvolutionMathType before it (and returning algo1 with CUDNN_DEFAULT_MATH, which is more performant)!

Any suggestions on how can I move forward? I have filed a bug internally.

Contributor

Interesting. Sorry about sending you on this wild goose chase!

In that case, I think something similar to an earlier version of your patch would be preferred (before you started changing it due to my comments). Here's the new justification:

  1. The job of ConvolutionDescriptor::set is to setup a convolution descriptor so that algorithm search works for it. Since setting math type to CUDNN_TENSOR_OP prevents cuDNN from considering non-tensor-op algorithms, that means it's correct for it to unconditionally set CUDNN_DEFAULT_MATH
  2. The job of findAlgorithm is to update ConvolutionDescriptor with correct, performant parameters, even if we never actually call the cudnn benchmarking function. And as the code is structured right now, you'll have to write the code to adjust the math setting in multiple places. I guess it's not worth fixing now.

So... maybe just reverting to e5038f9 and adding some comments from our investigation is sufficient? Once again, sorry about the wild goose chase!

Collaborator Author

Oh no no. It's all good. Your points are legit. I'll put the comments in for now and hopefully we can address this again in the future.

Collaborator Author

Done. :)

@ezyang
Contributor

ezyang commented Jan 14, 2019

Yes please!

syed-ahmed added a commit to syed-ahmed/pytorch that referenced this pull request Jan 18, 2019
soumith pushed a commit that referenced this pull request Jan 18, 2019
@facebook-github-bot (Contributor) left a comment

@ezyang is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

facebook-github-bot pushed a commit that referenced this pull request Jan 23, 2019
Summary: (same as the PR description above)

Differential Revision: D13787601

Pulled By: ezyang

fbshipit-source-id: 81fe86727673d021306fe1c99c3e528b7c9ad17f
@bhack
Contributor

bhack commented Jan 23, 2019

Did this have any impact on float() algo selection, beyond the half() example?

@syed-ahmed
Collaborator Author

@bhack There wasn't any impact on float(), because we were already calling cudnnSetConvolutionMathType with CUDNN_DEFAULT_MATH in the float() case, and cudnnGet/Find was returning algo1. As a result, even though we weren't calling cudnnSetConvolutionMathType to update the cdesc mathType after cudnnGet/Find, we were still using the performant algo+mathType combo. Below is a before-and-after nvprof for float() for reference.

cudnnFind path: [nvprof screenshot]
cudnnGet path: [nvprof screenshot]

@syed-ahmed syed-ahmed closed this Jan 26, 2019
@syed-ahmed syed-ahmed deleted the cudnn-bug-fix branch January 26, 2019 00:08
syed-ahmed added a commit to syed-ahmed/pytorch that referenced this pull request Jan 29, 2019
@syed-ahmed syed-ahmed restored the cudnn-bug-fix branch January 29, 2019 03:58
@syed-ahmed syed-ahmed reopened this Jan 29, 2019
syed-ahmed added a commit to syed-ahmed/pytorch that referenced this pull request Jan 29, 2019
facebook-github-bot pushed a commit that referenced this pull request Feb 1, 2019
Summary:
There is a regression in cudnnGet*_v7 that causes slowdown in resnet50 training. I am opening a bug with cuDNN team about this. This reverts commit 3837446.

ezyang 😿
Pull Request resolved: #16484

Differential Revision: D13924755

Pulled By: soumith

fbshipit-source-id: 8c719345fc443f1289539bfae630eea9224ba4a5
// update convDesc mathType since cudnn 7.4+ now requires both algo + mathType to figure out
// whether to use Tensor core kernels or not
// See Note [behavior of cudnnFind and cudnnGet]
AT_CUDNN_CHECK(cudnnSetConvolutionMathType(args.cdesc.mut_desc(), perfResults[best_algo_idx].mathType));
Collaborator

This won't work. This will set the correct math type only the first time, when you are actually running find. All subsequent times, cdesc will be created with the default math type (default for fp32, tensor cores for fp16), and it won't ever be changed.
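This point can be sketched with a hypothetical mock of the benchmark cache (the names `algo_cache`, `find`, and `run` are made up for illustration): the found algorithm is cached, but a freshly created descriptor starts from the default math type, so the math type from the original find is lost on every subsequent call.

```python
# Hypothetical mock of the caching behavior described above: only the algo
# is cached, not the math type that find paired with it.
algo_cache = {}

def find(key):
    # Made-up find result: the performant pair is algo1 + default math.
    return {"algo": 1, "math": "CUDNN_DEFAULT_MATH"}

def run(key, fp16):
    # A new descriptor is created on every call, with the default math type
    # (default for fp32, tensor cores for fp16).
    desc_math = "CUDNN_TENSOR_OP_MATH" if fp16 else "CUDNN_DEFAULT_MATH"
    if key not in algo_cache:
        perf = find(key)
        algo_cache[key] = perf["algo"]
        desc_math = perf["math"]  # math type fixed up only on the FIRST call
    return algo_cache[key], desc_math

first = run("conv1", fp16=True)   # find runs; math type corrected
second = run("conv1", fp16=True)  # cache hit; stale default math type
```

The second call silently reverts to the fp16 default math type even though find determined a different one, which is the bug being pointed out.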

Collaborator Author

Gotcha, I see what you are saying. I was trying to address @ezyang's request to remove the following code blocks; this can either stay as is, or I could make the setter take perf_t and do this there once.

if (dataType == CUDNN_DATA_HALF) {
    algoPerf->mathType = CUDNN_TENSOR_OP_MATH;
} else {
    algoPerf->mathType = CUDNN_DEFAULT_MATH;
}

Co-authored-by: Natalia Gimelshein <ngimelshein@nvidia.com>
// When cudnnSetConvolutionMathType is called before cudnnGet/cudnnFind, it informs
// cudnnGet/cudnnFind to iterate/take into account both tensor core and non-tensor-core algos.
// If you don't call cudnnSetConvolutionMathType before calling cudnnGet/cudnnFind,
// cudnnGet/cudnnFind may not pick tensor core algos.
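A hypothetical sketch of the search behavior this note describes (the candidate list and `find_candidates` helper are made up, not the real cuDNN heuristics): the math type set on the descriptor before the search determines which candidates are even considered.

```python
# Made-up candidate pool: the same algo may exist in both a tensor-core
# and a non-tensor-core variant.
CANDIDATES = [
    {"algo": 0, "math": "CUDNN_DEFAULT_MATH"},
    {"algo": 1, "math": "CUDNN_DEFAULT_MATH"},
    {"algo": 1, "math": "CUDNN_TENSOR_OP_MATH"},
]

def find_candidates(descriptor_math):
    """Stand-in for cudnnGet/cudnnFind candidate enumeration.

    Setting CUDNN_TENSOR_OP_MATH on the descriptor beforehand tells the
    search to consider BOTH tensor-core and non-tensor-core algos; without
    it, tensor-core candidates may be skipped entirely.
    """
    if descriptor_math == "CUDNN_TENSOR_OP_MATH":
        return CANDIDATES
    return [c for c in CANDIDATES if c["math"] == "CUDNN_DEFAULT_MATH"]

full_search = find_candidates("CUDNN_TENSOR_OP_MATH")
narrow_search = find_candidates("CUDNN_DEFAULT_MATH")
```

This is why the math type passed in before the search matters even though the search also *returns* a math type: it widens the candidate set rather than fixing the outcome.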
Contributor

I'm planning to accept this patch as is, but one thing I don't quite get from the description here is whether or not the math type you pass in matters.

Collaborator Author

It does matter, as it turns out. Should I word this differently (I don't think one can decipher this from the cuDNN docs either)? For instance, before this patch, since we didn't pass CUDNN_TENSOR_OP_MATH, cudnnFind/cudnnGet was not picking the algorithms that enable tensor cores.

@syed-ahmed
Collaborator Author

Here are the resnet50 numbers with top of tree. Will update with MaskRCNN numbers as soon as I get them:

Resnet-50, FP16, 8xV100-16GB, batch-size: 256
---------------------------------------------
5689 images/second without PR
5912 images/second with PR

@ezyang
Contributor

ezyang commented Feb 4, 2019

I attempted to read the algo selection code a few times this weekend, and just gave up lol. I'm going to just go ahead and merge this and trust you guys ;)

@ngimel
Collaborator

ngimel commented Feb 4, 2019

@ezyang but you wrote it, or at least moved it from the previous incarnation of cudnn bindings :-)

@ezyang
Contributor

ezyang commented Feb 5, 2019

I plead copy paste!

@facebook-github-bot (Contributor) left a comment

@ezyang is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

facebook-github-bot pushed a commit that referenced this pull request Feb 5, 2019
Summary: (same as the PR description above)

Differential Revision: D13957944

Pulled By: ezyang

fbshipit-source-id: a88c39d80ae37f2d686665622302b62b50fab404
@syed-ahmed syed-ahmed closed this Feb 5, 2019