GPU acceleration for Apple's M1 chip? #47702
Comments
See also #47688, which is another issue on the same topic. |
https://blog.tensorflow.org/2020/11/accelerating-tensorflow-performance-on-mac.html Something like this in Pytorch would definitely be very cool! |
@dbermond From this blog it seems like PyTorch could also take advantage of the ML Compute framework, which was just added in macOS 11 and iOS 14. However, this is not compatible with older macOS and iOS versions, and you still need to create computation graphs upfront. I'm also wondering if Apple is going to collaborate with Facebook to bring the acceleration to PyTorch. |
It's 2020 now and AMD GPUs are still not officially supported by PyTorch... How much GPU support do you really expect? Maybe we should focus on getting PyTorch's CPU build working on M1 before we jump to the GPU... |
I don't know, but the PyTorch team seems to be exclusively fond of Intel and Nvidia. I have a general dislike for TensorFlow, but at least they provide some support for AMD and OpenCL (SYCL) - the way things are going now, I think I'd be going back to TF despite all the drawbacks. |
I hope so !!! |
Perhaps "GPU acceleration" is not the best title for this issue. To be clear, I put that there because it seems to be the most promising component yielding the largest boost....based on the presentation we saw from Apple last week (15x better). That needs verification through tests though. It could be beneficial to look at the chip itself. The ML Compute Framework seems to suggest that training can take place on both CPU and GPU...although the text, imo, isn't definitive. But here's what it says:
So the unified memory architecture is what Apple says it is, then there's no need to copy data between CPU and GPU. And it's the chip that really becomes the focus. That pretty serious. |
That said, I do believe there will be performance differences between CPU, GPU and Neural engine. I think that's a given. |
@toshi2k2 Work is being done on HIP (AMD) and a nightly version is already out: #10670 (comment) |
@BramVanroy it is being done, but it's still unstable. And there is still no official support for HIP. Point is, an open source project should not, in all morality, cater to anti-open-source and monopolistic companies. They can build their own versions of frameworks or accelerators whenever they want (e.g. Apple). |
Which part of CUDA is open source? Just asking because apart from AMD, everyone has made their accelerators proprietary. |
I think some of the replies here suggesting that this shouldn't be a priority are pretty myopic. The M1 is not specialty hardware, like a ML-capable GPU. It is standard, everyday, consumer-grade hardware. It's going to be everywhere. Any ML framework that doesn't support it is guaranteed to fall behind in usage compared to those that do. |
What I understand here is that this kind of perspective is the reason behind frameworks getting handed over to a few vendors. :) Also, you don't have to care about any product since this isn't a consumer forum asking for reviews of a gadget. Speaking of walled gardens, I'm very intrigued to see which part of CUDA is open source. Afaik, MLCompute, which Apple used to build their private fork of TensorFlow, is also closed behind bars like CUDA. The only open source solution out there is ROCm, which PyTorch is reluctant to support, making AMD play catch-up. |
The conversation is getting a bit heated. Toshi2k2 is taking the Richard Stallman perspective which is valid. Unfortunately the GPU industry is non free and there's not much you can do about it. If you want to do AI you're stuck with non free licenses. I for one have a new Apple M1 device and would like to see pytorch support for it. |
Also, this childish "Apple bad" mentality isn't going to help anybody. As someone mentioned above, you have to support commodity hardware. Be it open source or anything else. |
Richard Stallman's perspective works great on paper. In practice, people need to get work done, and what gets work done on the majority of GPUs isn't open source. I'm not against the open source ideology, but sometimes people need to understand that just because they don't eat ice cream on Monday mornings doesn't mean other people should have to do the same. |
Please read again what I wrote. :) I don't know why you're acting like an angry redditor with brand bias. I for one would love to see ROCm succeed, but in its current state it's barely usable for most use cases. And there are a lot of people out there to whom getting the job done matters more than having a fully open-source-compliant setup and open source ethics. And that's where ROCm and their HIP approach is useless. Only support Vega and Polaris GPUs? Tied to a custom kernel only? Really? I guess I read comments similar to yours there as well when someone asked for ROCm to support Windows. And the reply read the same: go make your own or fly kites. Perhaps that's the reason why such projects fail in the long run. If the world had to run on this Richard Stallmanistic purview, we wouldn't have had a lot of things in existence. :) |
this conversation is super unproductive. PyTorch stands for pragmatism. We have finite engineering time, and whatever works best for our users in terms of flexibility, user-friendliness, performance and support is first priority for us. We don't really try to stand for or promote a viral open-source philosophy such as GPL.
@toshi2k2 if you actually move on to TensorFlow, and you try out SyCL / ROCm ports of TensorFlow and you are happy with the experience, please do share here. We are pragmatic, not egoistic, we will learn from anyone and anywhere and prioritize to integrate the best things into PyTorch with the finite time we have. |
I just compiled a pure-CPU version of PyTorch from source; it works fine on M1. Haven't benchmarked it yet. First install miniconda with Python 3.9 and TensorFlow, numpy, scipy etc. for M1 using this link:
That part was easy. Here is a precompiled wheel in case you are interested:
In case someone is interested, George Hotz is hacking to get the Neural Engine to work: |
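For anyone wanting a quick sanity check of such a CPU-only build, here is a minimal sketch (the sizes and iteration count are arbitrary illustrations, not the poster's actual benchmark):

# Hypothetical sanity check for a CPU-only PyTorch build on M1.
import time
import torch

print(torch.__version__, "CUDA available:", torch.cuda.is_available())  # expect False on M1

a = torch.randn(1024, 1024)
b = torch.randn(1024, 1024)

start = time.perf_counter()
for _ in range(100):
    c = a @ b  # plain CPU matmul
elapsed = time.perf_counter() - start
print(f"100 x 1024x1024 matmul on CPU: {elapsed:.3f} s")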
@erwincoumans can you compile with python 3.8 instead of python 3.9? This is what tensorflow-macos is using right now. It would be cool just to have one environment for comparison. |
@denfromufa the tensorflow-macos package is incomplete and doesn't ship with include headers and libraries, so we cannot point CMAKE_PREFIX_PATH to a path. So for now, it is just switching between virtualenv+python3.8+tensorflow-macos and miniconda3+python3.9+pytorch. It makes sense to use the M1 for inference, converting a PyTorch or TF model to Core ML using https://coremltools.readme.io/docs; that would let you use the Neural Engine, GPU, or CPU. coremltools can be imported from Python 3.9, so converting a PyTorch model would work. The tensorflow-macos Python 3.8 environment doesn't support coremltools yet, since scipy is not supported at the moment. |
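To illustrate the PyTorch-to-Core-ML path mentioned above, here is a minimal conversion sketch. The model, input shape, and input name are placeholders, and coremltools 4 or newer with the unified conversion API is assumed:

# Convert a traced PyTorch model to Core ML so it can run on Neural Engine / GPU / CPU.
import torch
import torchvision
import coremltools as ct

model = torchvision.models.mobilenet_v2(pretrained=True).eval()  # example model only
example = torch.rand(1, 3, 224, 224)
traced = torch.jit.trace(model, example)  # conversion needs a TorchScript model

mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(name="input", shape=example.shape)],
)
mlmodel.save("mobilenet_v2.mlmodel")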
Any updates? Will PyTorch add M1 support? |
@erwincoumans did you have to disable NNPACK and XNNPACK in your build ? |
Just looking at some responses above and wondering how to help with PyTorch and macOS MLCompute adoption. Just a few thoughts on "how can the community help?"
|
Someone asked "any updates?", and sure, a few things were learned about macOS+MLCompute during the last week. With respect to my comment from last week:
Now, first, I managed to write ObjectiveC code to run a batched MatMul (aka GEMM) layer with MLCompute. The whole graph, inference graph, compile, execute thing with a few deterministic inputs to verify correctness. On CPU and GPU. I concentrated on the whole inference side first for simplicity.
Next step, integration of the MLCompute/ObjC code with my speed and performance test (in C++) to compare a repeated 1024x1024 MatMul in Accelerate (GEMM), MLCompute, BNNS and Metal/MPS, both on CPU and GPU. First impression on the CPU case: CPU-based speed via Accelerate is fastest, say a reference speed of 1.0x. Then, BNNS is roughly 1.1x that speed, MPS/Metal about 0.5x IIRC, and MLCompute about 1.5 to 2x slower compared to plain Accelerate.

That led to a bit of checking of the various buffer copies and syncs. My impression was that MLCompute is slower because it is doing more mem-copies in this first prototype. Therefore, I built another test case in Objective-C for Apple to verify how many tensors/buffers are created. On repeated inference, it shows that a new result tensor was generated every time after execute, i.e. a new 1024x1024 buffer, which certainly slows things down a bit. I identified two ways that presumably let you use a given output tensor and avoid copies, and then noticed that I still get a new result tensor every time. Either a bug on MLCompute's side or wrong use of MLCompute on my side :) Just speculating, but maybe this shows up badly only on my Intel iMac, as the M1 has a more unified memory where system-specific sync shortcuts could make a big difference.

Opened a forum discussion for Apple at https://developer.apple.com/forums/thread/670334 and a proper Feedback Assistant request FB8957414 with an attached test case. Funnily enough, I had to put it in their CoreML section as there was no MLCompute section yet (that I could see :) The issue is simply that I was unable to use a GIVEN result tensor and avoid extraneous copies of large result tensors. Probably my bad, but might be a bug in MLCompute as well.

In summary, Next steps Just my 2pc. Any other suggestions? |
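As a rough Python counterpart to the repeated 1024x1024 GEMM timing described above, here is a sketch using NumPy, whose BLAS backend (Accelerate or OpenBLAS, depending on the build) stands in for the Accelerate/GEMM reference case; MLCompute itself is not reachable from Python here, and the iteration count is arbitrary:

# Time repeated 1024x1024 float32 GEMM into a preallocated output buffer,
# mirroring the "no new result tensor per call" concern discussed above.
import time
import numpy as np

n, iters = 1024, 200
a = np.random.rand(n, n).astype(np.float32)
b = np.random.rand(n, n).astype(np.float32)
out = np.empty((n, n), dtype=np.float32)  # reused result buffer

start = time.perf_counter()
for _ in range(iters):
    np.matmul(a, b, out=out)
elapsed = time.perf_counter() - start
print(f"{iters} x {n}x{n} float32 GEMM: {elapsed:.3f} s "
      f"({iters * 2 * n**3 / elapsed / 1e9:.1f} GFLOP/s)")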
In advance, I would like to apologize for pinging everyone on this thread. I apologize for the delay since it has been about 2 weeks since I have heard from Apple, but when asking them if they ever have plans to merge their version of This makes me think that perhaps Most of what I said is hypothetical, but the next thing I will be doing is to just ask them in an upfront manner if they ever have plans to release an accelerated Pytorch. However, I would assume that they would not have plans and any changes to implement GPU acceleration would need to be provided through the developments made by the Pytorch team or from us as a community. |
I'll be trying nod if I can get the time - some sort of way to run PyTorch on the M1 GPU 'with a few lines of code' |
Thanks for sharing. This seems interesting. |
Have you tried out nod yet? Does it work for training? I only saw benchmarks for inference |
I didn't get the time yet to try nod, but I did fool around with some training under TensorFlow, and man, that M1 chip is impressive - barely burning 2 watts at full bore at what looks like better speed than my Razer Blade Stealth laptop GPU, which sounds like a small jet engine, albeit I've not seen the TF performance benchmarks yet |
I can't imagine PyTorch on the M1 Ultra with UltraFusion's 2.5 TB/s. The unified memory can fit large models |
Just waiting for pytorch beta on wwdc 2022 |
128GB of GPU memory on just this gen's M1 Ultra; imagine the next gen with 256GB of GPU RAM. Supporting this platform is a must, it will allow training of models that would previously require multi-GPU hardware not accessible to most people. |
Are we close to seeing a public beta release of PyTorch acceleration for macOS? The Mac Studio has a ton of GPU power just waiting to be harnessed. I also notice this job listing over at Apple: https://jobs.apple.com/en-us/details/200265506/accelerating-pytorch-on-macs-with-bnns It seems they are looking into accelerating PyTorch with BNNS and the Accelerate Framework. |
That looks awesome!! Looking at Apple's history of making TensorFlow closed source worries me that PyTorch from Apple will also be closed. Just binaries. |
Yeah. It's definitely needed. We can't afford to train models on Colab, GCP, or AWS on costly GPUs! M1 chips are just unexpectedly crazy good. |
This does not make much sense. PyTorch already uses Accelerate and the AMX. Maybe they're looking for lower overhead? The main purpose of preferring a CPU is for low latency when you can't build a graph (RNNs and RL). When you can build a graph, I'm not sure the CPU is commonly faster than the GPU - look at the graphs in the article about SHARK/IREE. If you can get the GPU driver calls to have extremely low overhead (~1 microsecond per command) in eager mode, then the GPU essentially replaces the CPU. Thus, they would be better off building a Metal backend. But Metal has a driver latency of 10 us per command buffer, and MPSGraph is even worse - 100 us to create an |
Exciting news! YOLOv5 inference (but not training) is currently supported on the Apple M1 Neural Engine (all variants). Results show a 13X speedup vs CPU on a base 2020 M1 MacBook Air.
Results
YOLOv5 🚀 v6.1-25-gcaf7ad0 torch 1.11.0 CPU
Reproduce
git clone https://github.com/ultralytics/yolov5
cd yolov5
pip install -r requirements.txt # install (requires python > 3.7)
python export.py --weights yolov5s.pt --include coreml # export creates yolov5s.mlmodel
python detect.py --weights yolov5s.pt # PyTorch inference
python detect.py --weights yolov5s.mlmodel # CoreML inference

EDIT: Results run on battery (95% state of charge). Will re-run tomorrow connected to power. |
I originally proposed the idea that Apple is collaborating with PyTorch. Now, I’m connecting the dots. PyTorch, your secret is going to be leaked real soon. I just need to do some final validation of the supporting evidence. I’ll announce my conclusions on this thread ASAP. |
Actually, running a simple model using Core ML on the M1 is super easy. If you test on a smaller image, it may even outperform a single GTX 1080 Ti machine. |
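For reference, a minimal sketch of running an exported .mlmodel from Python with coremltools; the file name, the "image" input key, and the 640x640 size are assumptions that depend on how the model was exported:

# Load a Core ML model and run a single prediction (Core ML picks Neural Engine / GPU / CPU).
import coremltools as ct
from PIL import Image

mlmodel = ct.models.MLModel("yolov5s.mlmodel")
img = Image.open("bus.jpg").resize((640, 640))  # must match the model's expected input size

out = mlmodel.predict({"image": img})
print(out.keys())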
First, my sincere apologies to those I'm not targeting. But to those who may think issues are where you can discuss anything, this is for you: |
Hey all, we are looking forward to your feedback on this new experimental feature!
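Assuming this refers to the experimental Apple-silicon "mps" backend shipped in the PyTorch nightlies around this time, basic usage looks roughly like this sketch:

# Run a tensor op on the M1 GPU via the experimental MPS backend, falling back to CPU.
import torch

if torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")  # backend not built or not available on this machine

x = torch.randn(1024, 1024, device=device)
y = torch.randn(1024, 1024, device=device)
z = x @ y  # executes on the M1 GPU when device is "mps"
print(z.device)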
|
🚀 Feature
Hi,
I was wondering if we could evaluate PyTorch's performance on Apple's new M1 chip. I'm also wondering how we could possibly optimize PyTorch's capabilities on M1 GPUs/neural engines.
I know the issue of supporting acceleration frameworks outside of CUDA has been discussed in previous issues like #488, but I think this is worth a revisit. In Apple's big reveal today, we learned that Apple's on a roll, with 50% of product usage growth this year coming from new users. Given that Apple is moving to these in-house designed chips, enhanced support for them could make deep learning on personal laptops a better experience for many researchers and engineers. I think this really aligns with PyTorch's theme of facilitating deep learning from research to production.
I'm not quite sure how this should go down. But these could be important:
cc @VitalyFedyunin @ngimel