Info from lyclminer dev on Lyra2 4x4 vs Lyra2 8x8 on AMD #54

Closed
mikerodey opened this issue Sep 16, 2018 · 7 comments
Comments

@mikerodey
Collaborator

Since you are dealing with a Lyra2 8x8 matrix on AMD hardware, I thought the following info from the lyclminer dev may be helpful. I reached out to him when trying to implement a Lyra2 8x8 matrix for an AMD-optimized allium miner. Sorry for opening an issue, but I don't know how else to reach you, and sorry if this info is useless to you.

......

James Lovejoy from VTC asked me about lyra8x8 long ago during the lyclMiner release.
However, I have to disappoint you: most optimizations can't be reused. I mean, you won't be able to use my approach to match the performance of closed-source implementations.
Lyra2 uses a Blake2b-like round, which can be parallelized across up to 4 lanes without any noticeable performance impact.
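
To illustrate why such a round maps onto 4 lanes, here is a minimal Python sketch. It assumes the message-less Blake2b G function on a 16-word state (an assumption, since the email does not spell out the round) and models "lanes" as list positions. The function names are hypothetical and this is not the lyclMiner kernel. Each lane holds one column (a, b, c, d). The column step needs no communication, and the diagonal step only needs b, c and d rotated by 1, 2 and 3 lanes.

```python
MASK = (1 << 64) - 1  # all arithmetic is on 64-bit words


def rotr64(x, n):
    """Rotate a 64-bit word right by n bits."""
    return ((x >> n) | (x << (64 - n))) & MASK


def g(a, b, c, d):
    """Blake2b G mixing function without message words (assumed round function)."""
    a = (a + b) & MASK
    d = rotr64(d ^ a, 32)
    c = (c + d) & MASK
    b = rotr64(b ^ c, 24)
    a = (a + b) & MASK
    d = rotr64(d ^ a, 16)
    c = (c + d) & MASK
    b = rotr64(b ^ c, 63)
    return a, b, c, d


def round_reference(v):
    """One round applied sequentially: 4 column mixes, then 4 diagonal mixes."""
    v = list(v)
    pattern = [(0, 4, 8, 12), (1, 5, 9, 13), (2, 6, 10, 14), (3, 7, 11, 15),
               (0, 5, 10, 15), (1, 6, 11, 12), (2, 7, 8, 13), (3, 4, 9, 14)]
    for i0, i1, i2, i3 in pattern:
        v[i0], v[i1], v[i2], v[i3] = g(v[i0], v[i1], v[i2], v[i3])
    return v


def rotate_lanes(xs, k):
    """Lane i receives the value held by lane (i + k) % 4."""
    return [xs[(i + k) % 4] for i in range(4)]


def round_four_lanes(v):
    """Same round, written as 4 independent lanes plus cross-lane rotations."""
    a, b, c, d = v[0:4], v[4:8], v[8:12], v[12:16]
    # Column step: every lane mixes its own (a, b, c, d).
    a, b, c, d = (list(t) for t in zip(*[g(a[i], b[i], c[i], d[i]) for i in range(4)]))
    # Diagonal step: shift b, c, d across lanes, mix, then shift back.
    b, c, d = rotate_lanes(b, 1), rotate_lanes(c, 2), rotate_lanes(d, 3)
    a, b, c, d = (list(t) for t in zip(*[g(a[i], b[i], c[i], d[i]) for i in range(4)]))
    b, c, d = rotate_lanes(b, 3), rotate_lanes(c, 2), rotate_lanes(d, 1)
    return a + b + c + d


state = [(i * 0x0123456789ABCDEF) & MASK for i in range(16)]
print(round_reference(state) == round_four_lanes(state))
```

On a GPU the lane rotations would be done with cross-lane operations rather than list indexing. The sketch only shows that the data dependencies allow it.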

Lyra2REv2 uses a Lyra2 4x4 matrix -> 4 * 4 * 24 = 384 (uint32) registers / 4 lanes = 96 registers per lane. All AMD GCN GPUs have 256 registers available (per lane).
This allowed me to implement it within a 128-register limit and achieve impressive performance numbers. Lyra2 1x4x4 takes nearly the same amount of time as CubeHash256.
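
The register arithmetic above can be checked with a small sketch. The constant of 24 uint32 values per matrix cell and the 256-register limit are taken from the figures quoted here; the function and constant names are made up for illustration.

```python
UINT32_PER_CELL = 24        # from the 4 * 4 * 24 figure quoted above
GCN_VGPRS_PER_LANE = 256    # per-lane register limit quoted above


def matrix_registers_per_lane(rows, cols, lanes=4):
    """uint32 registers each lane needs when the matrix is split evenly across lanes."""
    total = rows * cols * UINT32_PER_CELL
    return total // lanes


per_lane = matrix_registers_per_lane(4, 4)
print(per_lane, per_lane <= 128, per_lane <= GCN_VGPRS_PER_LANE)  # 96 True True
```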
Regarding ASM kernel sources.
I am assuming you mean some high-level OpenCL C code with inline ASM stuff. It doesn't exist. The ASM kernel is nearly the same as the fallback version.
At that time, the AMD driver compiler was extremely bad for this particular algorithm. The Lyra2 Wandering phase was a "compiler bug generator".
I've had some email discussions with AMD compiler developers. Several critical bugs were fixed. The Wandering phase was rewritten 46 times. It doesn't spill registers anymore.
A "fallback" kernel was compiled -> disassembled -> tweaked and assembled again.
Last time I checked, there was an 18-21% performance difference between the OpenCL C and ASM versions on Windows. If the compiler is good, it will be 5-8% or less.
I haven't included disassembled text-based versions, because they can't be used with the OpenCL API, so they would only waste space inside the repo.
You can always disassemble binaries and look inside.
There are differences:

  • Register count was reduced from 168 -> 125/128.
  • LDS load/store ops were replaced with DPP and ds_swizzle (GCN 2) instructions.
    These slightly improve performance, but not that much in this case, because there are enough ALU ops to hide the latency of LDS instructions.
    AMD improves their compiler over time, especially on the ROCm platform. The performance gap should be much lower there.
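
One way to see why the drop from 168 to 125/128 registers matters is to estimate how many wavefronts can be resident per SIMD. The sketch below is a rough model and not something stated in the email. It assumes 256 VGPRs per lane (as quoted above), a cap of 10 wavefronts per SIMD, and allocation in blocks of 4 registers; the last two values are assumptions about GCN, and the names are hypothetical.

```python
VGPRS_PER_LANE = 256       # per-lane register file size quoted in the email
MAX_WAVES_PER_SIMD = 10    # assumption: GCN hardware cap on resident wavefronts
ALLOC_GRANULARITY = 4      # assumption: VGPRs are allocated in blocks of 4


def waves_per_simd(vgprs_used):
    """Estimate resident wavefronts per SIMD when limited only by VGPR usage."""
    # Round the kernel's register count up to the allocation granularity.
    allocated = -(-vgprs_used // ALLOC_GRANULARITY) * ALLOC_GRANULARITY
    return min(MAX_WAVES_PER_SIMD, VGPRS_PER_LANE // allocated)


for vgprs in (168, 128, 125):
    print(vgprs, waves_per_simd(vgprs))
```

Under these assumptions, 168 registers leaves room for one wavefront per SIMD, while 125 or 128 leaves room for two, which gives the scheduler more work to hide latency with.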

Lyra2 1x8x8 (Allium, PHI2 and Lyra2z) uses a Lyra2 8x8 matrix -> 8 * 8 * 24 = 1536 (uint32) registers / 4 lanes = 384 registers + about 50 registers for other stuff.
The compiler will generate a lot of replays into Global Memory, resulting in huge pipeline stalls. This will be very slow, especially compared to current closed-source lyra2z and phi2 implementations.
Lyra8x8 is a different algorithm from the GPU implementation perspective. There are a lot of performance/power usage trade-offs here.
The fastest approach may not be the best one, since mining is all about perf/watt.
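
For comparison, a short sketch of the per-lane register budget for the 8x8 case. It uses only the figures quoted in this email (24 uint32 per matrix cell, 4 lanes, about 50 extra registers, a 256-register limit); the function name and the idea of reporting an "excess" are illustrative, not part of lyclMiner.

```python
def lane_budget(rows, cols, lanes=4, overhead=0, limit=256):
    """Return (registers needed per lane, registers over the per-lane limit)."""
    matrix_regs = rows * cols * 24 // lanes   # 24 uint32 per matrix cell, as quoted
    needed = matrix_regs + overhead
    return needed, max(0, needed - limit)


print(lane_budget(8, 8, overhead=50))  # (434, 178)
```

With roughly 178 registers per lane over the limit, the matrix cannot stay resident in registers with a straightforward 4-lane split.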

@fancyIX
Owner

fancyIX commented Sep 17, 2018

@mikerodey Thanks for the info. I already realized the matrix is too big to store in registers, since the number of registers is so limited. I also tried splitting the matrix into 3 parts to see if that can fit into registers, but it is not working so far. We may need to split the matrix further so that 256 registers are enough for one lane.

@mikerodey
Collaborator Author

Good luck! The allium coins are tentatively planning to fork to an alliumv2 algorithm that is identical except for using a 4x4 Lyra2 matrix rather than an 8x8 Lyra2 matrix in order to solve this problem. Nobody in the community has the expertise to solve this issue without a lot of time, and the coins are pushing for a viable AMD miner sooner rather than later.

@fancyIX
Owner

fancyIX commented Sep 22, 2018

@mikerodey I split the ulong[192] matrix into 3 parts to store in registers and use DPP instructions to share it between 4 threads. Since it doesn't require LDS, the number of threads is not limited by LDS size, so there can be more threads in a wavefront. Storing the matrix in registers is supposed to be much faster, but I only see a 30% speedup. I guess syncing the data between 4x4 threads is probably not a good idea. I can try to store the matrix in 2 threads instead of 4. That will cause VGPR spills in one thread, but if the spill is not too large, the speedup may be near 2x. Just an idea.

@mikerodey
Collaborator Author

mikerodey commented Sep 22, 2018

Wow, thanks so much for working on this! I was working on integrating your Lyra2 from phi2 into an allium fork of lyclminer, but I'm doing something wrong and it doesn't submit any valid shares. https://github.com/TuxcoinOrg/lyclMiner/tree/alliumv1_experimental

I was next going to try to implement it in an allium fork of sgminer because I figured it may be easier to get it working there. I think using your Lyra2 implementation could significantly improve AMD mining performance on allium, so I will continue to try (I'll get your latest commit). If you are able to improve it even further, that would be huge. If I am successful, I'll be sure to report back what kind of speed improvement I see compared to what allium currently has available for AMD.

@fancyIX fancyIX closed this as completed Sep 24, 2018
@fancyIX
Owner

fancyIX commented Sep 24, 2018

@mikerodey The new release (beta5b) splits the matrix into 2 parts for 2 threads, instead of beta5's 4 threads. The speedup is over 35% on my Vega 56. I thought it could be near 2x faster, but it's not. I guess some improvements can be made to the cross-lane syncing.

@fancyIX
Owner

fancyIX commented Sep 25, 2018

@mikerodey I plan to add allium support in this sgminer. If you strongly object, I can stop.

@mikerodey
Collaborator Author

No objections at all! Very excited about allium support being added!
