Try sharing matrix in 2 threads instead of 4 #56

fancyIX · 2018-09-22T17:07:47Z

I split the ulong[192] matrix into 3 parts stored in registers and use DPP instructions to share it between 4 threads. Since this doesn't require LDS, the thread count is not limited by LDS size, so more threads can be resident at once. Storing the matrix in registers is supposed to be much faster, but I only see about a 30% speedup. I guess syncing the data between 4x4 threads is probably not a good idea. I can try sharing the matrix between 2 threads instead of 4. That will cause VGPR spilling in one thread, but if the spill is small, the speedup may be close to 2x.
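
To make the register pressure concrete, here is a rough back-of-the-envelope sketch of how many VGPRs the matrix alone would take per thread. It assumes an even split of the matrix across the sharing threads, 32-bit VGPRs, and a limit of 256 VGPRs per thread on GCN/Vega. None of these numbers come from the kernel itself, the function names are made up for illustration, and the actual layout (for example the 3-part split mentioned above) may differ.

```python
# Rough VGPR budget for a ulong[192] matrix held entirely in registers.
# Assumptions (not taken from the kernel source): even split across the
# sharing threads, 32-bit VGPRs, and at most 256 VGPRs per thread.
MATRIX_ELEMS = 192
BYTES_PER_ULONG = 8
BYTES_PER_VGPR = 4
MAX_VGPRS_PER_THREAD = 256  # assumed GCN/Vega per-thread limit


def vgprs_per_thread(threads_sharing: int) -> int:
    """VGPRs each thread needs for its slice of the matrix (rounded up)."""
    total_vgprs = MATRIX_ELEMS * BYTES_PER_ULONG // BYTES_PER_VGPR
    return -(-total_vgprs // threads_sharing)  # ceiling division


def headroom(threads_sharing: int) -> int:
    """VGPRs left for hash state and temporaries before spilling starts."""
    return MAX_VGPRS_PER_THREAD - vgprs_per_thread(threads_sharing)


if __name__ == "__main__":
    for n in (4, 2):
        print(f"{n} threads: {vgprs_per_thread(n)} VGPRs for the matrix, "
              f"{headroom(n)} left for everything else")
```

Under these assumptions, sharing between 4 threads costs 96 VGPRs per thread, while sharing between 2 threads costs 192 and leaves only 64 for the rest of the kernel, which is consistent with expecting some spill in the 2-thread variant.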

fancyIX · 2018-09-24T17:51:53Z

The new release (beta5b) splits the matrix into 2 parts across 2 threads, instead of beta5's 4 threads. The speedup is over 35% on my Vega 56. I expected it to be nearly 2x faster, but it is not. I suspect the cross-lane syncing can still be improved.

fancyIX self-assigned this Sep 22, 2018

fancyIX added a commit that referenced this issue Sep 24, 2018

Issue #56

7c4d90e

Phi2 logically correct now.

fancyIX added a commit that referenced this issue Sep 24, 2018

Issue #56

fe68418

Works correctly but very slow

fancyIX added a commit that referenced this issue Sep 24, 2018

Issue #56

44b3c01

Phi2 is good now. Pretty fast.

fancyIX closed this as completed in a563c64 Sep 24, 2018

fancyIX added a commit that referenced this issue Sep 24, 2018

Merge pull request #58 from fancyIX/feature/#56

2e2f2a8

Feature/#56
