I split the ulong[192] matrix into 3 parts stored in registers and use DPP instructions to share it between 4 threads. Since this doesn't require LDS, the number of threads is not limited by LDS size, so there can be more threads in a wavefront. Storing the matrix in registers is supposed to be much faster, but I only see a 30% speedup. I guess syncing the data between 4x4 threads is probably not a good idea. I can try storing the matrix in 2 threads instead of 4. That will cause VGPR spills in one thread, but if the spill is not too large, the speedup may be near 2x.
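To make the register-sharing idea concrete, here is a rough host-side model in Python. It is not the kernel code: the even split of the matrix across lanes, the helper names `split_across_lanes` and `cross_lane_read`, and the use of plain lists as stand-ins for per-thread registers are all assumptions for illustration. On the GPU, the lookup in `cross_lane_read` is the step a DPP (cross-lane) instruction would perform.

```python
MATRIX_WORDS = 192  # ulong[192], as described in the post


def split_across_lanes(matrix, lanes):
    """Give each lane one contiguous slice, standing in for its private registers.

    Assumes an even split; the actual kernel may partition differently.
    """
    per_lane, remainder = divmod(len(matrix), lanes)
    if remainder:
        raise ValueError("matrix does not divide evenly across lanes")
    return [matrix[i * per_lane:(i + 1) * per_lane] for i in range(lanes)]


def cross_lane_read(lane_regs, index):
    """Fetch matrix[index] by locating the owning lane and its local register."""
    per_lane = len(lane_regs[0])
    owner, offset = divmod(index, per_lane)
    return lane_regs[owner][offset]


matrix = list(range(MATRIX_WORDS))
regs4 = split_across_lanes(matrix, 4)  # 4 lanes share one matrix
regs2 = split_across_lanes(matrix, 2)  # 2 lanes share one matrix
```

With 4 lanes, three out of four element reads land in another lane's registers on average; with 2 lanes it is one out of two, at the cost of each lane holding twice as many values.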
The new release (beta5b) splits the matrix into 2 parts for 2 threads, instead of beta5's 4 threads. The speedup is over 35% on my Vega 56. I thought it could be nearly 2x faster, but it's not. I guess some improvements can still be made to the cross-lane syncing.
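For a feel of the register pressure involved, a back-of-the-envelope calculation can help. The sketch below assumes an even split of the matrix, that each 64-bit value occupies two 32-bit VGPRs, and a limit of 256 VGPRs per thread (the GCN figure as I understand it). The function names are made up for illustration, and the real kernel's working-register needs are not known here.

```python
MATRIX_WORDS = 192          # ulong[192], from the post
VGPRS_PER_ULONG = 2         # a 64-bit value takes two 32-bit VGPRs
MAX_VGPRS_PER_THREAD = 256  # assumed GCN per-thread limit


def matrix_vgprs_per_thread(threads):
    """VGPRs each thread spends on its share of the matrix (even split assumed)."""
    return MATRIX_WORDS * VGPRS_PER_ULONG // threads


def vgprs_left_for_work(threads):
    """VGPRs remaining for everything else before spilling starts."""
    return MAX_VGPRS_PER_THREAD - matrix_vgprs_per_thread(threads)


for threads in (4, 2):
    print(threads, matrix_vgprs_per_thread(threads), vgprs_left_for_work(threads))
```

Under these assumptions, the 4-thread layout uses 96 VGPRs per thread for the matrix and the 2-thread layout uses 192, leaving only 64 for the rest of the kernel. That would be consistent with the spill mentioned earlier in the thread.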