Try sharing matrix in 2 threads instead of 4 #56

Closed
fancyIX opened this issue Sep 22, 2018 · 1 comment

fancyIX commented Sep 22, 2018

I split the ulong[192] matrix into 3 parts stored in registers and use DPP instructions to share it between 4 threads. Since this doesn't require LDS, the number of threads is not limited by LDS size, so there can be more threads in a wavefront. Storing the matrix in registers is supposed to be much faster, but I only see a 30% speedup. I guess syncing the data between 4x4 threads is probably not a good idea. I can try storing the matrix in 2 threads instead of 4. That will cause VGPR spills in one thread, but if the spill is not too large, the speedup may be near 2x.
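For a rough sense of the trade-off described above, the register cost per thread can be worked out by hand. The sketch below assumes 32-bit VGPRs (so each ulong takes two) and a ceiling of 256 VGPRs per thread; both constants and the function name are illustrative assumptions rather than figures from this issue, and the registers the rest of the kernel needs are not counted.

```python
# Back-of-the-envelope VGPR budget for holding the matrix in registers.
# Assumptions (not from the issue): 32-bit VGPRs, 256 VGPRs max per thread.

MATRIX_ULONGS = 192      # ulong[192] matrix from the issue
VGPRS_PER_ULONG = 2      # a 64-bit ulong occupies two 32-bit registers
MAX_VGPRS = 256          # assumed per-thread VGPR ceiling


def matrix_vgprs_per_thread(threads_sharing: int) -> int:
    """VGPRs each thread needs for its slice of the matrix alone."""
    total = MATRIX_ULONGS * VGPRS_PER_ULONG
    # Ceiling division, in case the matrix does not divide evenly.
    return -(-total // threads_sharing)


for n in (4, 2, 1):
    need = matrix_vgprs_per_thread(n)
    print(f"{n} thread(s): {need} VGPRs for the matrix, "
          f"{MAX_VGPRS - need} left for everything else")
```

Under these assumptions a 4-thread split needs 96 VGPRs per thread for the matrix and a 2-thread split needs 192, leaving 64 for all other kernel state, which would be consistent with the spill expected above.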

fancyIX self-assigned this Sep 22, 2018
fancyIX added a commit that referenced this issue Sep 24, 2018
Phi2 logically correct now.
fancyIX added a commit that referenced this issue Sep 24, 2018
Works correctly but very slow
fancyIX added a commit that referenced this issue Sep 24, 2018
Phi2 is good now. Pretty fast.
fancyIX added a commit that referenced this issue Sep 24, 2018

fancyIX commented Sep 24, 2018

The new release (beta5b) splits the matrix into 2 parts across 2 threads, instead of beta5's 4 threads. The speedup is over 35% on my Vega 56. I thought it could be nearly 2x faster, but it's not. I guess the cross-lane syncing can still be improved.
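The difference between the 4-thread and 2-thread layouts can be illustrated with a toy model of which reads have to cross lanes. This is only a sketch: the contiguous slicing, the uniform sweep over all indices, and every function name are assumptions made for illustration, not the kernel's actual layout or access pattern. Python list indexing stands in for the cross-lane (DPP) transfer.

```python
# Toy model of a matrix split across a group of lanes (threads).
# It simulates only the ownership/lookup logic; a real kernel would use
# cross-lane instructions (e.g. DPP) where this code indexes a list.

MATRIX_SIZE = 192  # ulong[192] matrix from the issue


def split_matrix(matrix, lanes):
    """Give each lane a contiguous slice (assumed layout)."""
    per_lane = len(matrix) // lanes
    return [matrix[i * per_lane:(i + 1) * per_lane] for i in range(lanes)]


def read_element(slices, index):
    """Return (value, owner_lane) for a global matrix index."""
    per_lane = len(slices[0])
    owner = index // per_lane
    return slices[owner][index % per_lane], owner


def cross_lane_reads(slices, reader_lane, indices):
    """Count reads that must come from another lane's registers."""
    return sum(1 for i in indices
               if read_element(slices, i)[1] != reader_lane)


matrix = list(range(MATRIX_SIZE))
for lanes in (4, 2):
    slices = split_matrix(matrix, lanes)
    remote = cross_lane_reads(slices, 0, range(MATRIX_SIZE))
    print(f"{lanes} lanes: lane 0 finds {MATRIX_SIZE - remote} elements "
          f"locally, {remote} in other lanes")
```

With a uniform sweep, the share of reads that cross lanes drops from three quarters to one half when going from 4 lanes to 2. The model says nothing about the cost of each transfer, so it does not by itself predict the measured speedup.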
