Use of data prefetching? #203
Hi James, good idea! I was actually doing some profiling yesterday (on a physical machine) and looked at that exact line in the [...]
You just need to run CMake with [...]
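(The exact flag is elided above; as an assumption, the standard CMake way to get an optimized build that still carries the debug symbols perf needs is the RelWithDebInfo build type:)

```sh
# Assumption: the usual CMake incantation for profiling builds,
# not necessarily the exact command from the elided reply.
cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo -B build
cmake --build build
```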
I’ll try that.
I’m not entirely sure, but I think most of the time for my test dataset is spent not in the hash table lookups, but in the code after it (the [...])
I made a mistake in my first assessment. The loop with the hash table lookups is responsible for a large chunk of the runtime of [...]. Now the problem is that I need to figure out how to actually prefetch the next hash table entry. I tried your suggestion above, but there’s no difference. I think we would need to issue an actual prefetch instruction that makes the memory access in the background while the CPU continues to do its work. I’ll need to come back to this later.
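A minimal sketch of such an instruction, assuming GCC or Clang, whose `__builtin_prefetch` compiles down to a non-blocking prefetch hint (e.g. `prefetcht0` on x86):

```cpp
// GCC/Clang builtin: emits a prefetch hint and returns immediately,
// so the cache line is fetched in the background while the CPU keeps
// working. Arguments: address, rw (0 = read, 1 = write), locality (0..3).
inline void prefetch_read(const void* ptr) {
    __builtin_prefetch(ptr, /*rw=*/0, /*locality=*/3);
}
```

The prefetch is only a hint: issuing it too late (or for an address computed just before the load) gains nothing, which may be why the suggestion above made no difference.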
This is just an idea, which may go nowhere at all. :)
Profiling strobealign, I see the most CPU-hungry function is find_nams, and the most CPU-intensive bit of that, I think, is
https://github.com/ksahlin/strobealign/blob/main/src/aln.cpp#L282
Perf record / report shows (I've no idea how to get this working with debug info and CMake): [...]
I.e. 39% of all CPU time for this function is spent waiting on one of those memory moves (the cost is sometimes attributed to an instruction one or two before the one actually responsible, due to pipelining). This is to be expected in any application that uses a large hash table and randomly jumps around main memory. I expect perf stat would tell me it's frontend or backend idle time, but sadly this machine is virtual and isn't exposing individual hardware CPU counters.
I've had experience elsewhere with speeding up memory fetches by computing the address that's going to be used a couple of loop iterations ahead and manually issuing a hardware prefetch. It's then in cache by the time we get around to using it.
Here we may even be able to do this by doing something like:
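A sketch of the idea, with a hypothetical flat open-addressing table standing in for strobealign's actual index (the names, layout, and power-of-two sizing are assumptions for illustration, not the real code; `__builtin_prefetch` assumes GCC/Clang):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical open-addressing table standing in for the real index;
// only the memory-access pattern matters for this sketch.
struct Entry { uint64_t key; uint32_t offset; };

void process_hit(const Entry& e);  // stand-in for the NAM-building work

// Assumes table.size() is a power of two so `hash & mask` picks a slot.
void find_hits(const std::vector<Entry>& table,
               const std::vector<uint64_t>& hashes) {
    const size_t mask = table.size() - 1;
    for (size_t i = 0; i < hashes.size(); i++) {
        // Kick off the memory fetch for the *next* hash's slot...
        if (i + 1 < hashes.size())
            __builtin_prefetch(&table[hashes[i + 1] & mask], 0, 3);
        // ...then process the current hit while that fetch is in flight.
        const Entry& e = table[hashes[i] & mask];
        if (e.key == hashes[i])
            process_hit(e);
    }
}
```

With a chained or std::unordered_map-style table the same trick requires the map to expose its bucket addresses, which is essentially the "next value out of an implicit iterator" problem mentioned below.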
So while it's fetching the next hit, it's processing the previous one.
I'm a low-level C coder though and unfortunately know nothing about C++, so how to get things like the next value out of an implicit iterator is beyond me.
Has anyone looked at the possibility of improving instruction pipelining by prefetching memory addresses? I can't say how much difference it'll make without trying it, but as I say, doing that in C++ is beyond my knowledge.