# [BUG] `radix::select_k<half>` is slow #1725
Comments
Thanks for reporting the problem. It seems to be caused by a combination of a few factors. Although the SIFT-1M dataset is in float32 format, the values are actually integers in the range [0, 218]. For example, for one query, I counted the number of … The total number of input values is 25800. The last step of radix select is filtering the values less than or equal to the 129th value. Because the 129th value is `inf`, … One possible (but not good) solution is: … I feel the root problem is that the SIFT dataset is in float32 while it should be uint8.
Wow, thanks for the insight! Perhaps we can try some form of normalizing on the IVF-PQ side to remove unwanted infinities. However, I've just come to realize that many of the `inf` values actually come from the padding …
Also, if the filtering functionality is used in ANN methods, it will output even more infinities in place of filtered data. So we definitely need to make `select_k` handle these infinities efficiently.
I've come up with this solution: #1742, but it seems to reduce the recall even further on the sift-128 data (but QPS is great! :-D). Also, I feel like there must be a better solution (at least, limit the filtering to the zero-th pass only). With this solution, I'm not really sure what happens if there are fewer non-inf values than `k`.
Since it's expected that there will be lots of `inf` values, … The alternative way is changing radix select to treat `inf` specially: …
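One way the "treat `inf` specially" idea could look, sketched in plain C++ (the names and structure here are assumptions for illustration, not the PR's actual code):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical pre-pass: strip infinities before radix select so the
// histogram passes only ever see finite bit patterns. If fewer than k
// finite values survive, the tail of the top-k result is necessarily
// +inf, and we can report that directly without further passes.
struct FinitePrefilter {
    std::vector<float> finite;  // values the radix passes still must rank
    std::size_t inf_padding;    // how many output slots must be +inf
};

FinitePrefilter strip_infs(const std::vector<float>& in, std::size_t k) {
    FinitePrefilter out{{}, 0};
    for (float x : in)
        if (!std::isinf(x)) out.finite.push_back(x);
    if (out.finite.size() < k) out.inf_padding = k - out.finite.size();
    return out;
}
```

This also makes explicit the "fewer non-inf values than `k`" case raised above: the `inf_padding` field is exactly how many infs the output must contain.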
In my tests, by setting …
How about the change of QPS averaged over all queries? Treating `inf` specially … The slowdown case for data type `half` … So the question is whether it's worth optimizing such an edge case, which likely has low recall, at the cost of slowing down normal cases, or even radix select itself. The more I think about it, the more I feel that we don't need to do anything for this case.
To my understanding, these `inf` values …
I understand your concern. Adding a special case for a single input value is a logical burden for an algorithm. However, I don't think it adds any significant overhead in the common case. Let's have a look at the histogram+filter changes: the only thing it really adds to the loops is a single comparison …

That being said, I did more benchmarks to verify whether there are any slowdowns. Apparently, the extra code and variables increase the register pressure, especially for the single-block version with the types double+uint64_t. I was able to fight it back by adding …

Here are the results: …

So the conclusion is that the time with/without the PR is generally the same, but the edge cases are much faster (i.e. not abnormally slow).
Regarding the infs due to paddings: in the near future we're going to have many downstream projects using ANN pre-filtering (e.g. for "deleting" values without re-training). This will lead to more infinities in the valid range of the input. So I believe this is a very important use case (but also the overflow of `half` …).
Do I understand correctly that your main concern is the more complicated logic because of the extra …?

I think the proposed fix adds value in that it fixes the x10 slowdown in some edge cases at little to no cost to any of the other cases. Aside from the zero-th pass, it doesn't really complicate the logic that much. We could argue whether IVF-PQ should be fixed to not produce as many infinities etc., but I don't think it's relevant here. If the number of infinities in the input of radix-select is larger than `len - k`, some of them will inevitably appear in the output anyway.

I see your point about adding too many workarounds to radix-select, but I think the improvement is still worth it. I'd suggest we ask for a third opinion on this. CC @tfeher @cjnolet ?
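The counting argument here is worth spelling out; a tiny helper (hypothetical, for illustration only) captures when infs in the output are unavoidable regardless of the select_k implementation:

```cpp
#include <cstddef>

// Pigeonhole argument: among n inputs of which n_inf are +inf, the k
// smallest can all be finite only if at least k finite values exist,
// i.e. n - n_inf >= k. Otherwise every correct top-k must return infs.
constexpr bool topk_must_contain_inf(std::size_t n, std::size_t n_inf,
                                     std::size_t k) {
    return n - n_inf < k;
}
```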
For the code logic, I think the code change itself for the zero-th pass is fine because the intention is clear. I'm more worried about the effect. For the current code, … For radix-select performance, the current code handles …

From the figure I posted earlier, there are two problems with the "baseline.half" curve: both its QPS and recall are low. These two problems are closely related; actually, they are the result of the same root cause: too many overflowed `half` values. The PR fixes the low QPS, as shown by the pink curve, but the low recall stays the same. Without an improvement of the recall, the improvement of QPS doesn't matter much in this case, because the …

Overall, the PR improves the low QPS of `half` …
`matrix::select_k` for data type `half` is sometimes extremely slow and takes a significant fraction of `ivf_pq::search` time (especially for the non-fused search that is activated for k > 128). The offending implementation is `select::radix::select_k<half, ...>`.
Here are the nsys timelines and the tooltips from the last kernels:

- Data type: `half`
  NVTX: `matrix::select_k(batch_size = 1, len = 255616, k = 129)` - 268.671 μs
- Data type: `float`
  NVTX: `matrix::select_k(batch_size = 1, len = 255616, k = 129)` - 30.560 μs
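A quick sanity check on the magnitudes involved (assuming, per the discussion above, that SIFT values are integers in [0, 218] over 128 dimensions):

```cpp
// Why half-precision distances overflow here: a 128-dimensional squared
// L2 distance over values bounded by 218 can reach 128 * 218^2, which is
// orders of magnitude above the largest finite fp16 value (65504). Any
// such distance computed in half becomes +inf before select_k sees it.
constexpr double kHalfMax = 65504.0;                      // max finite IEEE binary16
constexpr double kWorstCaseDist = 128.0 * 218.0 * 218.0;  // = 6083072
static_assert(kWorstCaseDist > kHalfMax,
              "worst-case SIFT squared L2 distance overflows half");
```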