Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Diskann Benchmarking Wrapper #260

Open
wants to merge 72 commits into
base: branch-24.12
Choose a base branch
from

Conversation

tarang-jain
Copy link
Contributor

@tarang-jain tarang-jain commented Jul 29, 2024

Brings DiskANN into cuvs-bench

  • Build and search in-memory DiskANN index
  • Build and search SSD DiskANN index
  • Build a cuvs Vamana index on GPU and serialize it in DiskANN format. Search on CPU using in-memory DiskANN search API.

@tarang-jain tarang-jain added feature request New feature or request non-breaking Introduces a non-breaking change labels Jul 29, 2024
@tarang-jain tarang-jain self-assigned this Jul 29, 2024
@github-actions github-actions bot added the Python label Aug 3, 2024
@tarang-jain tarang-jain marked this pull request as ready for review October 15, 2024 22:07
@tarang-jain tarang-jain requested review from a team as code owners October 15, 2024 22:07
Copy link
Contributor

@achirkin achirkin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR, it's nice to see the list of supported algorithms growing.

I feel unsettled though about adding new functionality to the common benchmark.hpp header just for one algorithm - this looks unsustainable. I see you probably want to avoid having the user to specify the path to the index file / dataset manually, but I think adding the new functionality in such an ad-hoc way is not a solution.
I think, the easiest solution could be just to pass both paths as algo parameters.
Maybe you can automate this in the python layer of the benchmark. Alternatively, I think it's acceptable to modify the parse_build_param api to give it access to the full json config file; that way, you could get the path to the dataset and fill-in the algo parameters inside that function.

Comment on lines +117 to +121
void cuvs_vamana<T, IdxT>::search(
const T* queries, int batch_size, int k, algo_base::index_type* neighbors, float* distances) const
{
diskann_memory_search_->search(queries, batch_size, k, neighbors, distances);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this algorithm does not use the CUDA stream during search. If this is true, please add the uses_stream() override to disable GPU timing to avoid extra sync overheads and make the benchmark results more fair.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The GPU build requires the stream, whereas the search is done on CPU. So we would still need the stream for the build part. So we would still have to set uses_stream() to true for the whole algorithm (which is the default)? I believe we are currently doing it the same way in CAGRA + HNSWLIB.

cpp/bench/ann/src/common/benchmark.hpp Outdated Show resolved Hide resolved
cpp/bench/ann/src/diskann/diskann_wrapper.h Outdated Show resolved Hide resolved
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
benchmarking CMake cpp feature request New feature or request non-breaking Introduces a non-breaking change Python
Projects
Status: In Progress
Development

Successfully merging this pull request may close these issues.

4 participants