community: add milvus hybrid search retriever #20375

Closed
wants to merge 12 commits into from


BuxianChen
Contributor

@BuxianChen BuxianChen commented Apr 12, 2024

Description:
This PR adds a Milvus hybrid search retriever. The PR message is still incomplete, and I would like some advice on the implementation. The code I have committed runs successfully, but it still needs to be improved, refactored, and linted (I am having some problems with formatting and linting).

Here are my plans, where I need some help:

  1. Refactor the code; I would appreciate advice here. I also need to do more testing, since I believe there are many bugs in my current implementation.
  2. Pass the lint, format, and mypy checks, and add docstrings and type hints.
  3. Add more functionality: support hybrid search, single-vector search, and query (i.e. no vector semantic search) in MilvusHybridSearchRetriever.

Dependencies:

Milvus>=2.4.0
pymilvus>=2.4.0

Docs

  • an example notebook showing its use. It lives in docs/docs/integrations/retrievers/milvus_hybrid_search.ipynb.

Thank you for contributing to LangChain!

  • PR title: "package: description"

    • Where "package" is whichever of langchain, community, core, experimental, etc. is being modified. Use "docs: ..." for purely docs changes, "templates: ..." for template changes, "infra: ..." for CI changes.
    • Example: "community: add foobar LLM"
  • PR message: Delete this entire checklist and replace with

    • Description: a description of the change
    • Issue: the issue # it fixes, if applicable
    • Dependencies: any dependencies required for this change
    • Twitter handle: if your PR gets announced, and you'd like a mention, we'll gladly shout you out!
  • Add tests and docs: If you're adding a new integration, please include

    1. a test for the integration, preferably unit tests that do not rely on network access,
    2. an example notebook showing its use. It lives in docs/docs/integrations directory.
  • Lint and test: Run make format, make lint and make test from the root of the package(s) you've modified. See contribution guidelines for more: https://python.langchain.com/docs/contributing/

Additional guidelines:

  • Make sure optional dependencies are imported within a function.
  • Please do not add dependencies to pyproject.toml files (even optional ones) unless they are required for unit tests.
  • Most PRs should not touch more than one package.
  • Changes should be backwards compatible.
  • If you are adding something to community, do not re-import it in langchain.

If no one reviews your PR within a few days, please @-mention one of baskaryan, efriis, eyurtsev, hwchase17.

@dosubot dosubot bot added the size:XL This PR changes 500-999 lines, ignoring generated files. label Apr 12, 2024

vercel bot commented Apr 12, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
langchain ✅ Ready (Inspect) Visit Preview 💬 Add feedback May 1, 2024 10:24am

@dosubot dosubot bot added Ɑ: retriever Related to retriever module 🔌: milvus Primarily related to Milvus vector store integration labels Apr 12, 2024
@BuxianChen BuxianChen changed the title add milvus hybrid search retriever community: add milvus hybrid search retriever Apr 12, 2024
@BuxianChen BuxianChen marked this pull request as draft April 12, 2024 08:25
@BuxianChen BuxianChen marked this pull request as ready for review April 12, 2024 09:04
@dosubot dosubot bot added the 🤖:improvement Medium size change to existing code to handle new use-cases label Apr 12, 2024
@dosubot dosubot bot added size:XXL This PR changes 1000+ lines, ignoring generated files. and removed size:XL This PR changes 500-999 lines, ignoring generated files. labels Apr 12, 2024
@ccurme ccurme added the community Related to langchain-community label Jun 18, 2024
@ohadeytan
Contributor

ohadeytan commented Aug 6, 2024

Hi, that looks very good @BuxianChen!

As part of a group in IBM Research, we have similar needs: building and using a Milvus vector store with the ability to do hybrid search. We would like to keep using langchain rather than using pymilvus directly.

@efriis can you share what the intention is for this and similar PRs related to hybrid search?
I saw that under the partners directory you have a HybridSearchRetriever, but it lacks the collection creation and similar settings that exist in this PR and in the regular MilvusVectorStore.

We are willing to implement and submit PRs with your guidance.

@BuxianChen
Contributor Author

BuxianChen commented Aug 8, 2024

@ohadeytan Hi, I think this PR is out of date, and langchain_community/retrievers/milvus.py may be as well. Perhaps everything related to Milvus should be moved to the partner directory.

My PR was completed before their partner PR, but theirs has been merged. As you point out, though, their implementation lacks the collection creation and similar settings found in the regular MilvusVectorStore.

I think you can talk to langchain's core developers and then integrate the missing part into the partner directory. This work needs some refactoring, as I borrowed a lot of code from MilvusVectorStore.

Best wishes!

@BuxianChen
Contributor Author

By the way, I think the awkward parts are:

  • BaseVectorStore is assumed to have a dense vector embedding.
  • BaseRetriever has no abstract interface like add_document.

I'm also confused about how to deal with that.
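
To make the second point concrete, here is a minimal, library-free sketch of what a retriever interface with an ingestion method could look like. Everything in it is hypothetical and for illustration only: `Doc`, `WritableRetriever`, and `KeywordRetriever` are not LangChain classes, and the keyword-overlap scoring merely stands in for a real sparse or hybrid search.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Doc:
    """Hypothetical stand-in for a document object."""
    text: str
    metadata: Dict[str, str] = field(default_factory=dict)


class WritableRetriever(ABC):
    """Hypothetical interface: retrieval plus an abstract ingestion method."""

    @abstractmethod
    def add_documents(self, docs: List[Doc]) -> None:
        """Store documents so that later queries can find them."""

    @abstractmethod
    def get_relevant_documents(self, query: str, k: int = 4) -> List[Doc]:
        """Return up to k documents relevant to the query."""


class KeywordRetriever(WritableRetriever):
    """Toy implementation that needs no dense embedding at all."""

    def __init__(self) -> None:
        self._docs: List[Doc] = []

    def add_documents(self, docs: List[Doc]) -> None:
        self._docs.extend(docs)

    def get_relevant_documents(self, query: str, k: int = 4) -> List[Doc]:
        terms = set(query.lower().split())
        # Score each document by the number of query terms it shares.
        scored = [(len(terms & set(d.text.lower().split())), d) for d in self._docs]
        hits = [(s, d) for s, d in scored if s > 0]
        hits.sort(key=lambda pair: -pair[0])  # stable sort, best score first
        return [d for _, d in hits[:k]]


retriever = KeywordRetriever()
retriever.add_documents([Doc("milvus hybrid search"), Doc("unrelated text")])
results = retriever.get_relevant_documents("hybrid search")
```

With an interface of this shape, collection setup and ingestion could live on the retriever itself instead of being borrowed from a dense-only vector store class.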

@ohadeytan
Contributor

@BuxianChen, yeah, it seems they are moving to the partners directory, but the question remains: do they support this kind of change, and can they provide feedback and guidance?

@zc277584121
Contributor

Thank you for your contribution.
Vector store may refer to "dense vector store", so sparse and hybrid functions may need to be placed under Retriever. There is now an implementation, MilvusCollectionHybridSearchRetriever:
https://python.langchain.com/v0.2/docs/integrations/retrievers/milvus_hybrid_search/
Its usage looks like this:

collection = Collection(
    ...
)
retriever = MilvusCollectionHybridSearchRetriever(
    collection=collection,
    ...
)

In the near future, the MilvusClient SDK will support hybrid search.
The ideal implementation would then look like this:

from pymilvus import MilvusClient

client = MilvusClient("milvus_demo.db")

MilvusHybridSearchRetriever(
    client=client,
    ...
)
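
For readers unfamiliar with what "hybrid" means here: a hybrid search runs more than one vector search (for example one dense and one sparse) and then merges the hit lists into a single ranking. One common merging strategy is reciprocal rank fusion (RRF). The sketch below is a plain-Python illustration of that idea, not the code of any retriever mentioned in this thread; the function name, the document ids, and the constant `k=60` are illustrative assumptions.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of document ids into one ranking.

    Each document receives 1 / (k + rank) from every list it appears in
    (rank starts at 1); documents are returned by descending total score.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Break ties by id so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical hit lists from a dense search and a sparse search.
dense_hits = ["doc_a", "doc_b", "doc_c"]
sparse_hits = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
print(fused)
```

Documents that rank well in both lists (here `doc_b` and `doc_a`) rise above documents found by only one of the searches.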

@efriis
Member

efriis commented Aug 26, 2024

Closing because this should be in the partner package! It seems like it might already be there too.

@efriis efriis closed this Aug 26, 2024
efriis added a commit that referenced this pull request Aug 30, 2024
…25284)

# Description
Milvus (and `pymilvus`) recently added the option to use [sparse
vectors](https://milvus.io/docs/sparse_vector.md#Sparse-Vector) with
appropriate search methods (e.g., `SPARSE_INVERTED_INDEX`) and
embeddings (e.g., `BM25`, `SPLADE`).

This PR allows creating a vector store using langchain's `Milvus` class,
setting the matching vector field type to `DataType.SPARSE_FLOAT_VECTOR`
and the default index type to `SPARSE_INVERTED_INDEX`.

It only extends functionality and is backward compatible.
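
As a rough illustration of what a sparse vector is, the sketch below stores only the non-zero dimensions as an `{index: weight}` mapping and scores a query against a document with an inner product. It is a plain-Python sketch under stated assumptions, not the code of this PR and not a `pymilvus` call; the function name and the example weights are made up.

```python
from typing import Dict

SparseVector = Dict[int, float]  # dimension index -> non-zero weight


def sparse_inner_product(a: SparseVector, b: SparseVector) -> float:
    """Inner product of two sparse vectors stored as {index: weight}."""
    # Iterate over the smaller vector; only shared dimensions contribute.
    if len(a) > len(b):
        a, b = b, a
    return sum(weight * b[index] for index, weight in a.items() if index in b)


# Hypothetical term weights, e.g. as produced by a BM25 or SPLADE encoder.
query_vec: SparseVector = {3: 0.5, 17: 1.0}
doc_vec: SparseVector = {3: 2.0, 42: 0.7}

score = sparse_inner_product(query_vec, doc_vec)
print(score)
```

Only dimension 3 is shared by both vectors, so it alone contributes to the score.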

## Note
I am also interested in extending the Milvus class further to support
multi-vector search (aka hybrid search). I will be happy to discuss that. See
[here](#19955),
[here](#20375), and
[here](#22886)
for similar needs.

---------

Co-authored-by: Erick Friis <erick@langchain.dev>
Labels
community Related to langchain-community 🤖:improvement Medium size change to existing code to handle new use-cases 🔌: milvus Primarily related to Milvus vector store integration Ɑ: retriever Related to retriever module size:XXL This PR changes 1000+ lines, ignoring generated files.

6 participants