Hybrid search #2969

Merged
18 commits merged into main on Sep 26, 2024
Conversation

@manyoso (Collaborator) commented Sep 19, 2024

The first three commits are not strictly necessary for hybrid search. The first one is important, though, as we should maintain the same order of results that the embedding search returns; even this has an impact on the BEIR test results.

The second and third commits address a problem in our current chunking strategy where the maximum chunk size is not strictly enforced. These two changes enforce a strict maximum chunk size without changing anything else about our chunking strategy.
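To illustrate what "strict enforcement" means here, a minimal sketch follows; the helper name, the token-vector representation, and the size limit are all hypothetical and stand in for the real chunker and tokenizer in gpt4all-chat:

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical illustration: regardless of how the higher-level chunking
// strategy grouped the text, split any run of tokens so that no chunk
// ever exceeds maxTokens.
std::vector<std::vector<std::string>> enforceMaxChunkSize(
    const std::vector<std::string> &tokens, size_t maxTokens)
{
    std::vector<std::vector<std::string>> chunks;
    for (size_t i = 0; i < tokens.size(); i += maxTokens) {
        size_t end = std::min(i + maxTokens, tokens.size());
        // Range-construct each chunk from the next maxTokens tokens (or fewer at the tail).
        chunks.emplace_back(tokens.begin() + i, tokens.begin() + end);
    }
    return chunks;
}
```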

The fourth commit is the actual hybrid search. It introduces an FTS virtual table and implements reciprocal rank fusion (RRF) to combine BM25 keyword search with the embedding search.

RRF paper: https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf

BEIR dataset paper: https://arxiv.org/pdf/2306.07471
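As a rough sketch of the fusion step (not the actual code in database.cpp; the integer chunk IDs, the two ranked input lists, and the k = 60 constant taken from the RRF paper linked above are assumptions for illustration):

```cpp
#include <algorithm>
#include <cstddef>
#include <unordered_map>
#include <vector>

// Reciprocal rank fusion: each retriever contributes 1 / (k + rank) for every
// chunk it returns; chunks are then sorted by their summed score. k = 60 is
// the constant suggested by Cormack et al.
std::vector<int> reciprocalRankFusion(const std::vector<std::vector<int>> &rankedLists,
                                      double k = 60.0)
{
    std::unordered_map<int, double> score; // chunk id -> fused score
    for (const auto &list : rankedLists) {
        for (size_t rank = 0; rank < list.size(); ++rank)
            score[list[rank]] += 1.0 / (k + static_cast<double>(rank) + 1.0);
    }

    std::vector<int> fused;
    fused.reserve(score.size());
    for (const auto &[id, s] : score)
        fused.push_back(id);
    std::sort(fused.begin(), fused.end(),
              [&](int a, int b) { return score[a] > score[b]; });
    return fused;
}

// Usage sketch: fuse the BM25 (FTS) ranking with the embedding-search ranking.
// auto finalRanking = reciprocalRankFusion({bm25ChunkIds, embeddingChunkIds});
```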

The following changes improve our performance across four BEIR datasets. A future change will integrate the test harness used to assess and make these changes. For now, here are screenshots showing some of the results:

[Screenshots of BEIR benchmark results]

I also tested k=3 with a 512 chunk size, which matches our LocalDocs defaults, and the numbers again showed improvement for hybrid search.

The one dataset that doesn't show a clear improvement at a 512 chunk size is FiQA, but it does show improvement with document-sized chunks. I'm still researching how to improve performance on this one.

Also: I'm considering adding a configuration option to turn hybrid search on/off, but I think this is good to go in as-is.

@manyoso marked this pull request as draft September 19, 2024 17:10
@manyoso marked this pull request as ready for review September 19, 2024 21:21
gpt4all-chat/src/database.cpp — review threads resolved
@manyoso merged commit 10d2375 into main on Sep 26, 2024
4 of 9 checks passed