Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pinecone Hybrid Search #659

Open
wants to merge 9 commits into
base: main
Choose a base branch
from
Open

Conversation

trevorpfiz
Copy link
Contributor

I tried to emulate the langchain python implementation of Pinecone Hybrid Search. I also used the Supabase Hybrid Search implementation as a reference. Thanks @hwchase17 @ericciarla

I went with BertWordPieceTokenizer from tokenizers as the tokenizer. I also had to update the @pinecone-database/pinecone to ^0.0.12 as I could not add sparseVectors otherwise.

Not 100% sure if everything is in working order yet, but I will work on testing it in my local project when I get the chance. A good starting point nonetheless!

@vercel
Copy link

vercel bot commented Apr 7, 2023

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Updated (UTC)
langchainjs-docs ✅ Ready (Inspect) Visit Preview Apr 11, 2023 6:02am

@trevorpfiz trevorpfiz changed the title Added Pinecone Hybrid Search Adding Pinecone Hybrid Search Apr 7, 2023
@trevorpfiz trevorpfiz changed the title Adding Pinecone Hybrid Search Pinecone Hybrid Search Apr 7, 2023
@trevorpfiz trevorpfiz changed the title Pinecone Hybrid Search WIP Pinecone Hybrid Search Apr 7, 2023
@trevorpfiz
Copy link
Contributor Author

I had an oversight with using the tokenizers library. It only supports "node": ">=10 < 11 || >=12 <14". Will evaluate alternative methods to create the sparse vectors.

@trevorpfiz trevorpfiz changed the title WIP Pinecone Hybrid Search Pinecone Hybrid Search Apr 9, 2023
@trevorpfiz
Copy link
Contributor Author

@nfcampos @hwchase17 I think this is ready to be reviewed whenever the chance.

I am using bert-tokenizer as the tokenizer for now, but down the road we will want to support a greater variety of tokenizer options. I initially tried to implement tokenizers from Hugging Face, but the node bindings are too old.

@nfcampos
Copy link
Collaborator

Note to me: use tiktoken instead of bert

@nfcampos nfcampos self-assigned this Apr 13, 2023
const embeddings = new OpenAIEmbeddings();
const pineconeIndex = client.Index(process.env.PINECONE_INDEX!);

const tokenizer = new BertTokenizer(undefined, true, 512);
Copy link
Collaborator

@dqbd dqbd May 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider using js-tiktoken instead, which should already be a required dependency.

Comment on lines +40 to +41
this.topK = args.topK;
this.alpha = args.alpha;
Copy link
Collaborator

@dqbd dqbd May 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: the default values are being overriden, should that be the case? If not, consider making alpha: number and topK: number

Suggested change
this.topK = args.topK;
this.alpha = args.alpha;
this.topK = args.topK ?? 4;
this.alpha = args.alpha ?? 0.5;

@@ -1,5 +1,9 @@
export { RemoteRetriever } from "./remote/base.js";
export { ChatGPTPluginRetriever } from "./remote/chatgpt-plugin.js";
export {
PineconeHybridSearchRetriever,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of adding a file into a barrel file, create a new entrypoint instead.


pineconeIndex: VectorOperationsApi;

tokenizer: BertTokenizer;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would make more sense for the sparse encoder to be more configurable, not assuming BERT tokenization.

alpha: number
): [number[], SparseValues] {
if (alpha < 0 || alpha > 1) {
throw new Error("Alpha must be between 0 and 1");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: consider validating this.alpha in the constructor.

nit: do we need to pass this.alpha as a parameter? Why not access it directly?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: consider validating this.alpha in the constructor.
nit: do we need to pass this.alpha as a parameter? Why not access it directly?

} from "@pinecone-database/pinecone/dist/pinecone-generated-ts-fetch";
import { BertTokenizer } from "bert-tokenizer";

import { PineconeLibArgs, PineconeMetadata } from "vectorstores/pinecone.js";
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Avoid absolute imports, as this causes the example not to build

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants