Web research retriever #8102

Merged · 12 commits merged into langchain-ai:master on Jul 26, 2023

Conversation

@rlancemartin (Collaborator) commented Jul 21, 2023:

Given a user question, this will:

  • Use an LLM to generate a set of search queries.
  • Run a search for each query.
  • Store the URLs from the search results in self.urls.
  • Check for any new URLs that haven't been processed yet (i.e., not in self.url_database).
  • Load, transform, and add only these new URLs to the vectorstore.
  • Query the vectorstore for documents relevant to the questions generated by the LLM.
  • Return only unique documents as the final result.

This avoids reprocessing URLs across multiple runs of similar queries, which should improve the retriever's performance. It also keeps track of every URL that has been processed, which can be useful for debugging or for understanding the retriever's behavior.
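Below is a minimal sketch of that flow. The helper names (generate_queries, search, load_and_split) and the bare vectorstore/url_database parameters are illustrative stand-ins for the components the retriever actually wires together, not the merged implementation:

    from typing import Callable, List

    def web_research(
        question: str,
        generate_queries: Callable[[str], List[str]],  # LLM-backed query generation
        search: Callable[[str], List[str]],            # returns result URLs for one query
        load_and_split: Callable[[List[str]], list],   # load pages, transform, split into docs
        vectorstore,                                   # supports add_documents / similarity_search
        url_database: List[str],                       # URLs processed on previous runs
    ) -> list:
        # 1. Use the LLM to generate a set of search queries
        queries = generate_queries(question)
        # 2. Search for each query and collect result URLs
        urls = [url for q in queries for url in search(q)]
        # 3. Index only URLs that haven't been processed yet
        new_urls = [u for u in urls if u not in url_database]
        if new_urls:
            vectorstore.add_documents(load_and_split(new_urls))
            url_database.extend(new_urls)
        # 4. Query the vectorstore per generated question, then deduplicate
        docs = [d for q in queries for d in vectorstore.similarity_search(q)]
        seen, unique_docs = set(), []
        for d in docs:
            if d.page_content not in seen:
                seen.add(d.page_content)
                unique_docs.append(d)
        return unique_docs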


@dosubot added labels on Jul 21, 2023: Ɑ: embeddings (Related to text embedding models module), 🤖:enhancement (A large net-new component, integration, or chain).
os.environ["GOOGLE_API_KEY"] = self.GOOGLE_API_KEY
search = GoogleSearchAPIWrapper()
except Exception as e:
print(f"Error: {str(e)}")
Contributor:

stray print

don't we want to raise here?
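A sketch of the suggested change, letting the failure surface instead of printing it (wrapping in RuntimeError is one option, not necessarily what the PR settled on):

    from langchain.utilities import GoogleSearchAPIWrapper

    try:
        search = GoogleSearchAPIWrapper()
    except Exception as e:
        # Raise so callers actually see the misconfiguration
        raise RuntimeError(f"Failed to set up Google Search: {e}") from e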

try:
    os.environ["GOOGLE_CSE_ID"] = self.GOOGLE_CSE_ID
    os.environ["GOOGLE_API_KEY"] = self.GOOGLE_API_KEY
    search = GoogleSearchAPIWrapper()
Contributor:

this should be passed in, so they can configure it outside
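A sketch of that suggestion, mirroring the pydantic field the PR later adds (the class body is a simplified stand-in for the real retriever):

    from langchain.utilities import GoogleSearchAPIWrapper
    from pydantic import BaseModel, Field

    class WebResearchRetriever(BaseModel):  # simplified stand-in
        # Caller builds and configures the wrapper (API key, CSE id) outside
        search: GoogleSearchAPIWrapper = Field(..., description="Google Search API Wrapper")

    # Usage: credentials live where the wrapper is constructed, not in the retriever
    # retriever = WebResearchRetriever(search=GoogleSearchAPIWrapper())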


# Get search questions
logger.info("Generating questions for Google Search ...")
llm_chain = LLMChain(
Contributor:

the llm_chain should be an attribute on this class; it can be constructed from a class method
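A sketch of the suggested pattern: the chain becomes a class attribute and is built once in a classmethod. The prompt here is a hypothetical stand-in, and from_llm follows the usual LangChain naming convention rather than the PR's final signature:

    from langchain.chains import LLMChain
    from langchain.prompts import PromptTemplate
    from langchain.schema import BaseLanguageModel
    from pydantic import BaseModel

    # Hypothetical stand-in for the module-level search prompt
    SEARCH_PROMPT = PromptTemplate.from_template(
        "Write three Google search queries that answer: {question}"
    )

    class WebResearchRetriever(BaseModel):  # simplified stand-in
        llm_chain: LLMChain

        @classmethod
        def from_llm(cls, llm: BaseLanguageModel, **kwargs) -> "WebResearchRetriever":
            # Construct the chain once here, not inside every retrieval call
            return cls(llm_chain=LLMChain(llm=llm, prompt=SEARCH_PROMPT), **kwargs)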

# This can use rate limit w/ embedding
logger.info("Grabbing most relevant splits from urls ...")
filtered_splits = []
text_splitter = RecursiveCharacterTextSplitter(
Contributor:

this should be an argument on the class, so the user can configure it
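A sketch of that change, using the default the PR ends up adopting (chunk_size=1500, chunk_overlap=50, per the later diff):

    from langchain.text_splitter import RecursiveCharacterTextSplitter, TextSplitter
    from pydantic import BaseModel, Field

    class WebResearchRetriever(BaseModel):  # simplified stand-in
        # User-configurable, with a default instead of a hard-coded splitter
        text_splitter: TextSplitter = Field(
            RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=50),
            description="Text splitter for splitting web pages into chunks",
        )

        class Config:
            arbitrary_types_allowed = True  # TextSplitter is not a pydantic model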

try:
    text = await response.text()
except UnicodeDecodeError:
    print(f"Failed to decode content from {url}")
Contributor:

use logger not print
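A sketch of the swap, wrapped in a function so it stands alone (fetch_text is a hypothetical helper name):

    import logging
    from typing import Optional

    logger = logging.getLogger(__name__)

    async def fetch_text(response, url: str) -> Optional[str]:
        try:
            return await response.text()
        except UnicodeDecodeError:
            # Module logger instead of print, so it respects logging config
            logger.warning("Failed to decode content from %s", url)
            return None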


)
llm_chain: LLMChain
search: GoogleSearchAPIWrapper = Field(..., description="Google Search API Wrapper")
search_prompt: PromptTemplate = Field(
Contributor:

I don't think this is needed on the base class anymore.

)


DEFAULT_SEARCH_PROMPT = PromptTemplate(
Contributor:

This is very Llama 2-specific, right? Could we use a prompt selector for this?

@rlancemartin (Collaborator, author) commented Jul 24, 2023:

Ya, I chatted w/ @efriis about this in passing; good motivation for improved prompt management. Also a good area to add more documentation; I couldn't find this in the Python docs, but did find a reference to ConditionalPromptSelector in the JS docs. It seems we could use ConditionalPromptSelector.

@rlancemartin (Collaborator, author) commented Jul 24, 2023:

it appears we can do it like this -

PROMPT_SELECTOR = ConditionalPromptSelector(
    default_prompt=DEFAULT_SEARCH_PROMPT,
    conditionals=[(isinstance(llm, LlamaCpp), DEFAULT_LLAMA_SEARCH_PROMPT)],
)

Edit: actually, the above errors out; you can't pass isinstance(llm, LlamaCpp) directly as a conditional. The examples use the imported is_chat_model, but that doesn't work for this case.

but of course this also works -

if isinstance(llm, LlamaCpp):
    prompt = DEFAULT_LLAMA_SEARCH_PROMPT
    print("Using LlamaCpp")
else:
    prompt = DEFAULT_SEARCH_PROMPT

More docs / details on what using a ConditionalPromptSelector buys us would be helpful.

@efriis (Member) commented Jul 25, 2023:

Could you do

PROMPT_SELECTOR = ConditionalPromptSelector(
    default_prompt=DEFAULT_SEARCH_PROMPT, conditionals=[(lambda llm: isinstance(llm, LlamaCpp), DEFAULT_LLAMA_SEARCH_PROMPT)]
)

From the code, looks like the first element of the conditional has to be a callable taking in a language model and returning a bool. https://github.com/langchain-ai/langchain/blob/00de334f81abddf4ce6e46a931a505fa21cf7d98/libs/langchain/langchain/chains/prompt_selector.py#L26C10-L26C10

@rlancemartin (Collaborator, author) commented Jul 25, 2023:

Ya, from discussion w/ Harrison, usage appears to be:

QUESTION_PROMPT_SELECTOR = ConditionalPromptSelector(
    default_prompt=DEFAULT_SEARCH_PROMPT,
    conditionals=[
        (lambda llm: isinstance(llm, LlamaCpp), DEFAULT_LLAMA_SEARCH_PROMPT)
    ],
)
prompt = QUESTION_PROMPT_SELECTOR.get_prompt(llm)

WebResearchRetriever
"""

if isinstance(llm, LlamaCpp):
Contributor:

we still want to let the user pass in a prompt

this should be:

prompt: Optional[PromptTemplate] = None

and then you should be able to use the prompt selector as @efriis coded
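A sketch combining both suggestions (the two DEFAULT_* prompts are stubbed here as hypothetical placeholders for the PR's module-level prompts): a user-supplied prompt wins; otherwise the selector picks based on the LLM type:

    from typing import Optional

    from langchain.chains.prompt_selector import ConditionalPromptSelector
    from langchain.llms import LlamaCpp
    from langchain.prompts import PromptTemplate
    from langchain.schema import BaseLanguageModel

    # Hypothetical stand-ins for the PR's module-level prompts
    DEFAULT_SEARCH_PROMPT = PromptTemplate.from_template(
        "Write a Google search query for: {question}"
    )
    DEFAULT_LLAMA_SEARCH_PROMPT = PromptTemplate.from_template(
        "<<SYS>> You write Google search queries. <</SYS>> [INST] {question} [/INST]"
    )

    def resolve_prompt(
        llm: BaseLanguageModel, prompt: Optional[PromptTemplate] = None
    ) -> PromptTemplate:
        if prompt is not None:
            return prompt  # user-supplied prompt takes precedence
        selector = ConditionalPromptSelector(
            default_prompt=DEFAULT_SEARCH_PROMPT,
            conditionals=[
                (lambda llm: isinstance(llm, LlamaCpp), DEFAULT_LLAMA_SEARCH_PROMPT)
            ],
        )
        return selector.get_prompt(llm)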

RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=50),
description="Text splitter for splitting web pages into chunks",
)
urls: List[str] = Field(
@rlancemartin (author):

remove

"""Returns num_serch_results pages per Google search."""
try:
result = self.search.results(query, num_search_results)
except Exception as e:
@rlancemartin (author):

doesn't seem necessary

        raise Exception(f"Error: {str(e)}")
    return result
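A sketch of the simplification this points at: drop the try/except that re-wraps the error as a bare Exception and let failures from the wrapper propagate with their original traceback (the free-function form is illustrative):

    from typing import List

    from langchain.utilities import GoogleSearchAPIWrapper

    def search_tool(
        search: GoogleSearchAPIWrapper, query: str, num_search_results: int = 1
    ) -> List[dict]:
        """Returns num_search_results pages per Google search."""
        # No try/except: a failure here is more useful unwrapped
        return search.results(query, num_search_results)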

def get_urls(self) -> List[str]:
@rlancemartin (author):

delete

text_splitter = self.text_splitter
for doc in html2text.transform_documents(loader.load()):
    doc_splits = text_splitter.split_documents([doc])
    # Proect against very large documents
@rlancemartin (author):

typo: "Proect" should be "Protect"

logger.info("Grabbing most relevant splits from urls ...")
filtered_splits = []
text_splitter = self.text_splitter
for doc in html2text.transform_documents(loader.load()):
@rlancemartin (author):

let's remove

logger.info("Grabbing most relevant splits from urls ...")
_splits = []
text_splitter = self.text_splitter
for doc in html2text.transform_documents(loader.load()):
Contributor:

this can just be:

docs = loader.load()
docs = html2text.transform_documents(docs)
docs = self.text_splitter.split_documents(docs)

self.vectorstore.add_documents(_splits)
self.url_database.extend(new_urls)

# Search for relevant splits
Contributor:

Can you add a TODO to make this async?

@rlancemartin merged commit 7a00f17 into langchain-ai:master on Jul 26, 2023
22 checks passed
hinthornw pushed a commit referencing this pull request on Jul 27, 2023 (co-authored by Harrison Chase).