Web research retriever #8102

Merged · 12 commits merged into langchain-ai:master on Jul 26, 2023

Conversation

@rlancemartin (Collaborator) commented Jul 21, 2023:

Given a user question, this will:

  • Use an LLM to generate a set of search queries.
  • Run a search for each query.
  • Store the URLs from the search results in self.urls.
  • Check for any new URLs that haven't been processed yet (i.e., not in self.url_database).
  • Load, transform, and add only these new URLs to the vectorstore.
  • Query the vectorstore for documents relevant to the questions generated by the LLM.
  • Return only unique documents as the final result.

This avoids reprocessing URLs across multiple runs of similar queries, which should improve the retriever's performance. It also keeps track of every URL that has been processed, which can be useful for debugging or for understanding the retriever's behavior.
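Below is a minimal sketch of that flow. The helper names (generate_queries, search, load_and_split) and the bare vectorstore/url_database parameters are illustrative stand-ins for the components the retriever actually wires together, not the merged implementation:

    from typing import Callable, List

    def web_research(
        question: str,
        generate_queries: Callable[[str], List[str]],  # LLM-backed query generation
        search: Callable[[str], List[str]],            # returns result URLs for one query
        load_and_split: Callable[[List[str]], list],   # load pages, transform, split into docs
        vectorstore,                                   # supports add_documents / similarity_search
        url_database: List[str],                       # URLs processed on previous runs
    ) -> list:
        # 1. Use the LLM to generate a set of search queries
        queries = generate_queries(question)
        # 2. Search for each query and collect result URLs
        urls = [url for q in queries for url in search(q)]
        # 3. Index only URLs that haven't been processed yet
        new_urls = [u for u in urls if u not in url_database]
        if new_urls:
            vectorstore.add_documents(load_and_split(new_urls))
            url_database.extend(new_urls)
        # 4. Query the vectorstore per generated question, then deduplicate
        docs = [d for q in queries for d in vectorstore.similarity_search(q)]
        seen, unique_docs = set(), []
        for d in docs:
            if d.page_content not in seen:
                seen.add(d.page_content)
                unique_docs.append(d)
        return unique_docs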


@dosubot added labels on Jul 21, 2023: Ɑ: embeddings (Related to text embedding models module), 🤖:enhancement (A large net-new component, integration, or chain).
os.environ["GOOGLE_API_KEY"] = self.GOOGLE_API_KEY
search = GoogleSearchAPIWrapper()
except Exception as e:
print(f"Error: {str(e)}")
Contributor:

stray print

don't we want to raise here?
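A sketch of the suggested change, letting the failure surface instead of printing it (wrapping in RuntimeError is one option, not necessarily what the PR settled on):

    from langchain.utilities import GoogleSearchAPIWrapper

    try:
        search = GoogleSearchAPIWrapper()
    except Exception as e:
        # Raise so callers actually see the misconfiguration
        raise RuntimeError(f"Failed to set up Google Search: {e}") from e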

try:
    os.environ["GOOGLE_CSE_ID"] = self.GOOGLE_CSE_ID
    os.environ["GOOGLE_API_KEY"] = self.GOOGLE_API_KEY
    search = GoogleSearchAPIWrapper()
Contributor:

this should be passed in, so they can configure it outside
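A sketch of that suggestion, mirroring the pydantic field the PR later adds (the class body is a simplified stand-in for the real retriever):

    from langchain.utilities import GoogleSearchAPIWrapper
    from pydantic import BaseModel, Field

    class WebResearchRetriever(BaseModel):  # simplified stand-in
        # Caller builds and configures the wrapper (API key, CSE id) outside
        search: GoogleSearchAPIWrapper = Field(..., description="Google Search API Wrapper")

    # Usage: credentials live where the wrapper is constructed, not in the retriever
    # retriever = WebResearchRetriever(search=GoogleSearchAPIWrapper())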


# Get search questions
logger.info("Generating questions for Google Search ...")
llm_chain = LLMChain(
Contributor:

the llm_chain should be an attribute on this class; it can be constructed from a class method
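A sketch of the suggested pattern: the chain becomes a class attribute and is built once in a classmethod. The prompt here is a hypothetical stand-in, and from_llm follows the usual LangChain naming convention rather than the PR's final signature:

    from langchain.chains import LLMChain
    from langchain.prompts import PromptTemplate
    from langchain.schema import BaseLanguageModel
    from pydantic import BaseModel

    # Hypothetical stand-in for the module-level search prompt
    SEARCH_PROMPT = PromptTemplate.from_template(
        "Write three Google search queries that answer: {question}"
    )

    class WebResearchRetriever(BaseModel):  # simplified stand-in
        llm_chain: LLMChain

        @classmethod
        def from_llm(cls, llm: BaseLanguageModel, **kwargs) -> "WebResearchRetriever":
            # Construct the chain once here, not inside every retrieval call
            return cls(llm_chain=LLMChain(llm=llm, prompt=SEARCH_PROMPT), **kwargs)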

# This can use rate limit w/ embedding
logger.info("Grabbing most relevant splits from urls ...")
filtered_splits = []
text_splitter = RecursiveCharacterTextSplitter(
Contributor:

this should be an argument on the class, so the user can configure it
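A sketch of that change, using the default the PR ends up adopting (chunk_size=1500, chunk_overlap=50, per the later diff):

    from langchain.text_splitter import RecursiveCharacterTextSplitter, TextSplitter
    from pydantic import BaseModel, Field

    class WebResearchRetriever(BaseModel):  # simplified stand-in
        # User-configurable, with a default instead of a hard-coded splitter
        text_splitter: TextSplitter = Field(
            RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=50),
            description="Text splitter for splitting web pages into chunks",
        )

        class Config:
            arbitrary_types_allowed = True  # TextSplitter is not a pydantic model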

try:
    text = await response.text()
except UnicodeDecodeError:
    print(f"Failed to decode content from {url}")
Contributor:

use logger not print
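A sketch of the swap, wrapped in a function so it stands alone (fetch_text is a hypothetical helper name):

    import logging
    from typing import Optional

    logger = logging.getLogger(__name__)

    async def fetch_text(response, url: str) -> Optional[str]:
        try:
            return await response.text()
        except UnicodeDecodeError:
            # Module logger instead of print, so it respects logging config
            logger.warning("Failed to decode content from %s", url)
            return None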


)
llm_chain: LLMChain
search: GoogleSearchAPIWrapper = Field(..., description="Google Search API Wrapper")
search_prompt: PromptTemplate = Field(
Contributor:

I don't think this is needed on the base class anymore.

)


DEFAULT_SEARCH_PROMPT = PromptTemplate(
Contributor:

This is very Llama 2-specific, right? Could we use a prompt selector for this?

@rlancemartin (Collaborator, author) commented Jul 24, 2023:

Ya, I chatted w/ @efriis about this in passing; good motivation for improved prompt management. Also a good area to add more documentation; I couldn't find this in the Python docs, but did find a reference to ConditionalPromptSelector in the JS docs. It seems we could use ConditionalPromptSelector.

@rlancemartin (Collaborator, author) commented Jul 24, 2023:

it appears we can do it like this -

PROMPT_SELECTOR = ConditionalPromptSelector(
    default_prompt=DEFAULT_SEARCH_PROMPT,
    conditionals=[(isinstance(llm, LlamaCpp), DEFAULT_LLAMA_SEARCH_PROMPT)],
)

Edit: actually, the above errors out; you can't pass isinstance(llm, LlamaCpp) directly as a conditional. The examples use the imported is_chat_model, but that doesn't work for this case.

but of course this also works -

if isinstance(llm, LlamaCpp):
    prompt = DEFAULT_LLAMA_SEARCH_PROMPT
    print("Using LlamaCpp")
else:
    prompt = DEFAULT_SEARCH_PROMPT

More docs / details on what using a ConditionalPromptSelector buys us would be helpful.

@efriis (Member) commented Jul 25, 2023:

Could you do

PROMPT_SELECTOR = ConditionalPromptSelector(
    default_prompt=DEFAULT_SEARCH_PROMPT, conditionals=[(lambda llm: isinstance(llm, LlamaCpp), DEFAULT_LLAMA_SEARCH_PROMPT)]
)

From the code, looks like the first element of the conditional has to be a callable taking in a language model and returning a bool. https://github.com/langchain-ai/langchain/blob/00de334f81abddf4ce6e46a931a505fa21cf7d98/libs/langchain/langchain/chains/prompt_selector.py#L26C10-L26C10

@rlancemartin (Collaborator, author) commented Jul 25, 2023:

Ya, from discussion w/ Harrison, usage appears to be:

QUESTION_PROMPT_SELECTOR = ConditionalPromptSelector(
    default_prompt=DEFAULT_SEARCH_PROMPT,
    conditionals=[
        (lambda llm: isinstance(llm, LlamaCpp), DEFAULT_LLAMA_SEARCH_PROMPT)
    ],
)
prompt = QUESTION_PROMPT_SELECTOR.get_prompt(llm)

WebResearchRetriever
"""

if isinstance(llm, LlamaCpp):
Contributor:

we still want to let the user pass in a prompt

this should be:

prompt: Optional[PromptTemplate] = None

and then you should be able to use the prompt selector as @efriis coded
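A sketch combining both suggestions (the two DEFAULT_* prompts are stubbed here as hypothetical placeholders for the PR's module-level prompts): a user-supplied prompt wins; otherwise the selector picks based on the LLM type:

    from typing import Optional

    from langchain.chains.prompt_selector import ConditionalPromptSelector
    from langchain.llms import LlamaCpp
    from langchain.prompts import PromptTemplate
    from langchain.schema import BaseLanguageModel

    # Hypothetical stand-ins for the PR's module-level prompts
    DEFAULT_SEARCH_PROMPT = PromptTemplate.from_template(
        "Write a Google search query for: {question}"
    )
    DEFAULT_LLAMA_SEARCH_PROMPT = PromptTemplate.from_template(
        "<<SYS>> You write Google search queries. <</SYS>> [INST] {question} [/INST]"
    )

    def resolve_prompt(
        llm: BaseLanguageModel, prompt: Optional[PromptTemplate] = None
    ) -> PromptTemplate:
        if prompt is not None:
            return prompt  # user-supplied prompt takes precedence
        selector = ConditionalPromptSelector(
            default_prompt=DEFAULT_SEARCH_PROMPT,
            conditionals=[
                (lambda llm: isinstance(llm, LlamaCpp), DEFAULT_LLAMA_SEARCH_PROMPT)
            ],
        )
        return selector.get_prompt(llm)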

RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=50),
description="Text splitter for splitting web pages into chunks",
)
urls: List[str] = Field(
@rlancemartin (author):

remove

"""Returns num_serch_results pages per Google search."""
try:
result = self.search.results(query, num_search_results)
except Exception as e:
@rlancemartin (author):

doesn't seem necessary

        raise Exception(f"Error: {str(e)}")
    return result
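A sketch of the simplification this points at: drop the try/except that re-wraps the error as a bare Exception and let failures from the wrapper propagate with their original traceback (the free-function form is illustrative):

    from typing import List

    from langchain.utilities import GoogleSearchAPIWrapper

    def search_tool(
        search: GoogleSearchAPIWrapper, query: str, num_search_results: int = 1
    ) -> List[dict]:
        """Returns num_search_results pages per Google search."""
        # No try/except: a failure here is more useful unwrapped
        return search.results(query, num_search_results)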

def get_urls(self) -> List[str]:
@rlancemartin (author):

delete

text_splitter = self.text_splitter
for doc in html2text.transform_documents(loader.load()):
    doc_splits = text_splitter.split_documents([doc])
    # Proect against very large documents
@rlancemartin (author):

typo: "Proect" should be "Protect"

logger.info("Grabbing most relevant splits from urls ...")
filtered_splits = []
text_splitter = self.text_splitter
for doc in html2text.transform_documents(loader.load()):
@rlancemartin (author):

let's remove

logger.info("Grabbing most relevant splits from urls ...")
_splits = []
text_splitter = self.text_splitter
for doc in html2text.transform_documents(loader.load()):
Contributor:

this can just be:

docs = loader.load()
docs = html2text.transform_documents(docs)
docs = self.text_splitter.split_documents(docs)

self.vectorstore.add_documents(_splits)
self.url_database.extend(new_urls)

# Search for relevant splits
Contributor:

Can you add a TODO to make this async?

@rlancemartin merged commit 7a00f17 into langchain-ai:master on Jul 26, 2023
22 checks passed
hinthornw pushed a commit referencing this pull request on Jul 27, 2023 (co-authored by Harrison Chase).