# docGPT

docGPT is a powerful, locally run tool for document summarization and querying using OpenAI's models. You retain **full control** over your documents and API keys, ensuring privacy and security. No paying third parties, no trusting sites with your API key, and no uploading sensitive documents to unknown sources.

# Setup
1. Clone the repository
2. Install the requirements: `pip install -r requirements.txt`
3. Set your OpenAI API key in `test.env` (a loading sketch follows this list)
4. Navigate to the project directory and run `streamlit run main.py`
5. Add your PDFs or .txt files to the `documents` folder in the project directory
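
How the key is read is repo-specific, but with `python-dotenv` it would look roughly like the sketch below; the variable name `OPENAI_API_KEY` is an assumption, so check `test.env` for the name the code actually expects.

```python
import os

from dotenv import load_dotenv

# test.env is assumed to contain a line like: OPENAI_API_KEY=sk-...
load_dotenv("test.env")
api_key = os.getenv("OPENAI_API_KEY")
```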

# How it works
## Chat
1. Creating and saving embeddings - once you load a file, it is split into chunks and stored as a FAISS index in the `embeddings` folder. The saved index is reused if you load the same document into the chat again (see the first sketch after this list).
2. Retrieving, ranking, and processing results - a similarity search is run against the index to fetch the top *n* results. These results are then re-ranked by a function that filters the original query and uses fuzzy matching to score how closely the exact words in each result match the query (see the second sketch after this list).
3. Output - the re-ranked results and the user query are passed to the LLM, and the response is displayed.
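
A minimal sketch of step 1, assuming the LangChain FAISS wrapper and OpenAI embeddings; the chunk sizes, splitter, and index naming in the repo's own code may differ.

```python
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS

def build_or_load_index(text: str, index_name: str) -> FAISS:
    """Split a document into chunks, embed them, and cache the FAISS index on disk."""
    embeddings = OpenAIEmbeddings()
    index_path = f"embeddings/{index_name}"
    try:
        # Reuse the saved index if this document was loaded before
        return FAISS.load_local(index_path, embeddings)
    except Exception:
        splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
        chunks = splitter.split_text(text)
        index = FAISS.from_texts(chunks, embeddings)
        index.save_local(index_path)
        return index
```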
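
And a sketch of steps 2 and 3: similarity search, fuzzy re-ranking, and the final LLM call. The re-ranking below uses `rapidfuzz` purely as an illustration; the repo's own scoring function may filter and weight the query differently.

```python
import openai
from rapidfuzz import fuzz

def answer_query(index, query: str, k: int = 6) -> str:
    # Step 2: retrieve the top-k chunks, then re-rank by exact-word similarity to the query
    docs = index.similarity_search(query, k=k)
    ranked = sorted(docs, key=lambda d: fuzz.token_set_ratio(query, d.page_content), reverse=True)
    context = "\n\n".join(d.page_content for d in ranked)

    # Step 3: pass the re-ranked context and the user query to the LLM
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ],
    )
    return response.choices[0].message["content"]
```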

## Summarization
1. Input - handles both documents and YouTube URLs; for a URL, the video transcript is retrieved and the summary is generated from it (a transcript-fetching sketch follows this list).
2. Processing and embedding - before embedding, documents are stripped of any special tokens that might cause errors. Documents are embedded in chunks whose size varies with the overall length of the document.
3. Clustering - once the chunks are embedded, they are grouped into clusters using the K-means algorithm. The number of clusters can be fixed (10) or chosen automatically by finding the optimal number with the elbow method. The embedding closest to each cluster centroid is retrieved: each cluster should represent a distinct theme or idea, and the retrieved embeddings are the ones that best encapsulate it (see the clustering sketch after this list).
4. Summarization - summarization happens in two steps. First, each retrieved embedding is matched with its corresponding text chunk, and each chunk is sent to GPT-3.5 in its own API call, with the calls made in parallel. Once a summary has been produced for every chunk, the chunk summaries are passed to GPT-3.5 or GPT-4 for the final summary (see the last sketch after this list).
5. Output - the summary is displayed on the page and saved as a text file.
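
For YouTube input, the transcript can be fetched with `youtube-transcript-api`; this is an illustrative sketch, and the repo may use a different library or handle other URL formats.

```python
from urllib.parse import parse_qs, urlparse

from youtube_transcript_api import YouTubeTranscriptApi

def fetch_transcript(url: str) -> str:
    """Extract the video ID from a standard watch?v= URL and return the transcript as plain text."""
    video_id = parse_qs(urlparse(url).query)["v"][0]
    segments = YouTubeTranscriptApi.get_transcript(video_id)
    return " ".join(segment["text"] for segment in segments)
```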
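
A sketch of the clustering step, assuming scikit-learn's K-means; the elbow-method search and the exact chunk sizing live in the repo's own code and may differ.

```python
import numpy as np
from sklearn.cluster import KMeans

def pick_representative_chunks(embeddings: np.ndarray, chunks: list[str], n_clusters: int = 10) -> list[str]:
    """Cluster chunk embeddings and return the chunk closest to each cluster centroid."""
    kmeans = KMeans(n_clusters=n_clusters, n_init=10).fit(embeddings)
    representatives = []
    for centroid in kmeans.cluster_centers_:
        # The chunk whose embedding is nearest to the centroid best represents that cluster's theme
        closest = int(np.argmin(np.linalg.norm(embeddings - centroid, axis=1)))
        representatives.append(chunks[closest])
    return representatives
```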
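
And a sketch of the two-step summarization: per-chunk summaries generated in parallel, then one combining call. The prompts and model names here are placeholders, not the repo's exact ones.

```python
from concurrent.futures import ThreadPoolExecutor

import openai

def _summarize(text: str, model: str = "gpt-3.5-turbo") -> str:
    """Summarize a single piece of text with one API call."""
    response = openai.ChatCompletion.create(
        model=model,
        messages=[{"role": "user", "content": f"Summarize the following passage:\n\n{text}"}],
    )
    return response.choices[0].message["content"]

def summarize_document(chunks: list[str], final_model: str = "gpt-4") -> str:
    # First pass: summarize each representative chunk in parallel API calls
    with ThreadPoolExecutor() as pool:
        chunk_summaries = list(pool.map(_summarize, chunks))
    # Second pass: combine the chunk summaries into the final summary
    return _summarize("\n\n".join(chunk_summaries), model=final_model)
```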
