docGPT

docGPT is a locally run tool for summarizing and querying documents with OpenAI's models. You keep control of your documents and your API key: there is no third-party service to pay, no site to trust with your API key, and no uploading of sensitive documents to unknown sources.

Setup

Clone the repository.
Install the requirements: pip install -r requirements.txt
Set your API key in test.env.
Navigate to the project directory and run: streamlit run main.py
Add your PDF or .txt files to the documents folder in the project directory.

How it works

Chat

Creating and saving embeddings - once you load a file, it is split into chunks, and the chunk embeddings are stored as a FAISS index in the 'embeddings' folder. The saved index is reused if you load the same document into the chat again.
Retrieving, ranking, and processing results - a similarity search on the index returns the top n results. These results are then re-ranked by a function that filters the original query and uses fuzzy matching to score how closely the exact words of the query match each retrieved result.
Output - the re-ranked results and the user query are passed to the LLM, and the response is displayed.
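
As a rough illustration of the re-ranking step, here is a minimal sketch that uses only the Python standard library. It is not the project's actual code: the stop-word list, the function names (filter_query, fuzzy_score, rerank) and the scoring rule (average best per-word match from difflib.SequenceMatcher) are all assumptions made for this example.

```python
from difflib import SequenceMatcher

# Hypothetical stop-word list; the filter used by the project is not shown in this README.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "in", "to", "what", "how", "does", "do"}
PUNCTUATION = ".,?!;:\"'()"


def filter_query(query: str) -> list[str]:
    """Lowercase the query, strip punctuation, and drop stop words."""
    words = (w.strip(PUNCTUATION) for w in query.lower().split())
    return [w for w in words if w and w not in STOP_WORDS]


def fuzzy_score(keywords: list[str], chunk: str) -> float:
    """Average, over the keywords, of the best fuzzy match against any word in the chunk."""
    chunk_words = [w.strip(PUNCTUATION) for w in chunk.lower().split()]
    chunk_words = [w for w in chunk_words if w]
    if not keywords or not chunk_words:
        return 0.0
    total = 0.0
    for keyword in keywords:
        total += max(SequenceMatcher(None, keyword, w).ratio() for w in chunk_words)
    return total / len(keywords)


def rerank(query: str, chunks: list[str]) -> list[str]:
    """Order retrieved chunks from best to worst word-level match with the query."""
    keywords = filter_query(query)
    return sorted(chunks, key=lambda chunk: fuzzy_score(keywords, chunk), reverse=True)


# Stand-in for the top n results of the similarity search.
retrieved = [
    "The invoice is due in thirty days.",
    "Clustering groups similar embeddings together.",
    "Embeddings are stored in a FAISS index.",
]
reranked = rerank("How are embeddings clustered?", retrieved)
```

In this sketch the similarity search decides which chunks are candidates, and the fuzzy score only changes their order before they are passed to the LLM.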

Summarization

Input - accepts both documents and YouTube URLs. For a YouTube URL, the transcript is retrieved and the summary is generated from it.
Processing and embedding - before embedding, documents are stripped of any special tokens that might cause errors. Documents are then embedded in chunks whose size depends on the overall document's size.
Clustering - once the chunks are embedded, the embeddings are grouped into clusters using the K-means algorithm. The number of clusters can be fixed (10) or variable (chosen with the elbow method). The embedding closest to each cluster centroid is then retrieved. The goal is that each cluster represents a different theme or idea, and that the retrieved embedding is the one that best captures it.
Summarization - summarization is performed in two steps. First, each retrieved embedding is matched with its corresponding text chunk, and each chunk is passed to GPT-3.5 in its own API call; these calls are made in parallel. Once there is a summary for each chunk, the summaries are passed to GPT-3.5 or GPT-4 to produce the final summary.
Output - the summary is displayed on the page and saved as a text file.
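
The clustering step can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the project's implementation: the kmeans and representative_indices functions are hypothetical, the toy 2-D points stand in for real embedding vectors, and the number of clusters is simply passed in rather than chosen by the elbow method. In practice a library implementation of K-means would normally replace the hand-written loop.

```python
import math
import random


def distance(p, q):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm; returns the k centroids."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster; keep it if the cluster is empty.
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids


def representative_indices(points, centroids):
    """Index of the point closest to each centroid, returned in document order."""
    picked = {
        min(range(len(points)), key=lambda j: distance(points[j], c))
        for c in centroids
    }
    return sorted(picked)


# Toy "embeddings": two well-separated groups of three points each.
embeddings = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids = kmeans(embeddings, k=2)
representatives = representative_indices(embeddings, centroids)
```

The indices in representatives point back to the text chunks that produced those embeddings, which is how the chunks for the first summarization step would be selected in this sketch.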
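
The two-step summarization can be sketched with a thread pool from the standard library. The functions below are placeholders invented for this example: summarize_chunk and combine_summaries stand in for the per-chunk and final model calls and make no network requests, so the sketch shows only the control flow, not the prompts or the API usage.

```python
from concurrent.futures import ThreadPoolExecutor


def summarize_chunk(chunk: str) -> str:
    """Placeholder for one per-chunk model call; here it just keeps the first sentence."""
    return chunk.split(".")[0].strip() + "."


def combine_summaries(summaries: list[str]) -> str:
    """Placeholder for the final model call that merges the partial summaries."""
    return " ".join(summaries)


def summarize_document(chunks: list[str], max_workers: int = 4) -> str:
    # Step 1: summarize each representative chunk concurrently.
    # executor.map yields results in input order, so document order is preserved.
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        partial_summaries = list(executor.map(summarize_chunk, chunks))
    # Step 2: a single final call over the collected partial summaries.
    return combine_summaries(partial_summaries)


chunks = [
    "Cats sleep a lot. They nap all day.",
    "Dogs like walks. They fetch too.",
]
final_summary = summarize_document(chunks)
```

Threads suit this step because each real call would spend most of its time waiting on the network, so running the per-chunk calls concurrently shortens the total wait without changing the result.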

e-johnstonn/BriefGPT
