[Multi-User Part 1]: Enable storage of settings for plaintext files based on user account #498

sabaimran · 2023-10-11T01:03:50Z

Incoming

- Partition configuration for indexing local data based on user accounts
- Store indexed data in an underlying Postgres DB using the pgvector extension
- Add migrations for all relevant user data and embeddings generation. Very little performance optimization has been done for the lookup time
- Apply filters using SQL queries
- Start removing many server-level configuration settings
- Configure GitHub test actions to run during any PR. Update the test action to run in a containerized environment with a DB
- Update the Docker image and docker-compose.yml to work with the new application design
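To make the per-user partitioning and vector lookup concrete, here is a minimal in-memory sketch. It is not the PR's implementation: the `Entry` class, its `user_id` field, and the `search` function are hypothetical names, and a plain Python list stands in for the Postgres table. The distance function computes cosine distance, which is the quantity pgvector's `<=>` operator returns.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Entry:
    user_id: int            # hypothetical owner column used to partition data
    text: str
    embedding: List[float]  # stand-in for a pgvector column


def cosine_distance(a: List[float], b: List[float]) -> float:
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def search(entries: List[Entry], user_id: int, query: List[float], limit: int = 5) -> List[Entry]:
    # Restrict to the requesting user's rows first, then rank by distance.
    owned = [entry for entry in entries if entry.user_id == user_id]
    return sorted(owned, key=lambda entry: cosine_distance(entry.embedding, query))[:limit]


entries = [
    Entry(1, "a", [1.0, 0.0]),
    Entry(1, "b", [0.0, 1.0]),
    Entry(2, "c", [1.0, 0.0]),
]
results = search(entries, user_id=1, query=[0.9, 0.1])
print([entry.text for entry in results])
```

In a database-backed version, the list comprehension would presumably become a `WHERE` clause on the owner column and the sort an `ORDER BY` on the distance expression.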

Closes #466, Closes #345

Closes #195. In my local analysis, memory consumption on my machine with the prior setup and my local org notes was around 10-12 GB of RAM. With the pgvector integration, it's around 2-3 GB.

gitguardian · 2023-10-15T02:26:23Z

⚠️ GitGuardian has uncovered 2 secrets following the scan of your pull request.

Please consider investigating the findings and remediating the incidents. Failure to do so may lead to compromising the associated services or software components.

🔎 Detected hardcoded secrets in your pull request

| GitGuardian id | Secret | Commit | Filename |
| --- | --- | --- | --- |
| 8175313 | Django Secret Key | `869c37f` | src/app/settings.py |
| 8175313 | Django Secret Key | `6fa925e` | src/app/settings.py |

🛠 Guidelines to remediate hardcoded secrets

Understand the implications of revoking this secret by investigating where it is used in your code.
Replace and store your secrets safely. Learn here the best practices.
Revoke and rotate these secrets.
If possible, rewrite git history. Rewriting git history is not a trivial act. You might completely break other contributing developers' workflow and you risk accidentally deleting legitimate data.

To avoid such incidents in the future, consider:

- following these best practices for managing and storing secrets, including API keys and other credentials
- installing secret detection on pre-commit to catch secrets before they leave your machine and ease remediation

debanjum

Thanks for all the work on this big set of changes! I did a review pass and left some comments, but it is not comprehensive.

debanjum · 2023-10-24T23:57:55Z

.github/workflows/test.yml

+        env:
+          DEBIAN_FRONTEND: noninteractive
+        run : |
+          apt install -y postgresql postgresql-client && apt install -y postgresql-server-dev-14


Suggested change

-          apt install -y postgresql postgresql-client && apt install -y postgresql-server-dev-14
+          apt install -y postgresql postgresql-client postgresql-server-dev-14

debanjum · 2023-10-24T23:58:16Z

.github/workflows/test.yml

        run: |
-          sudo apt update && sudo apt install -y libegl1
+          apt update && apt install -y libegl1 sqlite3 libsqlite3-dev libsqlite3-0


Is sqlite required here, given we're using postgres?

debanjum · 2023-10-24T23:59:16Z

.github/workflows/test.yml

@@ -43,17 +53,37 @@ jobs:
        with:
          python-version: ${{ matrix.python_version }}

+      - name: Install Git
+        run: |
+          apt update && apt install -y git


Is git required here? Is it because ubuntu-jammy doesn't have git pre-installed and pip needs it to figure out the git (tag) version of the codebase?

debanjum · 2023-10-25T00:01:18Z

.github/workflows/test.yml

-
-      - name: 🌡️ Validate Application
-        run: pre-commit run --hook-stage manual --all
+        run: sed -i 's/dynamic = \["version"\]/version = "0.0.0"/' pyproject.toml && pip install --upgrade .[dev]


Why does the version need to be reset here? I know pip install uses the git tag to figure out the version for khoj. But given that this was working earlier, what's changed now? Moving to ubuntu-jammy? This may be related to the comment above asking why git is required.

sabaimran · 2023-10-26

Correct. The tests now run in a containerized environment, which requires this additional snippet to include the version.
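For readers less familiar with `sed`, the substitution in the workflow step can be sketched in Python as follows. This is an illustration only: `pin_static_version` is a hypothetical helper and the sample `pyproject.toml` content is made up. It swaps the `dynamic = ["version"]` declaration for a fixed `version = "0.0.0"`, so that the build no longer has to derive the version from git metadata.

```python
import re


def pin_static_version(pyproject_text: str, version: str = "0.0.0") -> str:
    """Replace the dynamic version declaration with a static one.

    Uses the same pattern as the sed expression in the workflow. Note that
    re.sub replaces every match, while sed without the `g` flag replaces
    only the first match on each line.
    """
    return re.sub(r'dynamic = \["version"\]', f'version = "{version}"', pyproject_text)


# Made-up pyproject.toml fragment for illustration.
sample = '[project]\nname = "example"\ndynamic = ["version"]\n'
pinned = pin_static_version(sample)
print(pinned)
```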

debanjum · 2023-10-25T00:06:25Z

src/database/adapters/__init__.py

+    if not config:
+        return None
+    return config


Would the suggested code below be equivalent (and simpler), or can config also be False or some such?

Suggested change

-    if not config:
-        return None
-    return config
+    return config
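The two versions differ only when `config` is falsy but not `None`. A small sketch, using hypothetical function names rather than the adapter's real ones, makes the difference visible:

```python
from typing import Any


def get_config_original(config: Any) -> Any:
    # Mirrors the original snippet: every falsy value collapses to None.
    if not config:
        return None
    return config


def get_config_simplified(config: Any) -> Any:
    # Mirrors the suggested change: the value is returned untouched.
    return config


for value in (None, False, 0, "", [], {"key": "value"}):
    print(repr(value), "->", repr(get_config_original(value)), "vs", repr(get_config_simplified(value)))
```

If the lookup can only ever produce `None` or a truthy object, the simplified form behaves identically. If it could produce an empty container or `False`, callers that test `is None` would see different results.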

debanjum · 2023-10-25T00:14:32Z

src/database/adapters/__init__.py

+        if len(date_filters) > 0:
+            min_date, max_date = date_filters


Wouldn't this unpacking operation of date_filters into min_date, max_date fail if len(date_filters) == 1?

sabaimran · 2023-10-26

Good question, but the only possible values are None, an empty list, or a list of two elements.
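The reviewer's concern can be reproduced in isolation, along with one defensive way to make the two-element assumption explicit. `unpack_date_filters` is a hypothetical helper, not code from the PR:

```python
from typing import List, Optional, Tuple


def unpack_date_filters(date_filters: Optional[List[float]]) -> Optional[Tuple[float, float]]:
    """Return (min_date, max_date), or None when no date filter applies."""
    if not date_filters:
        # Covers both None and the empty list.
        return None
    if len(date_filters) != 2:
        raise ValueError(f"expected [min_date, max_date], got {len(date_filters)} value(s)")
    min_date, max_date = date_filters
    return min_date, max_date


# Bare tuple unpacking of a one-element list raises ValueError.
try:
    min_date, max_date = [1.0]
except ValueError as error:
    print("unpacking failed:", error)

print(unpack_date_filters([1.0, 2.0]))
```

With the invariant stated in the reply (None, empty, or exactly two values), the original `len(date_filters) > 0` check is sufficient. The explicit length check only matters if that invariant is ever broken upstream.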

debanjum · 2023-10-25T03:05:28Z

src/khoj/configure.py


    # Dynamically generate search type enum by merging core search types with configured plugin search types
-    return Enum("SearchType", merge_dicts(core_search_types, plugin_search_types))
+    return Enum("SearchType", merge_dicts(core_search_types, {}))


The merge_dicts can be removed, given the plugin search types are removed?

Suggested change

-    return Enum("SearchType", merge_dicts(core_search_types, {}))
+    return Enum("SearchType", core_search_types)
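A quick check that merging with an empty dict changes nothing for the generated enum. The `merge_dicts` below is a hypothetical stand-in for the project's helper (its exact semantics are assumed), and the member names are an illustrative subset rather than the real search types:

```python
from enum import Enum


def merge_dicts(priority_dict: dict, default_dict: dict) -> dict:
    # Assumed behaviour: keep priority entries, fill in any missing keys from defaults.
    merged = dict(priority_dict)
    for key, value in default_dict.items():
        merged.setdefault(key, value)
    return merged


# Illustrative subset; the real core search types may differ.
core_search_types = {"All": "all", "Org": "org", "Markdown": "markdown"}

Merged = Enum("SearchType", merge_dicts(core_search_types, {}))
Direct = Enum("SearchType", core_search_types)

same_members = [(m.name, m.value) for m in Merged] == [(m.name, m.value) for m in Direct]
print(same_members)
```

The functional `Enum` API accepts a mapping of names to values, so passing `core_search_types` directly is enough once there are no plugin types to merge in.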

- 0865416: Add better parsing for XML files
- f3acfac: Add a try/catch around the dateparser in order to avoid internal server errors in app
- 7d43cd6: Chunk embeddings generation in order to avoid large memory load
- e02d751: Addresses comments from PR #498
- a3f393e: Addresses comments from PR #503
- 66eb078: Addresses comments from PR #511
- Address various items in #527
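The idea of chunking embeddings generation to limit memory load can be sketched as below. This is not the code from commit 7d43cd6: `batched`, `embed_in_chunks`, the batch size, and `fake_embed` are all hypothetical, and the fake embedder only stands in for a real model call.

```python
from typing import Callable, Iterator, List, Sequence

Vector = List[float]


def batched(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_in_chunks(
    texts: Sequence[str],
    embed: Callable[[Sequence[str]], List[Vector]],
    batch_size: int = 1000,
) -> List[Vector]:
    # Only one batch of inputs is handed to the embedder at a time.
    embeddings: List[Vector] = []
    for batch in batched(texts, batch_size):
        embeddings.extend(embed(batch))
    return embeddings


def fake_embed(batch: Sequence[str]) -> List[Vector]:
    # Placeholder embedder: one-dimensional "vector" holding the text length.
    return [[float(len(text))] for text in batch]


vectors = embed_in_chunks(["a", "bb", "ccc"], fake_embed, batch_size=2)
print(vectors)
```

The results are still accumulated in one list here. Writing each batch to the database before moving on would reduce peak memory further.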

sabaimran added 30 commits September 13, 2023 17:57

Initial commit with functional django scaffolding

869c37f

Fix static files configuration to support relevant folders

291263a

Integrate django all auth for google sign in

70c037e

Remove django all auth, add fastapi auth routes

161e246

include auth routes in main app

243ee54

Merge branch 'master' of github.com:khoj-ai/khoj into features/multi-user

3acdc6b

Run migrations on app start

2222250

Fix merge conflicts in pyproject.toml

4bd2ede

Fix resolution in merge conflicts

684a3f0

Include httpx and itsdangerous modules for authlib

70ec6a2

Merge branch 'master' of github.com:khoj-ai/khoj into features/multi-user

bfb7aaa

Add concept of user authentication to the request session via GoogleUser

0a5062a

Add in google user migration

b4d81ce

Add in google user migration

b521d9f

Fix migration ordering issues

a65937e

Merge branch 'master' of github.com:khoj-ai/khoj into features/multi-user

da3bf5f

Add relation to config with github, notion configurations

79d666d

Merge branch 'master' of github.com:khoj-ai/khoj into features/multi-user

7d8ae35

Use authentication middleware and backend for authenticated khoj users

c69b548

Remove secrets and unnecessary files

6fa925e

Rename migrations to drop the Question, Answer test models

4cbfe17

Remove further trial scaffolding

940993f

Remove unnecessary code and parameters

e4b4668

Update unit tests and make Google auth credentials optional

e3b61a4

Init changes for using new DB tables and objects for saving user config

2d56eda

Merge branch 'master' of github.com:khoj-ai/khoj into features/use-config-with-multi-user

52bcfe0

Add migrations for embeddings support and read data from the vector DB

6d64506

- Write processed text data to the DB using the embeddings service
- Read data from the DB for search and chat
- Update all text to jsonl processors to use the embeddings service that writes to the DB

Simplify database migrations

c813ae9

Start updating indexer, api code to be user aware

e191db1

Read/write settings from github and use the user credentials when indexing data

da9a41d

sabaimran added 4 commits October 13, 2023 17:18

Update the docker setup to work with the new application design

a7ce54c

Fix retrieval of KhojUser from request

36c227e

Set type for query cache

615a633

Add basic instructions for using the new application setup

a6a4631

khoj-ai deleted a comment from gitguardian bot Oct 14, 2023

sabaimran changed the base branch from master to features/multi-user-support-khoj October 14, 2023 04:12

khoj-ai deleted a comment from gitguardian bot Oct 14, 2023

sabaimran mentioned this pull request Oct 15, 2023

[Multi-User]: Part 0 - Add support for logging in with Google #487

Merged

Add in cross encoder and rerank steps in the search path

be2176f

sabaimran added 10 commits October 14, 2023 19:52

Resolve merge conflicts after Part 0

d4c2305

Fix null check issues

4b37184

Resolve mypy linting issues for return type, parameters

9c9dbfa

Turn off telemetry if debug mode is enabled

ef18606

Move main.py back under /khoj, add a migration for HNSW index

5df9ce0

Update method to convert JSON config to DB objects

d71cc1a

Merge branch 'master' of github.com:khoj-ai/khoj into features/use-config-with-multi-user

654425c

Simplify some of the PDF parsing code and remove unused imports

e4652ba

Revert changes to PDF decoding and update settings pages to read from relevant data settings

eeae28c

Remove HNSW index for now -- to follow-up at a later time with more stress testing

abf3bb2

sabaimran marked this pull request as ready for review October 16, 2023 22:05

sabaimran requested a review from debanjum October 16, 2023 22:05

sabaimran added 2 commits October 19, 2023 15:15

Resolve merge conflicts with master

39abefd

Update unit test after merging

24ef04a

debanjum reviewed Oct 25, 2023

debanjum approved these changes Oct 25, 2023

sabaimran mentioned this pull request Oct 26, 2023

[Multi-User Part 6]: Address small bugs and upstream PR comments #518

Merged

sabaimran merged commit 216acf5 into features/multi-user-support-khoj Oct 26, 2023
5 checks passed

sabaimran deleted the features/use-config-with-multi-user branch October 26, 2023 16:42
