Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Multi-User Part 1]: Enable storage of settings for plaintext files based on user account #498

Merged

Conversation

sabaimran
Copy link
Member

@sabaimran sabaimran commented Oct 11, 2023

Incoming

  • Partition configuration for indexing local data based on user accounts
  • Store indexed data in an underlying postgres db using the pgvector extension
  • Add migrations for all relevant user data and embeddings generation. Very little performance optimization has been done for the lookup time
  • Apply filters using SQL queries
  • Start removing many server-level configuration settings
  • Configure GitHub test actions to run during any PR. Update the test action to run in a containerized environment with a DB.
  • Update the Docker image and docker-compose.yml to work with the new application design

Closes #466, Closes #345

Closes #195. On my local analysis, memory consumption in my machine on the prior setup with my local org notes was around 10-12 GB of RAM. With the pg_vector integration, it's around 2-3 GB.

- Write processed text data to the DB using the embeddings service
- Read data from the DB for search and chat
- Update all text to jsonl processesors to use the embeddings service that writes to the DB
@khoj-ai khoj-ai deleted a comment from gitguardian bot Oct 14, 2023
@sabaimran sabaimran changed the base branch from master to features/multi-user-support-khoj October 14, 2023 04:12
@khoj-ai khoj-ai deleted a comment from gitguardian bot Oct 14, 2023
@gitguardian
Copy link

gitguardian bot commented Oct 15, 2023

⚠️ GitGuardian has uncovered 2 secrets following the scan of your pull request.

Please consider investigating the findings and remediating the incidents. Failure to do so may lead to compromising the associated services or software components.

🔎 Detected hardcoded secrets in your pull request
GitGuardian id Secret Commit Filename
8175313 Django Secret Key 869c37f src/app/settings.py View secret
8175313 Django Secret Key 6fa925e src/app/settings.py View secret
🛠 Guidelines to remediate hardcoded secrets
  1. Understand the implications of revoking this secret by investigating where it is used in your code.
  2. Replace and store your secrets safely. Learn here the best practices.
  3. Revoke and rotate these secrets.
  4. If possible, rewrite git history. Rewriting git history is not a trivial act. You might completely break other contributing developers' workflow and you risk accidentally deleting legitimate data.

To avoid such incidents in the future consider


🦉 GitGuardian detects secrets in your source code to help developers and security teams secure the modern development process. You are seeing this because you or someone else with access to this repository has authorized GitGuardian to scan your pull request.

Our GitHub checks need improvements? Share your feedbacks!

@sabaimran sabaimran marked this pull request as ready for review October 16, 2023 22:05
Copy link
Member

@debanjum debanjum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for all the work on these big set of changes! I did a review pass and left some comments but it is not comprehensive

env:
DEBIAN_FRONTEND: noninteractive
run : |
apt install -y postgresql postgresql-client && apt install -y postgresql-server-dev-14
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
apt install -y postgresql postgresql-client && apt install -y postgresql-server-dev-14
apt install -y postgresql postgresql-client postgresql-server-dev-14

run: |
sudo apt update && sudo apt install -y libegl1
apt update && apt install -y libegl1 sqlite3 libsqlite3-dev libsqlite3-0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is sqlite required here, given we're using postgres?

@@ -43,17 +53,37 @@ jobs:
with:
python-version: ${{ matrix.python_version }}

- name: Install Git
run: |
apt update && apt install -y git
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is git required here? Is it because ubuntu-jammy doesn't have git pre-installed and pip needs it to figure out the git (tag) version of the codebase?


- name: 🌡️ Validate Application
run: pre-commit run --hook-stage manual --all
run: sed -i 's/dynamic = \["version"\]/version = "0.0.0"/' pyproject.toml && pip install --upgrade .[dev]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does the version need to be reset here? I know pip install uses git tag version for figuring out the version for khoj. But given that this was working earlier, what's changed now? Moving to ubuntu-jammy? This maybe related to why git is required comment.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct, it's because it's running in a containerized environment now, which requires this additional snippet for including the version.

Comment on lines +107 to +109
if not config:
return None
return config
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will the below suggested code be equivalent (and simpler) or can config be also be False or some such?

Suggested change
if not config:
return None
return config
return config

Comment on lines +204 to +205
if len(date_filters) > 0:
min_date, max_date = date_filters
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't this unpacking operation of date_filters into min_date, max_date fail if len(date_filters) == 1?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question, but the only possible responses are None, empty list, and list of two.


# Dynamically generate search type enum by merging core search types with configured plugin search types
return Enum("SearchType", merge_dicts(core_search_types, plugin_search_types))
return Enum("SearchType", merge_dicts(core_search_types, {}))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The merge_dicts can be removed, given the plugin search types are removed?

Suggested change
return Enum("SearchType", merge_dicts(core_search_types, {}))
return Enum("SearchType", core_search_types)

@sabaimran sabaimran merged commit 216acf5 into features/multi-user-support-khoj Oct 26, 2023
5 checks passed
@sabaimran sabaimran deleted the features/use-config-with-multi-user branch October 26, 2023 16:42
sabaimran added a commit that referenced this pull request Nov 1, 2023
- 0865416: Add better parsing for XML files
- f3acfac: Add a try/catch around the dateparser in order to avoid internal server errors in app
- 7d43cd6: Chunk embeddings generation in order to avoid large memory load
- e02d751: Addresses comments from PR #498 
- a3f393e: Addresses comments from PR #503 
- 66eb078: Addresses comments from PR #511 
- Address various items in #527
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add pg_vector to store embeddings in Khoj Integrate vector DB for better memory usage
2 participants