
Commit

fix python pip packaging
update all imports
bump version number in setup.py
move all vdb specific code to its own file
move vdb specific help info to docs/
dhruv-anand-aintech committed Feb 6, 2024
1 parent 385abe0 commit c60773c
Showing 53 changed files with 973 additions and 586 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -29,3 +29,4 @@ share/python-wheels/
MANIFEST
*.csv
output.txt
+ src/vdf_io/notebooks/data/
172 changes: 58 additions & 114 deletions README.md
@@ -1,12 +1,14 @@
# Vector IO

+ [![PyPI version](https://badge.fury.io/py/vdf-io.svg)](https://badge.fury.io/py/vdf-io)
+
This library uses a universal format for vector datasets to easily export and import data from all vector databases.

See the [Contributing](#contributing) section to add support for your favorite vector database.

## Supported Vector Databases

- ### (Request support for a VectorDB by voting/commenting here: https://github.com/AI-Northstar-Tech/vector-io/discussions/38)
+ ### (Request support for a VectorDB by voting/commenting here: <https://github.com/AI-Northstar-Tech/vector-io/discussions/38>)

| Vector Database | Import | Export |
|--------------------------------|--------|--------|
@@ -79,6 +81,14 @@ interface VDFMeta {

## Installation

+ ### Using pip
+
+ ```bash
+ pip install vdf-io
+ ```
+
+ ### From source
+
```bash
git clone https://github.com/AI-Northstar-Tech/vector-io.git
cd vector-io
@@ -88,146 +98,80 @@ pip install -r requirements.txt
## Export Script

```bash
- src/export_vdf.py --help
+ export_vdf --help
+ usage: export_vdf [-h] [-m MODEL_NAME]
+ [--max_file_size MAX_FILE_SIZE]
+ [--push_to_hub | --no-push_to_hub]
+ [--public | --no-public]
+ {pinecone,qdrant,kdbai,milvus,vertexai_vectorsearch}
+ ...

- usage: export.py [-h] [-m MODEL_NAME] [--max_file_size MAX_FILE_SIZE]
- [--push_to_hub | --no-push_to_hub]
- {pinecone,qdrant} ...

- Export data from a vector database to VDF
+ Export data from various vector databases to the VDF format
+ for vector datasets

options:
-h, --help show this help message and exit
-m MODEL_NAME, --model_name MODEL_NAME
Name of model used
--max_file_size MAX_FILE_SIZE
- Maximum file size in MB (default: 1024)
+ Maximum file size in MB (default:
+ 1024)
--push_to_hub, --no-push_to_hub
Push to hub
+ --public, --no-public
+ Make dataset public (default:
+ False)

Vector Databases:
Choose the vectors database to export data from

- {pinecone,qdrant,vertexai_vectorsearch}
- pinecone Export data from Pinecone
- qdrant Export data from Qdrant
- vertexai_vectorsearch Export data from Vertex AI Vector Search
- ```
-
- ```bash
- src/export_vdf.py pinecone --help
- usage: export.py pinecone [-h] [-e ENVIRONMENT] [-i INDEX]
- [-s ID_RANGE_START]
- [--id_range_end ID_RANGE_END]
- [-f ID_LIST_FILE]
- [--modify_to_search MODIFY_TO_SEARCH]
-
- options:
- -h, --help show this help message and exit
- -e ENVIRONMENT, --environment ENVIRONMENT
- Environment of Pinecone instance
- -i INDEX, --index INDEX
- Name of index to export
- -s ID_RANGE_START, --id_range_start ID_RANGE_START
- Start of id range
- --id_range_end ID_RANGE_END
- End of id range
- -f ID_LIST_FILE, --id_list_file ID_LIST_FILE
- Path to id list file
- --modify_to_search MODIFY_TO_SEARCH
- Allow modifying data to search
- ```
-
- ```bash
- src/export_vdf.py qdrant --help
- usage: export.py qdrant [-h] [-u URL] [-c COLLECTIONS]
-
- options:
- -h, --help show this help message and exit
- -u URL, --url URL Location of Qdrant instance
- -c COLLECTIONS, --collections COLLECTIONS
- Names of collections to export
- ```
-
- ```bash
- src/export_vdf.py milvus --help
- usage: export_vdf.py milvus [-h] [-u URI] [-t TOKEN] [-c COLLECTIONS]
-
- optional arguments:
- -h, --help show this help message and exit
- -u URI, --uri URI Milvus connection URI
- -t TOKEN, --token TOKEN
- Milvus connection token
- -c COLLECTIONS, --collections COLLECTIONS
- Names of collections to export
- ```
-
- ```bash
- src/export_vdf.py vertexai_vectorsearch --help
- usage: export_vdf.py vertexai_vectorsearch [-h] [-p PROJECT_ID] [-i INDEX]
- [-c GCLOUD_CREDENTIALS_FILE]
-
- options:
- -h, --help show this help message and exit
- -p PROJECT_ID, --project-id PROJECT_ID
- Google Cloud Project ID
- -i INDEX, --index INDEX
- Name of index/indexes to export (comma-separated)
- -c GCLOUD_CREDENTIALS_FILE, --gcloud-credentials-file GCLOUD_CREDENTIALS_FILE
- Google Cloud Service Account Credentials file
+ {pinecone,qdrant,kdbai,milvus,vertexai_vectorsearch}
+ pinecone Export data from Pinecone
+ qdrant Export data from Qdrant
+ kdbai Export data from KDB.AI
+ milvus Export data from Milvus
+ vertexai_vectorsearch
+ Export data from Vertex AI Vector
+ Search
```

## Import script

```bash
- src/import_vdf.py --help
- usage: import_vdf.py [-h] [-d DIR] {pinecone,qdrant} ...
+ import_vdf --help
+ usage: import_vdf [-h] [-d DIR] [-s | --subset | --no-subset]
+ [--create_new | --no-create_new]
+ {milvus,pinecone,qdrant,vertexai_vectorsearch,kdbai}
+ ...

Import data from VDF to a vector database

options:
- -h, --help show this help message and exit
- -d DIR, --dir DIR Directory to import
+ -h, --help show this help message and exit
+ -d DIR, --dir DIR Directory to import
+ -s, --subset, --no-subset
+ Import a subset of data (default: False)
+ --create_new, --no-create_new
+ Create a new index (default: False)

Vector Databases:
Choose the vectors database to export data from

- {pinecone,qdrant}
- pinecone Import data to Pinecone
- qdrant Import data to Qdrant
-
- src/import_vdf.py pinecone --help
- usage: import_vdf.py pinecone [-h] [-e ENVIRONMENT]
-
- options:
- -h, --help show this help message and exit
- -e ENVIRONMENT, --environment ENVIRONMENT
- Pinecone environment
-
- src/import_vdf.py qdrant --help
- usage: import_vdf.py qdrant [-h] [-u URL]
-
- options:
- -h, --help show this help message and exit
- -u URL, --url URL Qdrant url
-
- src/import_vdf.py vertexai_vectorsearch --help
- usage: import_vdf.py vertexai_vectorsearch [-h] [-p PROJECT_ID] [-l REGION]
-
- options:
- -h, --help show this help message and exit
- -p PROJECT_ID, --project-id PROJECT_ID
- Google Cloud Project ID
- -l REGION, --location REGION
- Google Cloud region hosting index
+ {milvus,pinecone,qdrant,vertexai_vectorsearch,kdbai}
+ milvus Import data to Milvus
+ pinecone Import data to Pinecone
+ qdrant Import data to Qdrant
+ vertexai_vectorsearch
+ Import data to Vertex AI Vector Search
+ kdbai Import data to KDB.AI
```

## Re-embed script

This Python script re-embeds a vector dataset. It takes a directory containing a vector dataset in the VDF format and re-embeds it using a new model. The script also lets you specify the name of the column containing the text to be embedded.

```bash
- src/reembed.py --help
+ reembed.py --help
usage: reembed.py [-h] -d DIR [-m NEW_MODEL_NAME]
[-t TEXT_COLUMN]

@@ -247,7 +191,7 @@ options:
## Examples

```bash
- ./export_vdf.py -m hkunlp/instructor-xl --push_to_hub pinecone --environment gcp-starter
+ export_vdf -m hkunlp/instructor-xl --push_to_hub pinecone --environment gcp-starter
```

Follow the prompt to select the index and id range to export.
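
For the re-embed and import steps, a comparable session might look like the following. The dataset directory, model name, and Qdrant URL here are illustrative placeholders, not output from the tool:

```bash
reembed.py -d ./my-vdf-dataset -m BAAI/bge-small-en-v1.5 -t text
import_vdf -d ./my-vdf-dataset qdrant -u http://localhost:6333
```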
@@ -263,17 +207,17 @@ Steps to add a new vector database (ABC):

**Export**:

- 1. Add a new subparser in `src/export_vdf.py` for the new vector database. Add database-specific arguments to the subparser, such as the URL of the database, any authentication tokens, etc.
- 2. Add a new file in `src/export_vdf/` for the new vector database. This file should define a class ExportABC which inherits from ExportVDF.
+ 1. Add a new subparser in `export_vdf_cli.py` for the new vector database. Add database-specific arguments to the subparser, such as the URL of the database, any authentication tokens, etc.
+ 2. Add a new file in `src/vdf_io/export_vdf/` for the new vector database. This file should define a class ExportABC which inherits from ExportVDF (see the sketch after these steps).
3. Specify a DB_NAME_SLUG for the class
4. The class should implement the get_data() function to download points (in a batched manner) with all the metadata from the specified index of the vector database. This data should be stored in a series of parquet files/folders.
The metadata should be stored in a json file with the [schema above](#universal-vector-dataset-format-vdf-specification).
5. Use the script to export data from an example index of the vector database and verify that the data is exported correctly.
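
As a rough sketch of steps 1-5, an exporter could look like the code below. The `abc_client` SDK is hypothetical, and the ExportVDF import path, `self.args` keys, and `self.vdf_directory` attribute are assumptions for illustration, not the library's confirmed API:

```python
import pandas as pd

import abc_client  # hypothetical client SDK for the example "ABC" database
from vdf_io.export_vdf import ExportVDF  # assumed import path


class ExportABC(ExportVDF):
    DB_NAME_SLUG = "abc"  # slug that routes the "abc" subcommand to this class

    def get_data(self):
        # Connect using the arguments registered on the "abc" subparser (step 1)
        client = abc_client.connect(self.args["url"], token=self.args["token"])
        batch, file_idx = [], 0
        for point in client.scan(self.args["index"]):  # hypothetical batched scan
            batch.append({"id": point.id, "vector": point.vector, **point.metadata})
            if len(batch) >= 10_000:
                # Flush each batch to a parquet shard in the export directory (step 4)
                pd.DataFrame(batch).to_parquet(f"{self.vdf_directory}/{file_idx}.parquet")
                batch, file_idx = [], file_idx + 1
        if batch:
            pd.DataFrame(batch).to_parquet(f"{self.vdf_directory}/{file_idx}.parquet")
        # Finally, write VDF_META.json next to the parquet files,
        # following the VDFMeta schema above
```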

**Import**:

- 1. Add a new subparser in `src/import_vdf.py` for the new vector database. Add database-specific arguments to the subparser, such as the URL of the database, any authentication tokens, etc.
- 2. Add a new file in `src/import_vdf/` for the new vector database. This file should define a class ImportABC which inherits from ImportVDF. It should implement the upsert_data() function to upload points from a VDF dataset (in a batched manner) with all the metadata to the specified index of the vector database. All metadata about the dataset should be read from the VDF_META.json file in the VDF folder.
+ 1. Add a new subparser in `import_vdf_cli.py` for the new vector database. Add database-specific arguments to the subparser, such as the URL of the database, any authentication tokens, etc.
+ 2. Add a new file in `src/vdf_io/import_vdf/` for the new vector database. This file should define a class ImportABC which inherits from ImportVDF. It should implement the upsert_data() function to upload points from a VDF dataset (in a batched manner) with all the metadata to the specified index of the vector database. All metadata about the dataset should be read from the VDF_META.json file in the VDF folder (see the sketch after these steps).
3. Use the script to import data from the example vdf dataset exported in the previous step and verify that the data is imported correctly.
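
Under the same assumptions (hypothetical `abc_client`, assumed import path, and an assumed helper for locating the parquet files recorded in VDF_META.json), an importer sketch:

```python
import pandas as pd

import abc_client  # hypothetical client SDK for the example "ABC" database
from vdf_io.import_vdf import ImportVDF  # assumed import path


class ImportABC(ImportVDF):
    DB_NAME_SLUG = "abc"

    def upsert_data(self):
        client = abc_client.connect(self.args["url"], token=self.args["token"])
        # Walk the parquet files listed in VDF_META.json (assumed helper)
        for parquet_file in self.get_parquet_files():
            df = pd.read_parquet(parquet_file)
            for start in range(0, len(df), 1_000):  # upsert in batches of 1000
                chunk = df.iloc[start:start + 1_000]
                client.upsert(
                    index=self.args["index"],
                    ids=chunk["id"].tolist(),
                    vectors=chunk["vector"].tolist(),
                )
```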

### Changing the VDF specification