
Commit

fix python pip packaging
update all imports
bump version number in setup.py
move all vdb specific code to its own file
move vdb specific help info to docs/
dhruv-anand-aintech committed Feb 6, 2024
1 parent 385abe0 commit c60773c
Showing 53 changed files with 973 additions and 586 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -29,3 +29,4 @@ share/python-wheels/
MANIFEST
*.csv
output.txt
+ src/vdf_io/notebooks/data/
172 changes: 58 additions & 114 deletions README.md
@@ -1,12 +1,14 @@
# Vector IO

+ [![PyPI version](https://badge.fury.io/py/vdf-io.svg)](https://badge.fury.io/py/vdf-io)
+
This library uses a universal format for vector datasets to easily export and import data from all vector databases.

See the [Contributing](#contributing) section to add support for your favorite vector database.

## Supported Vector Databases

- ### (Request support for a VectorDB by voting/commenting here: https://github.com/AI-Northstar-Tech/vector-io/discussions/38)
+ ### (Request support for a VectorDB by voting/commenting here: <https://github.com/AI-Northstar-Tech/vector-io/discussions/38>)

| Vector Database | Import | Export |
|--------------------------------|--------|--------|
@@ -79,6 +81,14 @@ interface VDFMeta {

## Installation

+ ### Using pip
+
+ ```bash
+ pip install vdf-io
+ ```
+
+ ### From source
+
```bash
git clone https://github.com/AI-Northstar-Tech/vector-io.git
cd vector-io
@@ -88,146 +98,80 @@ pip install -r requirements.txt
## Export Script

```bash
- src/export_vdf.py --help
+ export_vdf --help
+ usage: export_vdf [-h] [-m MODEL_NAME]
+ [--max_file_size MAX_FILE_SIZE]
+ [--push_to_hub | --no-push_to_hub]
+ [--public | --no-public]
+ {pinecone,qdrant,kdbai,milvus,vertexai_vectorsearch}
+ ...

- usage: export.py [-h] [-m MODEL_NAME] [--max_file_size MAX_FILE_SIZE]
- [--push_to_hub | --no-push_to_hub]
- {pinecone,qdrant} ...

- Export data from a vector database to VDF
+ Export data from various vector databases to the VDF format
+ for vector datasets

options:
-h, --help show this help message and exit
-m MODEL_NAME, --model_name MODEL_NAME
Name of model used
--max_file_size MAX_FILE_SIZE
- Maximum file size in MB (default: 1024)
+ Maximum file size in MB (default:
+ 1024)
--push_to_hub, --no-push_to_hub
Push to hub
+ --public, --no-public
+ Make dataset public (default:
+ False)

Vector Databases:
Choose the vectors database to export data from

- {pinecone,qdrant,vertexai_vectorsearch}
- pinecone Export data from Pinecone
- qdrant Export data from Qdrant
- vertexai_vectorsearch Export data from Vertex AI Vector Search
- ```
-
- ```bash
- src/export_vdf.py pinecone --help
- usage: export.py pinecone [-h] [-e ENVIRONMENT] [-i INDEX]
- [-s ID_RANGE_START]
- [--id_range_end ID_RANGE_END]
- [-f ID_LIST_FILE]
- [--modify_to_search MODIFY_TO_SEARCH]
-
- options:
- -h, --help show this help message and exit
- -e ENVIRONMENT, --environment ENVIRONMENT
- Environment of Pinecone instance
- -i INDEX, --index INDEX
- Name of index to export
- -s ID_RANGE_START, --id_range_start ID_RANGE_START
- Start of id range
- --id_range_end ID_RANGE_END
- End of id range
- -f ID_LIST_FILE, --id_list_file ID_LIST_FILE
- Path to id list file
- --modify_to_search MODIFY_TO_SEARCH
- Allow modifying data to search
- ```
-
- ```bash
- src/export_vdf.py qdrant --help
- usage: export.py qdrant [-h] [-u URL] [-c COLLECTIONS]
-
- options:
- -h, --help show this help message and exit
- -u URL, --url URL Location of Qdrant instance
- -c COLLECTIONS, --collections COLLECTIONS
- Names of collections to export
- ```
-
- ```bash
- src/export_vdf.py milvus --help
- usage: export_vdf.py milvus [-h] [-u URI] [-t TOKEN] [-c COLLECTIONS]
-
- optional arguments:
- -h, --help show this help message and exit
- -u URI, --uri URI Milvus connection URI
- -t TOKEN, --token TOKEN
- Milvus connection token
- -c COLLECTIONS, --collections COLLECTIONS
- Names of collections to export
- ```
-
- ```bash
- src/export_vdf.py vertexai_vectorsearch --help
- usage: export_vdf.py vertexai_vectorsearch [-h] [-p PROJECT_ID] [-i INDEX]
- [-c GCLOUD_CREDENTIALS_FILE]
-
- options:
- -h, --help show this help message and exit
- -p PROJECT_ID, --project-id PROJECT_ID
- Google Cloud Project ID
- -i INDEX, --index INDEX
- Name of index/indexes to export (comma-separated)
- -c GCLOUD_CREDENTIALS_FILE, --gcloud-credentials-file GCLOUD_CREDENTIALS_FILE
- Google Cloud Service Account Credentials file
+ {pinecone,qdrant,kdbai,milvus,vertexai_vectorsearch}
+ pinecone Export data from Pinecone
+ qdrant Export data from Qdrant
+ kdbai Export data from KDB.AI
+ milvus Export data from Milvus
+ vertexai_vectorsearch
+ Export data from Vertex AI Vector
+ Search
```

## Import script

```bash
- src/import_vdf.py --help
- usage: import_vdf.py [-h] [-d DIR] {pinecone,qdrant} ...
+ import_vdf --help
+ usage: import_vdf [-h] [-d DIR] [-s | --subset | --no-subset]
+ [--create_new | --no-create_new]
+ {milvus,pinecone,qdrant,vertexai_vectorsearch,kdbai}
+ ...

Import data from VDF to a vector database

options:
- -h, --help show this help message and exit
- -d DIR, --dir DIR Directory to import
+ -h, --help show this help message and exit
+ -d DIR, --dir DIR Directory to import
+ -s, --subset, --no-subset
+ Import a subset of data (default: False)
+ --create_new, --no-create_new
+ Create a new index (default: False)

Vector Databases:
Choose the vectors database to export data from

- {pinecone,qdrant}
- pinecone Import data to Pinecone
- qdrant Import data to Qdrant
-
- src/import_vdf.py pinecone --help
- usage: import_vdf.py pinecone [-h] [-e ENVIRONMENT]
-
- options:
- -h, --help show this help message and exit
- -e ENVIRONMENT, --environment ENVIRONMENT
- Pinecone environment
-
- src/import_vdf.py qdrant --help
- usage: import_vdf.py qdrant [-h] [-u URL]
-
- options:
- -h, --help show this help message and exit
- -u URL, --url URL Qdrant url
-
- src/import_vdf.py vertexai_vectorsearch --help
- usage: import_vdf.py vertexai_vectorsearch [-h] [-p PROJECT_ID] [-l REGION]
-
- options:
- -h, --help show this help message and exit
- -p PROJECT_ID, --project-id PROJECT_ID
- Google Cloud Project ID
- -l REGION, --location REGION
- Google Cloud region hosting index
+ {milvus,pinecone,qdrant,vertexai_vectorsearch,kdbai}
+ milvus Import data to Milvus
+ pinecone Import data to Pinecone
+ qdrant Import data to Qdrant
+ vertexai_vectorsearch
+ Import data to Vertex AI Vector Search
+ kdbai Import data to KDB.AI
```

## Re-embed script

This Python script re-embeds a vector dataset. It takes a directory containing a vector dataset in the VDF format and re-embeds it using a new model. The script also lets you specify the name of the column containing the text to be embedded.

```bash
- src/reembed.py --help
+ reembed.py --help
usage: reembed.py [-h] -d DIR [-m NEW_MODEL_NAME]
[-t TEXT_COLUMN]

@@ -247,7 +191,7 @@ options:
## Examples

```bash
- ./export_vdf.py -m hkunlp/instructor-xl --push_to_hub pinecone --environment gcp-starter
+ export_vdf -m hkunlp/instructor-xl --push_to_hub pinecone --environment gcp-starter
```

Follow the prompt to select the index and id range to export.
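
For the re-embed and import steps, a comparable session might look like the following. The dataset directory, model name, and Qdrant URL here are illustrative placeholders, not output from the tool:

```bash
reembed.py -d ./my-vdf-dataset -m BAAI/bge-small-en-v1.5 -t text
import_vdf -d ./my-vdf-dataset qdrant -u http://localhost:6333
```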
@@ -263,17 +207,17 @@ Steps to add a new vector database (ABC):

**Export**:

- 1. Add a new subparser in `src/export_vdf.py` for the new vector database. Add database-specific arguments to the subparser, such as the URL of the database, any authentication tokens, etc.
- 2. Add a new file in `src/export_vdf/` for the new vector database. This file should define a class ExportABC which inherits from ExportVDF.
+ 1. Add a new subparser in `export_vdf_cli.py` for the new vector database. Add database-specific arguments to the subparser, such as the URL of the database, any authentication tokens, etc.
+ 2. Add a new file in `src/vdf_io/export_vdf/` for the new vector database. This file should define a class ExportABC which inherits from ExportVDF (see the sketch after these steps).
3. Specify a DB_NAME_SLUG for the class
4. The class should implement the get_data() function to download points (in a batched manner) with all the metadata from the specified index of the vector database. This data should be stored in a series of parquet files/folders.
The metadata should be stored in a json file with the [schema above](#universal-vector-dataset-format-vdf-specification).
5. Use the script to export data from an example index of the vector database and verify that the data is exported correctly.
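
As a rough sketch of steps 1-5, an exporter could look like the code below. The `abc_client` SDK is hypothetical, and the ExportVDF import path, `self.args` keys, and `self.vdf_directory` attribute are assumptions for illustration, not the library's confirmed API:

```python
import pandas as pd

import abc_client  # hypothetical client SDK for the example "ABC" database
from vdf_io.export_vdf import ExportVDF  # assumed import path


class ExportABC(ExportVDF):
    DB_NAME_SLUG = "abc"  # slug that routes the "abc" subcommand to this class

    def get_data(self):
        # Connect using the arguments registered on the "abc" subparser (step 1)
        client = abc_client.connect(self.args["url"], token=self.args["token"])
        batch, file_idx = [], 0
        for point in client.scan(self.args["index"]):  # hypothetical batched scan
            batch.append({"id": point.id, "vector": point.vector, **point.metadata})
            if len(batch) >= 10_000:
                # Flush each batch to a parquet shard in the export directory (step 4)
                pd.DataFrame(batch).to_parquet(f"{self.vdf_directory}/{file_idx}.parquet")
                batch, file_idx = [], file_idx + 1
        if batch:
            pd.DataFrame(batch).to_parquet(f"{self.vdf_directory}/{file_idx}.parquet")
        # Finally, write VDF_META.json next to the parquet files,
        # following the VDFMeta schema above
```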

**Import**:

- 1. Add a new subparser in `src/import_vdf.py` for the new vector database. Add database-specific arguments to the subparser, such as the URL of the database, any authentication tokens, etc.
- 2. Add a new file in `src/import_vdf/` for the new vector database. This file should define a class ImportABC which inherits from ImportVDF. It should implement the upsert_data() function to upload points from a VDF dataset (in a batched manner) with all the metadata to the specified index of the vector database. All metadata about the dataset should be read from the VDF_META.json file in the VDF folder.
+ 1. Add a new subparser in `import_vdf_cli.py` for the new vector database. Add database-specific arguments to the subparser, such as the URL of the database, any authentication tokens, etc.
+ 2. Add a new file in `src/vdf_io/import_vdf/` for the new vector database. This file should define a class ImportABC which inherits from ImportVDF. It should implement the upsert_data() function to upload points from a VDF dataset (in a batched manner) with all the metadata to the specified index of the vector database. All metadata about the dataset should be read from the VDF_META.json file in the VDF folder (see the sketch after these steps).
3. Use the script to import data from the example vdf dataset exported in the previous step and verify that the data is imported correctly.
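
Under the same assumptions (hypothetical `abc_client`, assumed import path, and an assumed helper for locating the parquet files recorded in VDF_META.json), an importer sketch:

```python
import pandas as pd

import abc_client  # hypothetical client SDK for the example "ABC" database
from vdf_io.import_vdf import ImportVDF  # assumed import path


class ImportABC(ImportVDF):
    DB_NAME_SLUG = "abc"

    def upsert_data(self):
        client = abc_client.connect(self.args["url"], token=self.args["token"])
        # Walk the parquet files listed in VDF_META.json (assumed helper)
        for parquet_file in self.get_parquet_files():
            df = pd.read_parquet(parquet_file)
            for start in range(0, len(df), 1_000):  # upsert in batches of 1000
                chunk = df.iloc[start:start + 1_000]
                client.upsert(
                    index=self.args["index"],
                    ids=chunk["id"].tolist(),
                    vectors=chunk["vector"].tolist(),
                )
```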

### Changing the VDF specification