[Bug]: Using GPU_IVF_FLAT and IP to search with the same parameters as IVF_FLAT after two data insertions brings different results #36607

Closed
1 task done
qwevdb opened this issue Sep 29, 2024 · 4 comments
Labels: kind/bug, needs-triage

Comments

@qwevdb

qwevdb commented Sep 29, 2024

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: milvus v2.4.12-gpu
- Deployment mode(standalone or cluster): standalone
- MQ type(rocksmq, pulsar or kafka): rocksmq   
- SDK version(e.g. pymilvus v2.0.0rc2): pymilvus v2.4.5
- OS(Ubuntu or CentOS): Ubuntu 24.04 LTS
- CPU/Memory: Intel Core i7-11700 / 64G
- GPU: NVIDIA GeForce RTX 4090
- Others:

Current Behavior

With the IP metric, a search on an IVF_FLAT index after two data insertions returns different results from a search on a GPU_IVF_FLAT index with the same parameters.
If the data is inserted only once, or the two insertions are merged into one, the results are the same.

Expected Behavior

IVF_FLAT and GPU_IVF_FLAT indexes with the IP metric and the same parameters should produce the same results.

Steps To Reproduce

  1. Create an IVF_FLAT index with IP metric in the collection.
  2. Insert data into collection twice.
  3. Search
import time
from pymilvus import Collection, connections, FieldSchema, CollectionSchema, DataType, utility
import numpy as np

FLOAT_MAX = 5000
DATA_INT_MAX = 100
categories = ["green", "blue", "yellow", "red", "black", "white", "purple", "pink", "orange", "brown", "grey"] 

numpy_random = np.random.default_rng(0)
alias = "bench"
collection_name = "Benchmark"
client = connections.connect(
    alias=alias,
    host="localhost",
    port="19530"
)
if utility.has_collection(collection_name, using=alias):
    collection = Collection(name=collection_name, using=alias)
    collection.drop()
    time.sleep(2)  
    
dim = 824
id = FieldSchema(name='id', dtype=DataType.INT64, is_primary=True)
vector = FieldSchema(name='vector', dtype=DataType.FLOAT_VECTOR, dim=dim)
field_1 = FieldSchema(name='field_1', dtype=DataType.VARCHAR, max_length=255)
field_2 = FieldSchema(name='field_2', dtype=DataType.INT64)
field_3 = FieldSchema(name='field_3', dtype=DataType.VARCHAR, max_length=255)
field_4 = FieldSchema(name='field_4', dtype=DataType.VARCHAR, max_length=255)
fields = [id, vector, field_1, field_2, field_3, field_4]
schema = CollectionSchema(fields=fields, description=alias)
collection = Collection(
    name=collection_name,
    schema=schema,
    using=alias,
)
index_params = {'index_type': 'IVF_FLAT', 'params': {'nlist': 193, 'max_empty_result_buckets': 3565}, 'metric_type': 'IP'}
# index_params = {'index_type': 'GPU_IVF_FLAT', 'params': {'nlist': 193, 'max_empty_result_buckets': 3565}, 'metric_type': 'IP'}
collection.create_index("vector", index_params, timeout=100)

# first data insert
dataset = []
number = 1527
for i in range(0,number + 0):
    vector = numpy_random.integers(-DATA_INT_MAX, DATA_INT_MAX, (1, dim))
    data = {
        'id': i,
        'vector': list(vector[0]),
        'field_1': numpy_random.choice(categories),
        'field_2': numpy_random.integers(-DATA_INT_MAX, DATA_INT_MAX),
        'field_3': numpy_random.choice(categories),
        'field_4': numpy_random.choice(categories)
    }
    dataset.append(data)
collection.insert(dataset)
collection.flush()
collection.load()

# second data insert
dataset = []
number = 486
for i in range(1527,number + 1527):
    vector = numpy_random.integers(-DATA_INT_MAX, DATA_INT_MAX, (1, dim))
    data = {
        'id': i,
        'vector': list(vector[0]),
        'field_1': numpy_random.choice(categories),
        'field_2': numpy_random.integers(-DATA_INT_MAX, DATA_INT_MAX),
        'field_3': numpy_random.choice(categories),
        'field_4': numpy_random.choice(categories)
    }
    dataset.append(data)
collection.insert(dataset)
collection.flush()
collection.load()

vector = numpy_random.integers(-DATA_INT_MAX, DATA_INT_MAX, (1, dim))
query_vector = list(vector[0])
res1 = collection.search(
    data=[query_vector],
    anns_field="vector",
    param={"metric_type": "IP",
            "params": {'nprobe': 49}},
    limit = 3,
    expr='(field_4 == "yellow" || (field_3 < "red" and field_2 not in [68, 100, 69, 28, 24]))',
    timeout=100
    )
collection.release()
collection.drop_index()
collection.flush()
print(res1)
collection.drop()

result:

data: ["['id: 290, distance: 347435.0, entity: {}', 'id: 449, distance: 295739.0, entity: {}', 'id: 757, distance: 269902.0, entity: {}']"]
  4. Change 'index_type': 'IVF_FLAT' to 'index_type': 'GPU_IVF_FLAT' in index_params and run again.

result:

data: ["['id: 290, distance: 347435.0, entity: {}', 'id: 757, distance: 269902.0, entity: {}', 'id: 672, distance: 260236.0, entity: {}']"]
  5. If the data is inserted only once, or the two insertions are merged into one, IVF_FLAT and GPU_IVF_FLAT return the same results.
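
To make the difference between the two runs explicit, the two result lists can be compared by id and score. This is only a small sketch: the helper `diff_hits` and the variable names are illustrative and not part of pymilvus, and the hit lists are copied by hand from the two outputs above.

```python
# Top-3 hits copied from the two runs reported above, as (id, IP score).
ivf_flat_hits = [(290, 347435.0), (449, 295739.0), (757, 269902.0)]
gpu_ivf_flat_hits = [(290, 347435.0), (757, 269902.0), (672, 260236.0)]


def diff_hits(reference, candidate):
    """Return ids only in `reference`, ids only in `candidate`,
    and shared ids whose scores differ."""
    ref = dict(reference)
    cand = dict(candidate)
    missing = sorted(set(ref) - set(cand))
    extra = sorted(set(cand) - set(ref))
    mismatched = sorted(i for i in set(ref) & set(cand) if ref[i] != cand[i])
    return missing, extra, mismatched


missing, extra, mismatched = diff_hits(ivf_flat_hits, gpu_ivf_flat_hits)
print("only in IVF_FLAT:", missing)
print("only in GPU_IVF_FLAT:", extra)
print("score mismatches:", mismatched)

# With IP, a larger score is a better match. An id that was dropped but
# outscores the worst id that was kept points to a missed neighbor.
ref_scores = dict(ivf_flat_hits)
worst_kept = min(score for _, score in gpu_ivf_flat_hits)
dropped_better = [i for i in missing if ref_scores[i] > worst_kept]
print("dropped despite higher score:", dropped_better)
```

For the outputs above, id 449 (score 295739.0) appears only in the IVF_FLAT run even though it scores higher than ids 757 and 672, so the GPU_IVF_FLAT run appears to have skipped a better match rather than reordered equal scores.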
qwevdb added the kind/bug and needs-triage labels Sep 29, 2024
@yanliang567
Contributor

Duplicate of #36610

@liliu-z
Member

liliu-z commented Sep 30, 2024

/assign
/assign @Presburger

@Presburger
Contributor

@qwevdb Welcome to using the Milvus GPU version. You can try increasing the nprobe value if you need more accurate results. A smaller nprobe sacrifices recall for better performance.
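
The nprobe trade-off mentioned here can be illustrated with a toy inverted-file index in pure Python. This is a rough sketch under simplifying assumptions, not how Milvus or its GPU index is implemented: centroids are randomly picked data points rather than k-means centers, and all names (`build_toy_ivf`, `search_toy_ivf`, the sizes) are made up for the example.

```python
import random


def ip(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def build_toy_ivf(vectors, nlist, rng):
    """Pick `nlist` vectors as centroids and put every vector into the
    bucket of the centroid it has the highest inner product with."""
    centroids = rng.sample(vectors, nlist)
    buckets = [[] for _ in range(nlist)]
    for idx, vec in enumerate(vectors):
        best = max(range(nlist), key=lambda c: ip(vec, centroids[c]))
        buckets[best].append(idx)
    return centroids, buckets


def search_toy_ivf(query, vectors, centroids, buckets, nprobe, k):
    """Scan only the `nprobe` buckets whose centroids score highest
    against the query, then rank the candidates by inner product."""
    order = sorted(range(len(centroids)),
                   key=lambda c: ip(query, centroids[c]), reverse=True)
    candidates = [i for c in order[:nprobe] for i in buckets[c]]
    return sorted(candidates,
                  key=lambda i: ip(query, vectors[i]), reverse=True)[:k]


rng = random.Random(0)
dim, n, nlist, k = 16, 500, 20, 3
vectors = [[rng.randint(-100, 99) for _ in range(dim)] for _ in range(n)]
query = [rng.randint(-100, 99) for _ in range(dim)]
centroids, buckets = build_toy_ivf(vectors, nlist, rng)

# Exact answer by brute force, used as ground truth for recall.
exact = sorted(range(n), key=lambda i: ip(query, vectors[i]), reverse=True)[:k]
for nprobe in (1, 5, nlist):
    approx = search_toy_ivf(query, vectors, centroids, buckets, nprobe, k)
    recall = len(set(approx) & set(exact)) / k
    print(f"nprobe={nprobe:2d} recall@{k}={recall:.2f}")
```

When nprobe equals nlist, every bucket is scanned, so the toy index returns the same top-k scores as the brute-force search (up to ties). Smaller values scan fewer vectors and can miss true neighbors.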

@qwevdb
Author

qwevdb commented Sep 30, 2024

> @qwevdb Welcome to using the Milvus GPU version. You can try increasing the nprobe value if you need more accurate results. A smaller nprobe sacrifices recall for better performance.

It doesn't seem to be a problem with nprobe.
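
One way to check this independently of any index is to compute the exact filtered top-k on the client side and see which run agrees with it. The sketch below is only an illustration: `matches_filter` and `exact_top_k` are hypothetical helpers, the predicate is a hand-written Python mirror of the `expr` in the reproduction script, and it assumes VARCHAR comparison in the filter is lexicographic like Python string comparison. The sample rows are made up and are not the data from the report.

```python
EXCLUDED_FIELD_2 = {68, 100, 69, 28, 24}


def matches_filter(row):
    """Python mirror of the `expr` used in the search call of the script."""
    return row["field_4"] == "yellow" or (
        row["field_3"] < "red" and row["field_2"] not in EXCLUDED_FIELD_2
    )


def exact_top_k(query, rows, k):
    """Brute-force inner-product search over the rows that pass the filter."""
    scored = [
        (row["id"], float(sum(q * v for q, v in zip(query, row["vector"]))))
        for row in rows
        if matches_filter(row)
    ]
    # Larger IP is better; ties are broken by id to keep the output stable.
    return sorted(scored, key=lambda hit: (-hit[1], hit[0]))[:k]


# Tiny hand-made rows for illustration only.
rows = [
    {"id": 0, "vector": [1, 2], "field_2": 5, "field_3": "blue", "field_4": "green"},
    {"id": 1, "vector": [9, 9], "field_2": 68, "field_3": "blue", "field_4": "green"},
    {"id": 2, "vector": [3, 1], "field_2": 68, "field_3": "white", "field_4": "yellow"},
    {"id": 3, "vector": [8, 8], "field_2": 5, "field_3": "white", "field_4": "red"},
]
print(exact_top_k([1, 1], rows, k=3))
```

To apply this to the reproduction script, the rows from both insertions would have to be kept (the script reuses the name `dataset` for the second batch) and passed together with `query_vector`. An index searched with nprobe equal to nlist would be expected to match this exact result.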
