[Bug]: Using GPU_IVF_FLAT and IP to search with the same parameters as IVF_FLAT after two data insertions brings different results #36607

qwevdb · 2024-09-29T23:08:02Z

Is there an existing issue for this?

I have searched the existing issues

Environment

- Milvus version: milvus v2.4.12-gpu
- Deployment mode(standalone or cluster): standalone
- MQ type(rocksmq, pulsar or kafka): rocksmq
- SDK version(e.g. pymilvus v2.0.0rc2): pymilvus v2.4.5
- OS(Ubuntu or CentOS): Ubuntu 24.04 LTS
- CPU/Memory: Intel Core i7-11700 / 64G
- GPU: NVIDIA GeForce RTX 4090
- Others:

Current Behavior

After two data insertions, a search on an IVF_FLAT index with the IP metric returns different results from a search on a GPU_IVF_FLAT index with the IP metric and the same parameters.
If the data is inserted only once, or the two insertions are merged into one, the results are the same.

Expected Behavior

IVF_FLAT and GPU_IVF_FLAT indexes with the IP metric and the same parameters should produce the same results.

Steps To Reproduce

1. Create an IVF_FLAT index with the IP metric on the collection.
2. Insert data into the collection twice.
3. Search.

import time
from pymilvus import Collection, connections, FieldSchema, CollectionSchema, DataType, utility
import numpy as np

FLOAT_MAX = 5000
DATA_INT_MAX = 100
categories = ["green", "blue", "yellow", "red", "black", "white", "purple", "pink", "orange", "brown", "grey"] 

numpy_random = np.random.default_rng(0)
alias = "bench"
collection_name = "Benchmark"
client = connections.connect(
    alias=alias,
    host="localhost",
    port="19530"
)
if utility.has_collection(collection_name, using=alias):
    collection = Collection(name=collection_name, using=alias)
    collection.drop()
    time.sleep(2)  
    
dim = 824
id = FieldSchema(name='id', dtype=DataType.INT64, is_primary=True)
vector = FieldSchema(name='vector', dtype=DataType.FLOAT_VECTOR, dim=dim)
field_1 = FieldSchema(name='field_1', dtype=DataType.VARCHAR, max_length=255)
field_2 = FieldSchema(name='field_2', dtype=DataType.INT64)
field_3 = FieldSchema(name='field_3', dtype=DataType.VARCHAR, max_length=255)
field_4 = FieldSchema(name='field_4', dtype=DataType.VARCHAR, max_length=255)
fields = [id, vector, field_1, field_2, field_3, field_4]
schema = CollectionSchema(fields=fields, description=alias)
collection = Collection(
    name=collection_name,
    schema=schema,
    using=alias,
)
index_params = {'index_type': 'IVF_FLAT', 'params': {'nlist': 193, 'max_empty_result_buckets': 3565}, 'metric_type': 'IP'}
# index_params = {'index_type': 'GPU_IVF_FLAT', 'params': {'nlist': 193, 'max_empty_result_buckets': 3565}, 'metric_type': 'IP'}
collection.create_index("vector", index_params, timeout=100)

# first data insert
dataset = []
number = 1527
for i in range(number):
    vector = numpy_random.integers(-DATA_INT_MAX, DATA_INT_MAX, (1, dim))
    data = {
        'id': i,
        'vector': list(vector[0]),
        'field_1': numpy_random.choice(categories),
        'field_2': numpy_random.integers(-DATA_INT_MAX, DATA_INT_MAX),
        'field_3': numpy_random.choice(categories),
        'field_4': numpy_random.choice(categories)
    }
    dataset.append(data)
collection.insert(dataset)
collection.flush()
collection.load()

# second data insert
dataset = []
number = 486
for i in range(1527,number + 1527):
    vector = numpy_random.integers(-DATA_INT_MAX, DATA_INT_MAX, (1, dim))
    data = {
        'id': i,
        'vector': list(vector[0]),
        'field_1': numpy_random.choice(categories),
        'field_2': numpy_random.integers(-DATA_INT_MAX, DATA_INT_MAX),
        'field_3': numpy_random.choice(categories),
        'field_4': numpy_random.choice(categories)
    }
    dataset.append(data)
collection.insert(dataset)
collection.flush()
collection.load()

vector = numpy_random.integers(-DATA_INT_MAX, DATA_INT_MAX, (1, dim))
query_vector = list(vector[0])
res1 = collection.search(
    data=[query_vector],
    anns_field="vector",
    param={"metric_type": "IP",
            "params": {'nprobe': 49}},
    limit = 3,
    expr='(field_4 == "yellow" || (field_3 < "red" and field_2 not in [68, 100, 69, 28, 24]))',
    timeout=100
    )
collection.release()
collection.drop_index()
collection.flush()
print(res1)
collection.drop()

result:

data: ["['id: 290, distance: 347435.0, entity: {}', 'id: 449, distance: 295739.0, entity: {}', 'id: 757, distance: 269902.0, entity: {}']"]

Change 'index_type': 'IVF_FLAT' to 'index_type': 'GPU_IVF_FLAT' in index_params and run again.

result:

data: ["['id: 290, distance: 347435.0, entity: {}', 'id: 757, distance: 269902.0, entity: {}', 'id: 672, distance: 260236.0, entity: {}']"]

If the data is inserted only once, or the two insertions are merged into one, the IVF_FLAT and GPU_IVF_FLAT indexes return the same results.
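To make the difference between the two outputs explicit, here is a small sketch that compares the two top-3 lists. The (id, distance) pairs are copied from the results in this report; the variable names are illustrative only.

```python
# (id, distance) pairs copied from the two results reported above
cpu_hits = [(290, 347435.0), (449, 295739.0), (757, 269902.0)]  # IVF_FLAT
gpu_hits = [(290, 347435.0), (757, 269902.0), (672, 260236.0)]  # GPU_IVF_FLAT

cpu_ids = [i for i, _ in cpu_hits]
gpu_ids = [i for i, _ in gpu_hits]

# ids that appear in one result but not the other
only_cpu = [i for i in cpu_ids if i not in gpu_ids]
only_gpu = [i for i in gpu_ids if i not in cpu_ids]

# for ids present in both results, check that the distances agree
shared = {i: d for i, d in cpu_hits if i in gpu_ids}
same_distances = all(shared[i] == d for i, d in gpu_hits if i in shared)

print("only in IVF_FLAT:", only_cpu)      # [449]
print("only in GPU_IVF_FLAT:", only_gpu)  # [672]
print("shared ids have equal distances:", same_distances)  # True
```

The GPU_IVF_FLAT result is missing id 449 (distance 295739.0), which scores higher than both 757 and 672, while the ids that appear in both results have identical distances.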

yanliang567 · 2024-09-30T02:09:55Z

dup to #36610

liliu-z · 2024-09-30T03:00:25Z

/assign
/assign @Presburger

Presburger · 2024-09-30T03:35:40Z

@qwevdb Welcome to using the Milvus GPU version. You can try increasing the nprobe value if you need more accurate results. A smaller nprobe sacrifices recall for better performance.
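For readers unfamiliar with the nprobe trade-off, the following is a toy NumPy sketch of an IVF-style search. It is not Milvus or its GPU index code: centroids are picked at random instead of by k-means, and the function names (`build_ivf`, `ivf_search`, `exact_search`) are hypothetical. It only illustrates that probing every list (nprobe equal to nlist) scans all vectors and so matches an exact search, while a smaller nprobe scans fewer candidates and may miss true neighbours.

```python
import numpy as np

def build_ivf(vectors, nlist, rng):
    # Toy coarse quantizer: use nlist randomly chosen vectors as centroids.
    centroids = vectors[rng.choice(len(vectors), size=nlist, replace=False)]
    # Assign every vector to its nearest centroid (squared L2 distance).
    d2 = ((vectors[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    assignment = d2.argmin(axis=1)
    lists = [np.flatnonzero(assignment == c) for c in range(nlist)]
    return centroids, lists

def ivf_search(query, vectors, centroids, lists, nprobe, k):
    # Rank centroids by inner product with the query and probe the best nprobe lists.
    probe = np.argsort(-(centroids @ query))[:nprobe]
    candidates = np.concatenate([lists[c] for c in probe])
    # Exact inner-product scoring, restricted to the probed candidates.
    scores = vectors[candidates] @ query
    top = np.argsort(-scores)[:k]
    return [int(i) for i in candidates[top]]

def exact_search(query, vectors, k):
    # Brute-force inner-product search over all vectors.
    return [int(i) for i in np.argsort(-(vectors @ query))[:k]]

rng = np.random.default_rng(0)
vectors = rng.standard_normal((2000, 32))
query = rng.standard_normal(32)
centroids, lists = build_ivf(vectors, nlist=20, rng=rng)

exact = exact_search(query, vectors, k=3)
full = ivf_search(query, vectors, centroids, lists, nprobe=20, k=3)  # probes every list
few = ivf_search(query, vectors, centroids, lists, nprobe=2, k=3)    # probes 2 of 20 lists

print("exact:     ", exact)
print("nprobe=20: ", full)  # same ids as the exact search
print("nprobe=2:  ", few)   # may or may not match, depending on the data
```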

qwevdb · 2024-09-30T04:31:23Z

> @qwevdb Welcome to using the Milvus GPU version. You can try increasing the nprobe value if you need more accurate results. A smaller nprobe sacrifices recall for better performance.

It doesn't seem to be a problem with nprobe.
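One way to check this, sketched here as an assumption rather than something that was run for this report, is to repeat the search with nprobe set to the index's nlist (193 in the script above), so that every cluster is probed. If the two index types still disagree, nprobe is unlikely to be the cause. The helper name `exhaustive_search_param` is hypothetical.

```python
def exhaustive_search_param(nlist, metric_type="IP"):
    """Build a search param dict that probes every IVF cluster (nprobe == nlist)."""
    return {"metric_type": metric_type, "params": {"nprobe": nlist}}

# nlist=193 matches the index_params used in the reproduction script.
param = exhaustive_search_param(193)
print(param)

# With a live collection, the call would mirror the one in the report, e.g.:
# res = collection.search(data=[query_vector], anns_field="vector",
#                         param=param, limit=3, timeout=100)
```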

qwevdb added the kind/bug and needs-triage labels Sep 29, 2024

qwevdb assigned yanliang567 Sep 29, 2024

yanliang567 closed this as completed Sep 30, 2024

sre-ci-robot assigned liliu-z and Presburger Sep 30, 2024

qwevdb mentioned this issue Sep 30, 2024

[Bug]: Using GPU_IVF_FLAT and L2 to search with the same parameters as IVF_FLAT after two or more data insertions brings different results #36609

Closed

1 task
