[Bug]: Duplicate primary key values lead to inconsistent query results. #36199

Open
1 task done
XbaoWu opened this issue Sep 12, 2024 · 3 comments
Labels
kind/bug, resolution/by-design

Comments


XbaoWu commented Sep 12, 2024

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: 2.4.5
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka): kafka   
- SDK version(e.g. pymilvus v2.0.0rc2): Java SDK 2.4.1
- OS(Ubuntu or CentOS): CentOS
- CPU/Memory: 16c / 64G
- GPU: 
- Others:

Current Behavior

When I insert two entities with the same primary key value into a collection, I observe the following:

  1. A query always returns the oldest entity.
  2. A search returns whichever of the two duplicate entities is more similar to the query vector. When the similarity is the same, the oldest entity is returned.
  3. count(*) always returns 2.

Expected Behavior

I expect the following:

  1. When I query, the returned entity should always be the same (2.4.5 currently satisfies this).
  2. When I search, similarity should be computed only against the oldest entity (that is, the entity a query would return, so that search and query visibility agree), or the current behavior should be kept.
  3. count(*) should return the number of visible entities.

From reading the source code, my understanding is that both entities with the duplicate primary key are actually stored, and that entities with duplicate primary keys are only filtered out after the 'reduce' operation in QueryNode.
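
The observed query and search results can be modelled with a toy reduce step. This is only a sketch that reproduces the behavior reported in this issue, not Milvus's actual QueryNode code. The names `Row`, `reduce_query`, and `reduce_search`, the integer timestamps, and the use of squared L2 distance are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    pk: int          # primary key value (may be duplicated)
    ts: int          # hypothetical insertion timestamp; smaller means older
    vector: tuple    # embedding


def l2(a, b):
    """Squared L2 distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def reduce_query(rows):
    """Keep one row per primary key, preferring the oldest row."""
    kept = {}
    for row in rows:
        if row.pk not in kept or row.ts < kept[row.pk].ts:
            kept[row.pk] = row
    return sorted(kept.values(), key=lambda r: r.pk)


def reduce_search(rows, query_vec, limit):
    """Rank by distance (ties go to the older row), then drop repeated keys."""
    ranked = sorted(rows, key=lambda r: (l2(r.vector, query_vec), r.ts))
    seen, hits = set(), []
    for row in ranked:
        if len(hits) == limit:
            break
        if row.pk not in seen:
            seen.add(row.pk)
            hits.append(row)
    return hits


# Two stored rows that share primary key 1.
rows = [Row(1, 100, (0.0, 0.0)), Row(1, 200, (1.0, 1.0))]

query_hits = reduce_query(rows)                           # oldest row wins
search_hits = reduce_search(rows, (0.9, 0.9), limit=10)   # closest row wins
tie_hits = reduce_search(rows, (0.5, 0.5), limit=10)      # equal distance: oldest wins
```

In this model both rows stay in storage, and only the reduce step decides which one the caller sees, so a query and a search can return different entities for the same primary key.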

Since a query or search returns at most one entity for a given primary key, it would be better for count(*) to return the number of visible entities (1 rather than 2 in this example) until primary key uniqueness is implemented.
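
The difference between the two counts can be sketched as follows. This is a minimal illustration of the suggestion, assuming "visible" means one entity per distinct primary key. `raw_count` and `visible_count` are hypothetical helper names, not Milvus APIs.

```python
def raw_count(stored_pks):
    """Count every stored row, duplicates included (what count(*) reports today)."""
    return len(stored_pks)


def visible_count(stored_pks):
    """Count one row per distinct primary key (what a query could actually return)."""
    return len(set(stored_pks))


# Two rows were inserted with the same primary key value.
stored_pks = [1, 1]

current = raw_count(stored_pks)        # 2
suggested = visible_count(stored_pks)  # 1
```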

Steps To Reproduce

1. Create a collection with auto ID disabled.
2. Create a vector index (any index type) and load the collection.
3. Insert two entities with the same primary key value.
4. Query or search the collection.
5. Run count(*) on the collection.

Milvus Log

No abnormal log information

Anything else?

No more information yet

XbaoWu added the kind/bug and needs-triage labels Sep 12, 2024
@yanliang567
Contributor

@XbaoWu Sorry for the confusion, but for now this is by design. We are working on primary key (PK) deduplication to avoid it.

/assign @XbaoWu
/unassign

sre-ci-robot assigned XbaoWu and unassigned yanliang567 Sep 12, 2024
yanliang567 added the resolution/by-design label and removed the needs-triage label Sep 12, 2024
@XbaoWu
Author

XbaoWu commented Sep 12, 2024

> @XbaoWu Sorry for the confusion, but for now this is by design. We are working on primary key (PK) deduplication to avoid it.
>
> /assign @XbaoWu /unassign

OK, I understand. Thank you for your reply.

@xiaofan-luan
Contributor

@yanliang567
It seems that duplicate PKs are an issue we need to think about. They cause a lot of misunderstanding.
