[Bug]: Stop GPU index on CPU machine more user friendly instead of milvus crash #27589

Closed
1 task done
binbinlv opened this issue Oct 10, 2023 · 15 comments
Labels
- func2.3.2: function issues in 2.3.2
- kind/bug: Issues or changes related to a bug
- stale: indicates no updates for 30 days
- triage/accepted: Indicates an issue or PR is ready to be actively worked on.

@binbinlv
Contributor

binbinlv commented Oct 10, 2023

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: master-20231007-80eb5434-gpu
- Deployment mode(standalone or cluster): both
- MQ type(rocksmq, pulsar or kafka):    all
- SDK version(e.g. pymilvus v2.0.0rc2): 
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

A GPU index can be created successfully on a CPU-only machine, and it can be searched too.

>>> index_param = {"index_type": "GPU_IVF_FLAT", "metric_type": "L2", "params": {"nlist": 1024}}
>>> collection.create_index("float_vector", index_param, index_name="index_name_1")
Status(code=0, message=)
>>> default_search_params = {"metric_type": "L2", "params": {"nprobe": 10}}
>>> limit = 10
>>> nq = 1
>>> collection.load()
>>> res = collection.search(vectors[:nq], "float_vector", default_search_params, limit, "int64 >= 0")
>>>
>>> res[0].ids
[0, 1372, 114, 900, 5283, 5652, 8776, 6182, 3621, 4557]

Expected Behavior

Creating a GPU index on a CPU-only machine should fail and report an error.

Steps To Reproduce

  1. Deploy Milvus (GPU image) on a CPU machine.
  2. Create a collection and create an index:
import random

import numpy as np
from pymilvus import Collection, CollectionSchema, DataType, FieldSchema, connections

# Connect to the Milvus instance using the default connection settings.
connections.connect()

dim = 128
nb = 10000

# Schema: an INT64 primary key, two scalar fields, and a float vector field.
int64_field = FieldSchema(name="int64", dtype=DataType.INT64, is_primary=True)
float_field = FieldSchema(name="float", dtype=DataType.FLOAT)
bool_field = FieldSchema(name="bool", dtype=DataType.BOOL)
float_vector = FieldSchema(name="float_vector", dtype=DataType.FLOAT_VECTOR, dim=dim)
schema = CollectionSchema(fields=[int64_field, float_field, bool_field, float_vector])
collection = Collection("test_search_collection_binbin_tmp_0", schema=schema)

# Insert nb rows of random data, one column per field in schema order.
vectors = [[random.random() for _ in range(dim)] for _ in range(nb)]
res = collection.insert([list(range(nb)), [np.float32(i) for i in range(nb)], [np.bool_(i) for i in range(nb)], vectors])

# Request a GPU index on the vector field.
index_param = {"index_type": "GPU_IVF_FLAT", "metric_type": "L2", "params": {"nlist": 1024}}
collection.create_index("float_vector", index_param, index_name="index_name_1")

Milvus Log

https://grafana-4am.zilliz.cc/explore?orgId=1&left=%7B%22datasource%22:%22Loki%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bcluster%3D%5C%22devops%5C%22,namespace%3D%5C%22chaos-testing%5C%22,pod%3D~%5C%22gpu-cpu-machine-wssbe.*%5C%22%7D%22%7D%5D,%22range%22:%7B%22from%22:%22now-1h%22,%22to%22:%22now%22%7D%7D

Anything else?

No response

@binbinlv binbinlv added kind/bug Issues or changes related to a bug needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Oct 10, 2023
@binbinlv binbinlv added this to the 2.3.2 milestone Oct 10, 2023
@yanliang567
Contributor

/assign @liliu-z @Presburger
/unassign

@sre-ci-robot sre-ci-robot assigned liliu-z and unassigned yanliang567 Oct 10, 2023
@yanliang567 yanliang567 added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Oct 10, 2023
@liliu-z
Member

liliu-z commented Oct 12, 2023

This is a GPU image, so a GPU index can be created. The case only involved 10K 128-dim vectors, which didn't trigger index building at all, so this is as expected.

@Presburger
Contributor

The data size is too small to trigger the GPU build stage.

@binbinlv
Contributor Author

Will try with a larger data size.

@yanliang567
Contributor

> This is a GPU image, so a GPU index can be created. The case only involved 10K 128-dim vectors, which didn't trigger index building at all, so this is as expected.

What is the size/rule that triggers GPU index building? @liliu-z

@yanliang567 yanliang567 added the func2.3.2 function issues in 2.3.2 label Oct 12, 2023
@Presburger
Contributor

@yanliang567 A small amount of data can also trigger an index build, but you need to flush, create the index, and load; only then will the data be sealed.
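
The flush, create, load order mentioned above can be sketched as follows. This is a minimal illustration, not the maintainers' procedure: `seal_then_index` is a hypothetical helper name, and `collection` is assumed to behave like a pymilvus `Collection` (it only needs `flush()`, `create_index()`, and `load()`). Whether a build actually starts for a given data size depends on the server, which this sketch does not check.

```python
def seal_then_index(collection, field_name, index_param, index_name):
    """Flush, create the index, then load, in the order described in the thread.

    `collection` is assumed to be a pymilvus Collection (or anything with
    the same three methods); this helper is hypothetical.
    """
    # Flush first so inserted rows leave the growing segment and get sealed.
    collection.flush()
    # Request the index on the vector field.
    collection.create_index(field_name, index_param, index_name=index_name)
    # Load the collection so it can be searched.
    collection.load()
```

With the collection from the reproduction script, the call would look like `seal_then_index(collection, "float_vector", index_param, "index_name_1")`.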

@binbinlv
Contributor Author

After inserting 5M rows and then creating a GPU_IVF_FLAT index on a CPU machine, Milvus crashed with the following error in the log:

[2023/10/13 02:51:59.738 +00:00] [DEBUG] [config/etcd_source.go:141] ["etcd refreshConfigurations"] [prefix=by-dev/config] [endpoints="[gpu-cpu-machine-qfljr-etcd:2379]"]
F20231013 02:51:59.877173    94 raft_utils.cc:24] [KNOWHERE][gpu_device_manager][milvus] CUDA error encountered at: file=/go/src/github.com/milvus-io/milvus/cmake_build/thirdparty/knowhere/knowhere-src/src/common/raft/raft_utils.cc line=22: call='cudaGetDeviceCount(&device_counts)', Reason=cudaErrorInsufficientDriver:CUDA driver version is insufficient for CUDA runtime version
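
The fatal line above comes from a failing `cudaGetDeviceCount` call. As a rough illustration of probing for devices without aborting the process, the sketch below calls the same CUDA runtime function through `ctypes` and treats every failure as "no usable GPU". It is a Python analogy, not the Knowhere code; `cuda_device_count` is a hypothetical helper, and it assumes the CUDA runtime library, if present, can be located as `cudart`.

```python
import ctypes
import ctypes.util


def cuda_device_count() -> int:
    """Return the number of usable CUDA devices, or 0 on any failure."""
    # Locate the CUDA runtime; a CPU-only host may not have it at all.
    lib_name = ctypes.util.find_library("cudart")
    if lib_name is None:
        return 0
    try:
        lib = ctypes.CDLL(lib_name)
        get_count = lib.cudaGetDeviceCount
    except (OSError, AttributeError):
        return 0
    count = ctypes.c_int(0)
    # cudaGetDeviceCount returns 0 (cudaSuccess) when the call succeeds.
    status = get_count(ctypes.byref(count))
    if status != 0:
        # For example, an insufficient driver: report no devices instead of aborting.
        return 0
    return max(count.value, 0)
```

A caller can then decide what to do with a result of 0, such as refusing a GPU index request with a clear message.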

@binbinlv
Contributor Author

Could we reject GPU index creation on a CPU machine in a more user-friendly way, for example by reporting an error in advance instead of crashing Milvus?
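
One way to picture "report an error in advance" is a check that runs before any build is attempted. The sketch below is only an illustration of the idea under stated assumptions: `GpuUnavailableError` and `check_index_request` are hypothetical names, the `GPU_` prefix test is based solely on the `GPU_IVF_FLAT` name used in this issue, and the device count is passed in by the caller.

```python
class GpuUnavailableError(RuntimeError):
    """Raised when a GPU index is requested but no GPU is available (hypothetical)."""


def check_index_request(index_param: dict, gpu_count: int) -> None:
    """Reject GPU index types up front when no GPU device is usable."""
    index_type = index_param.get("index_type", "")
    # Assumption: GPU index types are identified by a "GPU_" name prefix.
    if index_type.startswith("GPU_") and gpu_count <= 0:
        raise GpuUnavailableError(
            f"index type {index_type} requires a GPU, but no usable GPU device was found"
        )
```

Called with the `index_param` from this issue and `gpu_count=0`, the check raises a descriptive error; with a CPU index type or a positive device count it returns without doing anything.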

@binbinlv binbinlv changed the title [Bug]: GPU index could be created successfully on CPU machine [Bug]: Stop GPU index on CPU machine more user friendly instead of milvus crash Oct 13, 2023
@liliu-z
Member

liliu-z commented Oct 13, 2023

> This is a GPU image, so a GPU index can be created. The case only involved 10K 128-dim vectors, which didn't trigger index building at all, so this is as expected.
> What is the size/rule that triggers GPU index building? @liliu-z

There is no size/rule; the index was not built simply because the data was still in a growing segment.

@liliu-z
Member

liliu-z commented Oct 13, 2023

> Could we reject GPU index creation on a CPU machine in a more user-friendly way, for example by reporting an error in advance instead of crashing Milvus?

It makes sense to catch the exception and propagate it so that indexCoord can retry. @Presburger, can you help take a look?
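
The catch-and-retry idea can be sketched in a few lines. This is a Python analogy only, not the Milvus implementation: `BuildResult`, `run_build`, and `build_with_retry` are hypothetical names, and each "worker" is just a callable standing in for an index build on one node.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class BuildResult:
    ok: bool
    reason: str = ""


def run_build(build_fn: Callable[[], None]) -> BuildResult:
    """Run one build and turn any exception into a failed result instead of crashing."""
    try:
        build_fn()
    except Exception as exc:
        return BuildResult(ok=False, reason=f"{type(exc).__name__}: {exc}")
    return BuildResult(ok=True)


def build_with_retry(workers: Sequence[Callable[[], None]]) -> Tuple[BuildResult, List[str]]:
    """Try each worker in order until one succeeds, recording failure reasons."""
    failures: List[str] = []
    for build_fn in workers:
        result = run_build(build_fn)
        if result.ok:
            return result, failures
        failures.append(result.reason)
    return BuildResult(ok=False, reason="all workers failed"), failures
```

The point of the design is that a failing build surfaces as a reported reason the coordinator can act on, rather than taking the whole process down.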

@yanliang567 yanliang567 modified the milestones: 2.3.2, 2.3.3 Nov 3, 2023
@yanliang567 yanliang567 modified the milestones: 2.3.3, 2.3.4 Nov 16, 2023

stale bot commented Dec 17, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

@stale stale bot added the stale indicates no updates for 30 days label Dec 17, 2023
@binbinlv
Contributor Author

binbinlv commented Dec 18, 2023

keep open, remove stale

@binbinlv binbinlv reopened this Dec 18, 2023
@yanliang567 yanliang567 modified the milestones: 2.3.11, 2.3.12 Mar 11, 2024
@yanliang567 yanliang567 modified the milestones: 2.3.12, 2.3.13 Mar 22, 2024
sre-ci-robot pushed a commit that referenced this issue Apr 3, 2024
)

issue: #27589 
pr: #31844

---------

Signed-off-by: Cai Zhang <cai.zhang@zilliz.com>
sre-ci-robot pushed a commit that referenced this issue Apr 9, 2024
)

issue: #27589

---------

Signed-off-by: Cai Zhang <cai.zhang@zilliz.com>
jaime0815 pushed a commit that referenced this issue Apr 11, 2024
)

issue: #27589

---------

Signed-off-by: Cai Zhang <cai.zhang@zilliz.com>

(cherry picked from commit 1b76766)
Signed-off-by: jaime <yun.zhang@zilliz.com>
jaime0815 pushed a commit to jaime0815/milvus that referenced this issue Apr 12, 2024
…vus-io#31844)

issue: milvus-io#27589

---------

Signed-off-by: Cai Zhang <cai.zhang@zilliz.com>

(cherry picked from commit 1b76766)
Signed-off-by: jaime <yun.zhang@zilliz.com>
@yanliang567 yanliang567 modified the milestones: 2.3.13, 2.3.14 Apr 15, 2024
sunby pushed a commit to sunby/milvus that referenced this issue Apr 22, 2024
enhance: Throw error instead of crash when index cannot be built (milvus-io#31844)

issue: milvus-io#27589

---------

Signed-off-by: Cai Zhang <cai.zhang@zilliz.com>

(cherry picked from commit 1b76766)
Signed-off-by: jaime <yun.zhang@zilliz.com>

@yanliang567 yanliang567 modified the milestones: 2.3.14, 2.3.15 Apr 23, 2024
@yanliang567 yanliang567 modified the milestones: 2.3.15, 2.3.16 May 16, 2024

stale bot commented Jun 15, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

@stale stale bot added the stale indicates no updates for 30 days label Jun 15, 2024
@RakeshRaj97

Can we load a collection whose index was built with GPU_IVF_FLAT onto a CPU node that has more DRAM?

@stale stale bot removed the stale indicates no updates for 30 days label Jun 20, 2024
@yanliang567 yanliang567 modified the milestones: 2.3.16, 2.3.19 Jul 9, 2024
@yanliang567 yanliang567 modified the milestones: 2.3.19, 2.3.20 Jul 19, 2024
@yanliang567 yanliang567 modified the milestones: 2.3.20, 2.3.21 Aug 12, 2024

stale bot commented Sep 11, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

@stale stale bot added the stale indicates no updates for 30 days label Sep 11, 2024
@stale stale bot closed this as completed Sep 21, 2024
5 participants