[Bug]: querynode restarts due to SIGSEGV: segmentation violation after etcd follower pod failure chaos test #35483

Closed
1 task done
zhuwenxing opened this issue Aug 15, 2024 · 18 comments
Labels
kind/bug Issues or changes related a bug severity/critical Critical, lead to crash, data missing, wrong result, function totally doesn't work. stale indicates no udpates for 30 days test/chaos chaos test triage/accepted Indicates an issue or PR is ready to be actively worked on.

@zhuwenxing
Contributor

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version:master-20240814-c42976ee-amd64
- Deployment mode(standalone or cluster):
- MQ type(rocksmq, pulsar or kafka):    
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

I20240814 09:38:05.100278  6108 SegmentSealedImpl.cpp:108] [SERVER][LoadVecIndex][milvus] Before setting field_bit for field index, fieldID:111. segmentID:451838354631885067, 
I20240814 09:38:05.100486  6108 SegmentSealedImpl.cpp:125] [SERVER][LoadVecIndex][milvus] Has load vec index done, fieldID:111. segmentID:451838354631885067, 
[2024/08/14 09:38:05.100 +00:00] [INFO] [segments/segment.go:1207] ["updateSegmentIndex done"] [traceID=d3b3e901a43f7bf13fa720efa7d76e14] [collectionID=451838354629667593] [partitionID=451838354629667594] [segmentID=451838354631885067] [fieldID=111]
I20240814 09:38:05.100801  6111 load_index_c.cpp:236] [SERVER][AppendIndexV2][milvus] [collection=451838354629667593][segment=451838354631885067][field=100][enable_mmap=false] load index 451838354629667625
[2024-08-14T09:38:05Z INFO  tantivy::indexer::segment_updater] save metas
add<folly::futures::detail::CoreBase::doCallback(folly::Executor::KeepAlive<>&&, folly::futures::detail::State)::<lambda(folly::Executor::KeepAlive<>&&)> >
	/root/.conan/data/folly/2023.10.30.08/milvus/dev/build/71e52ec7e6bdcb39e8f12e598f0e25527e54965c/folly/Executor.h:186 pc=0x7f6b52b2334c
operator()<folly::futures::detail::CoreBase::doCallback(folly::Executor::KeepAlive<>&&, folly::futures::detail::State)::<lambda(folly::Executor::KeepAlive<>&&)> >
	/root/.conan/data/folly/2023.10.30.08/milvus/dev/build/71e52ec7e6bdcb39e8f12e598f0e25527e54965c/folly/futures/detail/Core.cpp:583 pc=0x7f6b52b2334c
_ZN5folly7futures6detail8CoreBase10doCallbackEONS_8Executor9KeepAliveIS3_EENS1_5StateE
	/root/.conan/data/folly/2023.10.30.08/milvus/dev/build/71e52ec7e6bdcb39e8f12e598f0e25527e54965c/folly/futures/detail/Core.cpp:608 pc=0x7f6b52b2334c
_ZN5folly7futures6detail8CoreBase12setCallback_EONS_8FunctionIFvRS2_ONS_8Executor9KeepAliveIS5_EEPNS_17exception_wrapperEEEEOSt10shared_ptrINS_14RequestContextEENS1_18InlineContinuationE
	/root/.conan/data/folly/2023.10.30.08/milvus/dev/build/71e52ec7e6bdcb39e8f12e598f0e25527e54965c/folly/futures/detail/Core.cpp:468 pc=0x7f6b52b24053
I20240814 09:38:05.205202  6111 load_index_c.cpp:300] [SERVER][AppendIndexV2][milvus] [collection=451838354629667593][segment=451838354631885067][field=100][enable_mmap=false] load index 451838354629667625 done
[2024/08/14 09:38:05.205 +00:00] [INFO] [segments/segment.go:1207] ["updateSegmentIndex done"] [traceID=d3b3e901a43f7bf13fa720efa7d76e14] [collectionID=451838354629667593] [partitionID=451838354629667594] [segmentID=451838354631885067] [fieldID=100]
I20240814 09:38:05.205718  6111 load_index_c.cpp:236] [SERVER][AppendIndexV2][milvus] [collection=451838354629667593][segment=451838354631885067][field=101][enable_mmap=false] load index 451838354629667646
[2024-08-14T09:38:05Z INFO  tantivy::indexer::segment_updater] save metas
setCallback<folly::futures::detail::FutureBase<folly::Unit>::thenImplementation<folly::Future<folly::Unit>::thenTry<milvus::futures::Future<milvus::SearchResult>::asyncProduce<AsyncSearch(CTraceContext, CSegmentInterface, CSearchPlan, CPlaceholderGroup, uint64_t)::<lambda(milvus::futures::CancellationToken)> >(folly::Executor::KeepAlive<>, int, AsyncSearch(CTraceContext, CSegmentInterface, CSearchPlan, CPlaceholderGroup, uint64_t)::<lambda(milvus::futures::CancellationToken)>&&)::<lambda(auto:85&&)>&>(milvus::futures::Future<milvus::SearchResult>::asyncProduce<AsyncSearch(CTraceContext, CSegmentInterface, CSearchPlan, CPlaceholderGroup, uint64_t)::<lambda(milvus::futures::CancellationToken)> >(folly::Executor::KeepAlive<>, int, AsyncSearch(CTraceContext, CSegmentInterface, CSearchPlan, CPlaceholderGroup, uint64_t)::<lambda(milvus::futures::CancellationToken)>&&)::<lambda(auto:85&&)>&) &&::<lambda(folly::Executor::KeepAlive<>&&, folly::Try<folly::Unit>&&)>, folly::futures::detail::tryExecutorCallableResult<folly::Unit, folly::Future<folly::Unit>::thenTry<milvus::futures::Future<milvus::SearchResult>::asyncProduce<AsyncSearch(CTraceContext, CSegmentInterface, CSearchPlan, CPlaceholderGroup, uint64_t)::<lambda(milvus::futures::CancellationToken)> >(folly::Executor::KeepAlive<>, int, AsyncSearch(CTraceContext, CSegmentInterface, CSearchPlan, CPlaceholderGroup, uint64_t)::<lambda(milvus::futures::CancellationToken)>&&)::<lambda(auto:85&&)>&>(milvus::futures::Future<milvus::SearchResult>::asyncProduce<AsyncSearch(CTraceContext, CSegmentInterface, CSearchPlan, CPlaceholderGroup, uint64_t)::<lambda(milvus::futures::CancellationToken)> >(folly::Executor::KeepAlive<>, int, AsyncSearch(CTraceContext, CSegmentInterface, CSearchPlan, CPlaceholderGroup, uint64_t)::<lambda(milvus::futures::CancellationToken)>&&)::<lambda(auto:85&&)>&) &&::<lambda(folly::Executor::KeepAlive<>&&, folly::Try<folly::Unit>&&)>, void> 
>(folly::Future<folly::Unit>::thenTry<milvus::futures::Future<milvus::SearchResult>::asyncProduce<AsyncSearch(CTraceContext, CSegmentInterface, CSearchPlan, CPlaceholderGroup, uint64_t)::<lambda(milvus::futures::CancellationToken)> >(folly::Executor::KeepAlive<>, int, AsyncSearch(CTraceContext, CSegmentInterface, CSearchPlan, CPlaceholderGroup, uint64_t)::<lambda(milvus::futures::CancellationToken)>&&)::<lambda(auto:85&&)>&>(milvus::futures::Future<milvus::SearchResult>::asyncProduce<AsyncSearch(CTraceContext, CSegmentInterface, CSearchPlan, CPlaceholderGroup, uint64_t)::<lambda(milvus::futures::CancellationToken)> >(folly::Executor::KeepAlive<>, int, AsyncSearch(CTraceContext, CSegmentInterface, CSearchPlan, CPlaceholderGroup, uint64_t)::<lambda(milvus::futures::CancellationToken)>&&)::<lambda(auto:85&&)>&) &&::<lambda(folly::Executor::KeepAlive<>&&, folly::Try<folly::Unit>&&)>&&, folly::futures::detail::tryExecutorCallableResult<folly::Unit, folly::Future<folly::Unit>::thenTry<milvus::futures::Future<milvus::SearchResult>::asyncProduce<AsyncSearch(CTraceContext, CSegmentInterface, CSearchPlan, CPlaceholderGroup, uint64_t)::<lambda(milvus::futures::CancellationToken)> >(folly::Executor::KeepAlive<>, int, AsyncSearch(CTraceContext, CSegmentInterface, CSearchPlan, CPlaceholderGroup, uint64_t)::<lambda(milvus::futures::CancellationToken)>&&)::<lambda(auto:85&&)>&>(milvus::futures::Future<milvus::SearchResult>::asyncProduce<AsyncSearch(CTraceContext, CSegmentInterface, CSearchPlan, CPlaceholderGroup, uint64_t)::<lambda(milvus::futures::CancellationToken)> >(folly::Executor::KeepAlive<>, int, AsyncSearch(CTraceContext, CSegmentInterface, CSearchPlan, CPlaceholderGroup, uint64_t)::<lambda(milvus::futures::CancellationToken)>&&)::<lambda(auto:85&&)>&) &&::<lambda(folly::Executor::KeepAlive<>&&, folly::Try<folly::Unit>&&)>, void>, folly::futures::detail::InlineContinuation)::<lambda(folly::Executor::KeepAlive<>&&, folly::Try<folly::Unit>&&)> >
	/root/.conan/data/folly/2023.10.30.08/milvus/dev/package/71e52ec7e6bdcb39e8f12e598f0e25527e54965c/include/folly/futures/detail/Core.h:632 pc=0x7f6b59d86277
setCallback_<folly::futures::detail::FutureBase<folly::Unit>::thenImplementation<folly::Future<folly::Unit>::thenTry<milvus::futures::Future<milvus::SearchResult>::asyncProduce<AsyncSearch(CTraceContext, CSegmentInterface, CSearchPlan, CPlaceholderGroup, uint64_t)::<lambda(milvus::futures::CancellationToken)> >(folly::Executor::KeepAlive<>, int, AsyncSearch(CTraceContext, CSegmentInterface, CSearchPlan, CPlaceholderGroup, uint64_t)::<lambda(milvus::futures::CancellationToken)>&&)::<lambda(auto:85&&)>&>(milvus::futures::Future<milvus::SearchResult>::asyncProduce<AsyncSearch(CTraceContext, CSegmentInterface, CSearchPlan, CPlaceholderGroup, uint64_t)::<lambda(milvus::futures::CancellationToken)> >(folly::Executor::KeepAlive<>, int, AsyncSearch(CTraceContext, CSegmentInterface, CSearchPlan, CPlaceholderGroup, uint64_t)::<lambda(milvus::futures::CancellationToken)>&&)::<lambda(auto:85&&)>&) &&::<lambda(folly::Executor::KeepAlive<>&&, folly::Try<folly::Unit>&&)>, folly::futures::detail::tryExecutorCallableResult<folly::Unit, folly::Future<folly::Unit>::thenTry<milvus::futures::Future<milvus::SearchResult>::asyncProduce<AsyncSearch(CTraceContext, CSegmentInterface, CSearchPlan, CPlaceholderGroup, uint64_t)::<lambda(milvus::futures::CancellationToken)> >(folly::Executor::KeepAlive<>, int, AsyncSearch(CTraceContext, CSegmentInterface, CSearchPlan, CPlaceholderGroup, uint64_t)::<lambda(milvus::futures::CancellationToken)>&&)::<lambda(auto:85&&)>&>(milvus::futures::Future<milvus::SearchResult>::asyncProduce<AsyncSearch(CTraceContext, CSegmentInterface, CSearchPlan, CPlaceholderGroup, uint64_t)::<lambda(milvus::futures::CancellationToken)> >(folly::Executor::KeepAlive<>, int, AsyncSearch(CTraceContext, CSegmentInterface, CSearchPlan, CPlaceholderGroup, uint64_t)::<lambda(milvus::futures::CancellationToken)>&&)::<lambda(auto:85&&)>&) &&::<lambda(folly::Executor::KeepAlive<>&&, folly::Try<folly::Unit>&&)>, void> 
>(folly::Future<folly::Unit>::thenTry<milvus::futures::Future<milvus::SearchResult>::asyncProduce<AsyncSearch(CTraceContext, CSegmentInterface, CSearchPlan, CPlaceholderGroup, uint64_t)::<lambda(milvus::futures::CancellationToken)> >(folly::Executor::KeepAlive<>, int, AsyncSearch(CTraceContext, CSegmentInterface, CSearchPlan, CPlaceholderGroup, uint64_t)::<lambda(milvus::futures::CancellationToken)>&&)::<lambda(auto:85&&)>&>(milvus::futures::Future<milvus::SearchResult>::asyncProduce<AsyncSearch(CTraceContext, CSegmentInterface, CSearchPlan, CPlaceholderGroup, uint64_t)::<lambda(milvus::futures::CancellationToken)> >(folly::Executor::KeepAlive<>, int, AsyncSearch(CTraceContext, CSegmentInterface, CSearchPlan, CPlaceholderGroup, uint64_t)::<lambda(milvus::futures::CancellationToken)>&&)::<lambda(auto:85&&)>&) &&::<lambda(folly::Executor::KeepAlive<>&&, folly::Try<folly::Unit>&&)>&&, folly::futures::detail::tryExecutorCallableResult<folly::Unit, folly::Future<folly::Unit>::thenTry<milvus::futures::Future<milvus::SearchResult>::asyncProduce<AsyncSearch(CTraceContext, CSegmentInterface, CSearchPlan, CPlaceholderGroup, uint64_t)::<lambda(milvus::futures::CancellationToken)> >(folly::Executor::KeepAlive<>, int, AsyncSearch(CTraceContext, CSegmentInterface, CSearchPlan, CPlaceholderGroup, uint64_t)::<lambda(milvus::futures::CancellationToken)>&&)::<lambda(auto:85&&)>&>(milvus::futures::Future<milvus::SearchResult>::asyncProduce<AsyncSearch(CTraceContext, CSegmentInterface, CSearchPlan, CPlaceholderGroup, uint64_t)::<lambda(milvus::futures::CancellationToken)> >(folly::Executor::KeepAlive<>, int, AsyncSearch(CTraceContext, CSegmentInterface, CSearchPlan, CPlaceholderGroup, uint64_t)::<lambda(milvus::futures::CancellationToken)>&&)::<lambda(auto:85&&)>&) &&::<lambda(folly::Executor::KeepAlive<>&&, folly::Try<folly::Unit>&&)>, void>, folly::futures::detail::InlineContinuation)::<lambda(folly::Executor::KeepAlive<>&&, folly::Try<folly::Unit>&&)> >
	/root/.conan/data/folly/2023.10.30.08/milvus/dev/package/71e52ec7e6bdcb39e8f12e598f0e25527e54965c/include/folly/futures/Future-inl.h:310 pc=0x7f6b59d86277
setCallback_<folly::futures::detail::FutureBase<folly::Unit>::thenImplementation<folly::Future<folly::Unit>::thenTry<milvus::futures::Future<milvus::SearchResult>::asyncProduce<AsyncSearch(CTraceContext, CSegmentInterface, CSearchPlan, CPlaceholderGroup, uint64_t)::<lambda(milvus::futures::CancellationToken)> >(folly::Executor::KeepAlive<>, int, AsyncSearch(CTraceContext, CSegmentInterface, CSearchPlan, CPlaceholderGroup, uint64_t)::<lambda(milvus::futures::CancellationToken)>&&)::<lambda(auto:85&&)>&>(milvus::futures::Future<milvus::SearchResult>::asyncProduce<AsyncSearch(CTraceContext, CSegmentInterface, CSearchPlan, CPlaceholderGroup, uint64_t)::<lambda(milvus::futures::CancellationToken)> >(folly::Executor::KeepAlive<>, int, AsyncSearch(CTraceContext, CSegmentInterface, CSearchPlan, CPlaceholderGroup, uint64_t)::<lambda(milvus::futures::CancellationToken)>&&)::<lambda(auto:85&&)>&) &&::<lambda(folly::Executor::KeepAlive<>&&, folly::Try<folly::Unit>&&)>, folly::futures::detail::tryExecutorCallableResult<folly::Unit, folly::Future<folly::Unit>::thenTry<milvus::futures::Future<milvus::SearchResult>::asyncProduce<AsyncSearch(CTraceContext, CSegmentInterface, CSearchPlan, CPlaceholderGroup, uint64_t)::<lambda(milvus::futures::CancellationToken)> >(folly::Executor::KeepAlive<>, int, AsyncSearch(CTraceContext, CSegmentInterface, CSearchPlan, CPlaceholderGroup, uint64_t)::<lambda(milvus::futures::CancellationToken)>&&)::<lambda(auto:85&&)>&>(milvus::futures::Future<milvus::SearchResult>::asyncProduce<AsyncSearch(CTraceContext, CSegmentInterface, CSearchPlan, CPlaceholderGroup, uint64_t)::<lambda(milvus::futures::CancellationToken)> >(folly::Executor::KeepAlive<>, int, AsyncSearch(CTraceContext, CSegmentInterface, CSearchPlan, CPlaceholderGroup, uint64_t)::<lambda(milvus::futures::CancellationToken)>&&)::<lambda(auto:85&&)>&) &&::<lambda(folly::Executor::KeepAlive<>&&, folly::Try<folly::Unit>&&)>, void> 
>(folly::Future<folly::Unit>::thenTry<milvus::futures::Future<milvus::SearchResult>::asyncProduce<AsyncSearch(CTraceContext, CSegmentInterface, CSearchPlan, CPlaceholderGroup, uint64_t)::<lambda(milvus::futures::CancellationToken)> >(folly::Executor::KeepAlive<>, int, AsyncSearch(CTraceContext, CSegmentInterface, CSearchPlan, CPlaceholderGroup, uint64_t)::<lambda(milvus::futures::CancellationToken)>&&)::<lambda(auto:85&&)>&>(milvus::futures::Future<milvus::SearchResult>::asyncProduce<AsyncSearch(CTraceContext, CSegmentInterface, CSearchPlan, CPlaceholderGroup, uint64_t)::<lambda(milvus::futures::CancellationToken)> >(folly::Executor::KeepAlive<>, int, AsyncSearch(CTraceContext, CSegmentInterface, CSearchPlan, CPlaceholderGroup, uint64_t)::<lambda(milvus::futures::CancellationToken)>&&)::<lambda(auto:85&&)>&) &&::<lambda(folly::Executor::KeepAlive<>&&, folly::Try<folly::Unit>&&)>&&, folly::futures::detail::tryExecutorCallableResult<folly::Unit, folly::Future<folly::Unit>::thenTry<milvus::futures::Future<milvus::SearchResult>::asyncProduce<AsyncSearch(CTraceContext, CSegmentInterface, CSearchPlan, CPlaceholderGroup, uint64_t)::<lambda(milvus::futures::CancellationToken)> >(folly::Executor::KeepAlive<>, int, AsyncSearch(CTraceContext, CSegmentInterface, CSearchPlan, CPlaceholderGroup, uint64_t)::<lambda(milvus::futures::CancellationToken)>&&)::<lambda(auto:85&&)>&>(milvus::futures::Future<milvus::SearchResult>::asyncProduce<AsyncSearch(CTraceContext, CSegmentInterface, CSearchPlan, CPlaceholderGroup, uint64_t)::<lambda(milvus::futures::CancellationToken)> >(folly::Executor::KeepAlive<>, int, AsyncSearch(CTraceContext, CSegmentInterface, CSearchPlan, CPlaceholderGroup, uint64_t)::<lambda(milvus::futures::CancellationToken)>&&)::<lambda(auto:85&&)>&) &&::<lambda(folly::Executor::KeepAlive<>&&, folly::Try<folly::Unit>&&)>, void>, folly::futures::detail::InlineContinuation)::<lambda(folly::Executor::KeepAlive<>&&, folly::Try<folly::Unit>&&)> >
	/root/.conan/data/folly/2023.10.30.08/milvus/dev/package/71e52ec7e6bdcb39e8f12e598f0e25527e54965c/include/folly/futures/Future-inl.h:318 pc=0x7f6b59d86277
thenImplementation<folly::Future<folly::Unit>::thenTry<milvus::futures::Future<milvus::SearchResult>::asyncProduce<AsyncSearch(CTraceContext, CSegmentInterface, CSearchPlan, CPlaceholderGroup, uint64_t)::<lambda(milvus::futures::CancellationToken)> >(folly::Executor::KeepAlive<>, int, AsyncSearch(CTraceContext, CSegmentInterface, CSearchPlan, CPlaceholderGroup, uint64_t)::<lambda(milvus::futures::CancellationToken)>&&)::<lambda(auto:85&&)>&>(milvus::futures::Future<milvus::SearchResult>::asyncProduce<AsyncSearch(CTraceContext, CSegmentInterface, CSearchPlan, CPlaceholderGroup, uint64_t)::<lambda(milvus::futures::CancellationToken)> >(folly::Executor::KeepAlive<>, int, AsyncSearch(CTraceContext, CSegmentInterface, CSearchPlan, CPlaceholderGroup, uint64_t)::<lambda(milvus::futures::CancellationToken)>&&)::<lambda(auto:85&&)>&) &&::<lambda(folly::Executor::KeepAlive<>&&, folly::Try<folly::Unit>&&)>, folly::futures::detail::tryExecutorCallableResult<folly::Unit, folly::Future<folly::Unit>::thenTry<milvus::futures::Future<milvus::SearchResult>::asyncProduce<AsyncSearch(CTraceContext, CSegmentInterface, CSearchPlan, CPlaceholderGroup, uint64_t)::<lambda(milvus::futures::CancellationToken)> >(folly::Executor::KeepAlive<>, int, AsyncSearch(CTraceContext, CSegmentInterface, CSearchPlan, CPlaceholderGroup, uint64_t)::<lambda(milvus::futures::CancellationToken)>&&)::<lambda(auto:85&&)>&>(milvus::futures::Future<milvus::SearchResult>::asyncProduce<AsyncSearch(CTraceContext, CSegmentInterface, CSearchPlan, CPlaceholderGroup, uint64_t)::<lambda(milvus::futures::CancellationToken)> >(folly::Executor::KeepAlive<>, int, AsyncSearch(CTraceContext, CSegmentInterface, CSearchPlan, CPlaceholderGroup, uint64_t)::<lambda(milvus::futures::CancellationToken)>&&)::<lambda(auto:85&&)>&) &&::<lambda(folly::Executor::KeepAlive<>&&, folly::Try<folly::Unit>&&)>, void> >
	/root/.conan/data/folly/2023.10.30.08/milvus/dev/package/71e52ec7e6bdcb39e8f12e598f0e25527e54965c/include/folly/futures/Future-inl.h:379 pc=0x7f6b59d86277
thenTry<milvus::futures::Future<milvus::SearchResult>::asyncProduce<AsyncSearch(CTraceContext, CSegmentInterface, CSearchPlan, CPlaceholderGroup, uint64_t)::<lambda(milvus::futures::CancellationToken)> >(folly::Executor::KeepAlive<>, int, AsyncSearch(CTraceContext, CSegmentInterface, CSearchPlan, CPlaceholderGroup, uint64_t)::<lambda(milvus::futures::CancellationToken)>&&)::<lambda(auto:85&&)>&>
	/root/.conan/data/folly/2023.10.30.08/milvus/dev/package/71e52ec7e6bdcb39e8f12e598f0e25527e54965c/include/folly/futures/Future-inl.h:945 pc=0x7f6b59d86277
then<milvus::futures::Future<milvus::SearchResult>::asyncProduce<AsyncSearch(CTraceContext, CSegmentInterface, CSearchPlan, CPlaceholderGroup, uint64_t)::<lambda(milvus::futures::CancellationToken)> >(folly::Executor::KeepAlive<>, int, AsyncSearch(CTraceContext, CSegmentInterface, CSearchPlan, CPlaceholderGroup, uint64_t)::<lambda(milvus::futures::CancellationToken)>&&)::<lambda(auto:85&&)>&>
	/root/.conan/data/folly/2023.10.30.08/milvus/dev/package/71e52ec7e6bdcb39e8f12e598f0e25527e54965c/include/folly/futures/Future.h:1240 pc=0x7f6b59d86277
asyncProduce<AsyncSearch(CTraceContext, CSegmentInterface, CSearchPlan, CPlaceholderGroup, uint64_t)::<lambda(milvus::futures::CancellationToken)> >
	/workspace/source/internal/core/src/futures/Future.h:188 pc=0x7f6b59d86277
async<AsyncSearch(CTraceContext, CSegmentInterface, CSearchPlan, CPlaceholderGroup, uint64_t)::<lambda(milvus::futures::CancellationToken)> >
	/workspace/source/internal/core/src/futures/Future.h:98 pc=0x7f6b59d86277
AsyncSearch
	/workspace/source/internal/core/src/segcore/segment_c.cpp:121 pc=0x7f6b59d86277
_cgo_548efe5569b7_Cfunc_AsyncSearch
	/tmp/go-build/cgo-gcc-prolog:121 pc=0x501a1ec
runtime.asmcgocall
	/usr/local/go/src/runtime/asm_amd64.s:872 pc=0x1ef4087


SIGSEGV: segmentation violation
PC=0x7f6b52987c89 m=3092 sigcode=1
signal arrived during cgo execution
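
For readers unfamiliar with the crash header above: the Go runtime prints the raw `si_code` of the signal as `sigcode`. Below is a minimal sketch of decoding that line, assuming the querynode ran on Linux, where `si_code` 1 for SIGSEGV is `SEGV_MAPERR` and 2 is `SEGV_ACCERR`. The name `decode_crash_header` is hypothetical and not part of Milvus.

```python
import re

# Linux si_code values for SIGSEGV. Assumption: the querynode ran on Linux.
SEGV_CODES = {
    1: "SEGV_MAPERR",  # address not mapped to any object
    2: "SEGV_ACCERR",  # address mapped, but access not permitted
}

HEADER_RE = re.compile(r"PC=(0x[0-9a-fA-F]+) m=(\d+) sigcode=(-?\d+)")


def decode_crash_header(line):
    """Parse the 'PC=... m=... sigcode=...' line printed by the Go runtime."""
    match = HEADER_RE.search(line)
    if match is None:
        raise ValueError("not a Go crash header: %r" % line)
    sigcode = int(match.group(3))
    return {
        "pc": int(match.group(1), 16),
        "m": int(match.group(2)),
        "sigcode": sigcode,
        "meaning": SEGV_CODES.get(sigcode, "unknown"),
    }


info = decode_crash_header("PC=0x7f6b52987c89 m=3092 sigcode=1")
print(info["meaning"])
```

Under that assumption, `sigcode=1` says the faulting address was not mapped. It does not by itself identify the cause.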

Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-cron/detail/chaos-test-cron/17071/pipeline

log:
artifacts-etcd-followers-pod-failure-17071-server-logs.tar.gz

Anything else?

No response

@zhuwenxing zhuwenxing added kind/bug Issues or changes related a bug needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Aug 15, 2024
@zhuwenxing zhuwenxing changed the title [Bug]: querynode restarts due to SIGSEGV: segmentation violation after etcd pod failure chaos test [Bug]: querynode restarts due to SIGSEGV: segmentation violation after etcd follower pod failure chaos test Aug 15, 2024
@zhuwenxing zhuwenxing added the test/chaos chaos test label Aug 15, 2024
@zhuwenxing zhuwenxing added this to the 2.5.0 milestone Aug 15, 2024
@zhuwenxing zhuwenxing added the severity/critical Critical, lead to crash, data missing, wrong result, function totally doesn't work. label Aug 15, 2024
@binbinlv
Contributor

/assign @weiliu1031
Could you please have a look? Thanks.

@binbinlv binbinlv added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Aug 15, 2024
@xiaofan-luan
Contributor

xiaofan-luan commented Aug 15, 2024

@zhuwenxing
Is this only an issue on master? Is this on ARM or x86?

@xiaofan-luan
Contributor

xiaofan-luan commented Aug 15, 2024

@zhuwenxing

Please make sure you are using the version with no clusterIP for the etcd kill test. I see some errors saying etcd is not connected. Check with @LoveEachDay and make sure you use the correct setup.

Ideally we shouldn't see a panic when the etcd connection fails:

[2024/08/14 09:08:21.394 +00:00] [DEBUG] [querynode/service.go:118] ["QueryNode connect to etcd failed"] [error="context deadline exceeded"]
[2024/08/14 09:08:21.394 +00:00] [ERROR] [components/query_node.go:56] ["QueryNode starts error"] [error="context deadline exceeded"] [stack="github.com/milvus-io/milvus/cmd/components.(*QueryNode).Run\n\t/workspace/source/cmd/components/query_node.go:56\ngithub.com/milvus-io/milvus/cmd/roles.runComponent[...].func1\n\t/workspace/source/cmd/roles/roles.go:121"]
panic: context deadline exceeded

Still checking the SIGSEGV issue.

@chyezh
Contributor

chyezh commented Aug 16, 2024

@zhuwenxing Has this issue been reproduced?

@chyezh
Contributor

chyezh commented Aug 16, 2024

/assign chyezh

@zhuwenxing
Contributor Author

@xiaofan-luan
Only reproduced on master. It's AMD, because the testing cluster consists of AMD machines.

@LoveEachDay
The instance was created with helm chart milvus-4.2.5.tgz. Can you help check the setup?

@chyezh
It is not reliably reproducible. So far it has only happened once.

@LoveEachDay
Contributor

[image]
Using three headless-service addresses for the three etcd members, with etcd 3.5.14.

@cqy123456
Contributor

cqy123456 commented Aug 16, 2024

The crash is in an async search on segment 451838354632090846:

[2024/08/14 09:37:56.144 +00:00] [DEBUG] [segments/segment.go:499] ["search segment..."] [traceID=0afd39eaef3452ce9e8c8832ac9a6c58] [collectionID=451838354629467151] [segmentID=451838354632090846] [segmentType=Sealed] [withIndex=false]

SIGNAL CATCH BY NON-GO SIGNAL HANDLER
SIGNO: 11; SIGNAME: Segmentation fault; SI_CODE: 1; SI_ADDR: 0x7f6864980050

but this segment was still loading:

[2024/08/14 09:37:58.006 +00:00] [INFO] [segments/segment_loader.go:541] ["start loading remote..."] [traceID=0b7133f7d6374b23acbde92672342745] [collectionID=451838354629467151] [segmentIDs="[451838354632090846]"] [segmentNum=1]
[2024/08/14 09:37:58.006 +00:00] [INFO] [segments/segment_loader.go:551] ["loading bloom filter for remote..."] [traceID=0b7133f7d6374b23acbde92672342745] [collectionID=451838354629467151] [segmentIDs="[451838354632090846]"]
[2024/08/14 09:37:58.015 +00:00] [INFO] [segments/segment_loader.go:945] ["Successfully load pk stats"] [traceID=0b7133f7d6374b23acbde92672342745] [segmentID=451838354632090846] [time=9.151753ms] [size=34304]

@chyezh
Contributor

chyezh commented Aug 16, 2024

The segment load had already finished:

[2024/08/14 09:37:54.495 +00:00] [INFO] [querynodev2/services.go:492] ["load segments done..."] [traceID=5ed136892591447ab531c9fa37abd7d9] [collectionID=451838354629467151] [partitionID=451838354629467152] [shard=by-dev-rootcoord-dml_1_451838354629467151v0] [segmentID=451838354632090846] [level=L1] [currentNodeID=3] [segments="[451838354632090846]"]

What happens at 09:37:58 is loading the delete data.
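
The ordering discussed in the two comments above can be checked mechanically by sorting the log lines for one segment by timestamp. The following is a minimal sketch, not a Milvus tool: `segment_timeline` and `SAMPLE_LINES` are hypothetical names, and the sample lines are abbreviated copies of lines quoted in this thread, with most fields dropped.

```python
import re
from datetime import datetime

# Matches the prefix of the Go-side log lines quoted in this thread, e.g.
# [2024/08/14 09:37:56.144 +00:00] [DEBUG] [segments/segment.go:499] ["search segment..."] ...
LINE_RE = re.compile(
    r'^\[(?P<ts>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) [+-]\d{2}:\d{2}\] '
    r'\[\w+\] \[[^\]]+\] \["(?P<msg>[^"]*)"\]'
)


def segment_timeline(lines, segment_id):
    """Return (timestamp, message) pairs mentioning segment_id, oldest first."""
    events = []
    for line in lines:
        match = LINE_RE.match(line)
        if match is None or segment_id not in line:
            continue
        ts = datetime.strptime(match.group("ts"), "%Y/%m/%d %H:%M:%S.%f")
        events.append((ts, match.group("msg")))
    return sorted(events)


# Abbreviated copies of three lines quoted above (most fields dropped).
SAMPLE_LINES = [
    '[2024/08/14 09:37:58.006 +00:00] [INFO] [segments/segment_loader.go:541] ["start loading remote..."] [segmentIDs="[451838354632090846]"]',
    '[2024/08/14 09:37:56.144 +00:00] [DEBUG] [segments/segment.go:499] ["search segment..."] [segmentID=451838354632090846]',
    '[2024/08/14 09:37:54.495 +00:00] [INFO] [querynodev2/services.go:492] ["load segments done..."] [segmentID=451838354632090846]',
]

for ts, msg in segment_timeline(SAMPLE_LINES, "451838354632090846"):
    print(ts.time(), msg)
```

On these three lines the order is `load segments done...`, `search segment...`, `start loading remote...`, so the search ran after the first load had finished.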

@yanliang567 yanliang567 removed their assignment Aug 17, 2024
@xiaofan-luan
Contributor

Any progress?

@chyezh
Contributor

chyezh commented Aug 22, 2024

> Any progress?

Make asan available for milvus binary and image #35627, and I am trying to reproduce it.

@chyezh
Contributor

chyezh commented Aug 22, 2024

Some ODR violations (#35549, #35633) were also found and fixed (#35610),
but I am not sure whether they are related to this issue.

@chyezh
Contributor

chyezh commented Aug 23, 2024

Found an assertion failure while trying to reproduce:

milvus: /go/src/github.com/milvus-io/milvus/internal/core/src/exec/expression/EvalCtx.h:36: milvus::exec::EvalCtx::EvalCtx(milvus::exec::ExecContext*, milvus::exec::ExprSet*, milvus::RowVector*): Assertion `expr_set_ != nullptr' failed.

@chyezh
Contributor

chyezh commented Aug 28, 2024

> Found an assertion failure while trying to reproduce:
>
> milvus: /go/src/github.com/milvus-io/milvus/internal/core/src/exec/expression/EvalCtx.h:36: milvus::exec::EvalCtx::EvalCtx(milvus::exec::ExecContext*, milvus::exec::ExprSet*, milvus::RowVector*): Assertion `expr_set_ != nullptr' failed.

This is a separate, unrelated issue; see #35771. I will try to reproduce again after the fix.

@chyezh
Contributor

chyezh commented Aug 28, 2024

> @zhuwenxing
>
> Please make sure you are using the version with no clusterIP for the etcd kill test. I see some errors saying etcd is not connected. Check with @LoveEachDay and make sure you use the correct setup.
>
> Ideally we shouldn't see a panic when the etcd connection fails:
>
> [2024/08/14 09:08:21.394 +00:00] [DEBUG] [querynode/service.go:118] ["QueryNode connect to etcd failed"] [error="context deadline exceeded"]
> [2024/08/14 09:08:21.394 +00:00] [ERROR] [components/query_node.go:56] ["QueryNode starts error"] [error="context deadline exceeded"] [stack="github.com/milvus-io/milvus/cmd/components.(*QueryNode).Run\n\t/workspace/source/cmd/components/query_node.go:56\ngithub.com/milvus-io/milvus/cmd/roles.runComponent[...].func1\n\t/workspace/source/cmd/roles/roles.go:121"]
> panic: context deadline exceeded
>
> Still checking the SIGSEGV issue.

This happened during test initialization: etcd was not ready yet, and no etcd chaos had been injected.
It is therefore expected behavior.

[2024-08-14T09:07:23.151Z] + helm install --wait --debug --timeout 600s etcd-followers-pod-failure-17071 milvus/milvus --set image.all.repository=harbor.milvus.io/milvus/milvus --set image.all.tag=master-20240814-c42976ee-amd64 --set metrics.serviceMonitor.enabled=true --set etcd.metrics.enabled=true --set etcd.metrics.podMonitor.enabled=true --set etcd.metrics.podMonitor.namespace=chaos-testing --set quotaAndLimits.enabled=false -f ../cluster-values.yaml -n=chaos-testing
[2024-08-14T09:07:23.154Z] WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /root/.kube/config
[2024-08-14T09:07:23.154Z] WARNING: Kubernetes configuration file is world-readable. This is insecure. Location: /root/.kube/config
[2024-08-14T09:07:23.154Z] install.go:178: [debug] Original chart version: ""
[2024-08-14T09:07:24.083Z] install.go:195: [debug] CHART PATH: /root/.cache/helm/repository/milvus-4.2.4.tgz
[2024-08-14T09:07:24.083Z] 
[2024-08-14T09:07:25.011Z] client.go:128: [debug] creating 42 resource(s)
[2024-08-14T09:07:25.267Z] wait.go:48: [debug] beginning wait for 42 resources with timeout of 10m0s
[2024-08-14T09:07:26.191Z] ready.go:277: [debug] Deployment is not ready: chaos-testing/etcd-followers-pod-failure-17071-milvus-datanode. 0 out of 2 expected pods are ready
[2024-08-14T09:07:29.453Z] ready.go:277: [debug] Deployment is not ready: chaos-testing/etcd-followers-pod-failure-17071-milvus-datanode. 0 out of 2 expected pods are ready
[2024-08-14T09:07:31.970Z] ready.go:277: [debug] Deployment is not ready: chaos-testing/etcd-followers-pod-failure-17071-milvus-datanode. 0 out of 2 expected pods are ready
[2024-08-14T09:07:34.491Z] ready.go:277: [debug] Deployment is not ready: chaos-testing/etcd-followers-pod-failure-17071-milvus-datanode. 0 out of 2 expected pods are ready
[2024-08-14T09:07:37.757Z] ready.go:277: [debug] Deployment is not ready: chaos-testing/etcd-followers-pod-failure-17071-milvus-datanode. 0 out of 2 expected pods are ready
[2024-08-14T09:07:40.271Z] ready.go:277: [debug] Deployment is not ready: chaos-testing/etcd-followers-pod-failure-17071-milvus-datanode. 0 out of 2 expected pods are ready
[2024-08-14T09:07:43.550Z] ready.go:277: [debug] Deployment is not ready: chaos-testing/etcd-followers-pod-failure-17071-milvus-datanode. 0 out of 2 expected pods are ready
[2024-08-14T09:07:46.072Z] ready.go:277: [debug] Deployment is not ready: chaos-testing/etcd-followers-pod-failure-17071-milvus-datanode. 0 out of 2 expected pods are ready
[2024-08-14T09:07:48.643Z] ready.go:277: [debug] Deployment is not ready: chaos-testing/etcd-followers-pod-failure-17071-milvus-datanode. 0 out of 2 expected pods are ready
[2024-08-14T09:07:51.904Z] ready.go:277: [debug] Deployment is not ready: chaos-testing/etcd-followers-pod-failure-17071-milvus-datanode. 0 out of 2 expected pods are ready
[2024-08-14T09:07:54.417Z] ready.go:277: [debug] Deployment is not ready: chaos-testing/etcd-followers-pod-failure-17071-milvus-datanode. 0 out of 2 expected pods are ready
[2024-08-14T09:07:56.929Z] ready.go:277: [debug] Deployment is not ready: chaos-testing/etcd-followers-pod-failure-17071-milvus-datanode. 0 out of 2 expected pods are ready
[2024-08-14T09:08:00.194Z] ready.go:277: [debug] Deployment is not ready: chaos-testing/etcd-followers-pod-failure-17071-milvus-datanode. 0 out of 2 expected pods are ready
[2024-08-14T09:08:02.721Z] ready.go:277: [debug] Deployment is not ready: chaos-testing/etcd-followers-pod-failure-17071-milvus-datanode. 0 out of 2 expected pods are ready
[2024-08-14T09:08:05.544Z] ready.go:277: [debug] Deployment is not ready: chaos-testing/etcd-followers-pod-failure-17071-milvus-datanode. 0 out of 2 expected pods are ready
[2024-08-14T09:08:08.061Z] ready.go:277: [debug] Deployment is not ready: chaos-testing/etcd-followers-pod-failure-17071-milvus-datanode. 0 out of 2 expected pods are ready
[2024-08-14T09:08:11.335Z] ready.go:277: [debug] Deployment is not ready: chaos-testing/etcd-followers-pod-failure-17071-milvus-datanode. 0 out of 2 expected pods are ready
[2024-08-14T09:08:13.851Z] ready.go:277: [debug] Deployment is not ready: chaos-testing/etcd-followers-pod-failure-17071-milvus-datanode. 0 out of 2 expected pods are ready
[2024-08-14T09:08:17.119Z] ready.go:277: [debug] Deployment is not ready: chaos-testing/etcd-followers-pod-failure-17071-milvus-datanode. 0 out of 2 expected pods are ready
[2024-08-14T09:08:19.633Z] ready.go:277: [debug] Deployment is not ready: chaos-testing/etcd-followers-pod-failure-17071-milvus-datanode. 0 out of 2 expected pods are ready
[2024-08-14T09:08:22.152Z] ready.go:277: [debug] Deployment is not ready: chaos-testing/etcd-followers-pod-failure-17071-milvus-datanode. 0 out of 2 expected pods are ready
[2024-08-14T09:08:25.424Z] ready.go:277: [debug] Deployment is not ready: chaos-testing/etcd-followers-pod-failure-17071-milvus-datanode. 0 out of 2 expected pods are ready
[2024-08-14T09:08:27.939Z] ready.go:277: [debug] Deployment is not ready: chaos-testing/etcd-followers-pod-failure-17071-milvus-datanode. 0 out of 2 expected pods are ready
[2024-08-14T09:08:30.455Z] ready.go:277: [debug] Deployment is not ready: chaos-testing/etcd-followers-pod-failure-17071-milvus-datanode. 0 out of 2 expected pods are ready
[2024-08-14T09:08:33.720Z] ready.go:277: [debug] Deployment is not ready: chaos-testing/etcd-followers-pod-failure-17071-milvus-datanode. 0 out of 2 expected pods are ready
[2024-08-14T09:08:36.233Z] ready.go:277: [debug] Deployment is not ready: chaos-testing/etcd-followers-pod-failure-17071-milvus-datanode. 0 out of 2 expected pods are ready
[2024-08-14T09:08:39.498Z] ready.go:277: [debug] Deployment is not ready: chaos-testing/etcd-followers-pod-failure-17071-milvus-datanode. 0 out of 2 expected pods are ready
[2024-08-14T09:08:42.013Z] ready.go:277: [debug] Deployment is not ready: chaos-testing/etcd-followers-pod-failure-17071-milvus-datanode. 0 out of 2 expected pods are ready
[2024-08-14T09:08:44.525Z] ready.go:277: [debug] Deployment is not ready: chaos-testing/etcd-followers-pod-failure-17071-milvus-datanode. 0 out of 2 expected pods are ready
[2024-08-14T09:08:47.794Z] ready.go:277: [debug] Deployment is not ready: chaos-testing/etcd-followers-pod-failure-17071-milvus-datanode. 0 out of 2 expected pods are ready
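
The timing argument above can be checked by comparing the panic timestamp from the querynode log with the first and last lines of the quoted helm output. This is a minimal sketch: `during_install_wait` is a hypothetical helper, and it assumes Python 3.7 or later, where `%z` accepts both `+00:00` and `Z`.

```python
from datetime import datetime

MILVUS_FMT = "%Y/%m/%d %H:%M:%S.%f %z"   # e.g. 2024/08/14 09:08:21.394 +00:00
JENKINS_FMT = "%Y-%m-%dT%H:%M:%S.%f%z"   # e.g. 2024-08-14T09:07:23.151Z


def during_install_wait(event, install_start, last_not_ready):
    """True if a Milvus log timestamp falls inside the helm --wait window."""
    ev = datetime.strptime(event, MILVUS_FMT)
    start = datetime.strptime(install_start, JENKINS_FMT)
    end = datetime.strptime(last_not_ready, JENKINS_FMT)
    return start <= ev <= end


# Timestamps copied from the logs quoted in this thread.
print(during_install_wait(
    "2024/08/14 09:08:21.394 +00:00",   # panic: context deadline exceeded
    "2024-08-14T09:07:23.151Z",         # helm install --wait starts
    "2024-08-14T09:08:47.794Z",         # last quoted "Deployment is not ready"
))
```

This only shows that helm was still waiting for the deployment when the panic was logged. The time at which chaos was injected does not appear in the quoted log.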


stale bot commented Sep 29, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

@stale stale bot added the stale indicates no udpates for 30 days label Sep 29, 2024
@zhuwenxing
Contributor Author

Not reproduced
