[CI] RelocationIT.testIndexAndRelocateConcurrently timeouts #50508

Closed
andreidan opened this issue Dec 27, 2019 · 6 comments · Fixed by #50656
Assignees
Labels
:Distributed/Recovery Anything around constructing a new shard, either from a local or a remote source. >test-failure Triaged test failures from CI

Comments

@andreidan
Contributor

Times out with "timed out waiting for green state"

Reproduce line

./gradlew ':server:integTest' --tests "org.elasticsearch.recovery.RelocationIT.testIndexAndRelocateConcurrently" \
  -Dtests.seed=4E1BE2A47B1AF8FD \
  -Dtests.security.manager=true \
  -Dtests.locale=et \
  -Dtests.timezone=America/Nassau \
  -Dcompiler.java=13 \
  -Druntime.java=8

Build scans

[7.5.2] https://gradle-enterprise.elastic.co/s/spwlu6566rbue
[6.8.7] https://gradle-enterprise.elastic.co/s/evfxzrqeohj5q
[7.x] https://gradle-enterprise.elastic.co/s/ezlcf2b36tiew

This latest occurrence was found on 7.x - https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.x+multijob-unix-compatibility/os=ubuntu-16.04&&immutable/471/console - but this test times out every few days across the 7.x branches and master

The timeout was increased in #46554, which was merged on 11 September 2019, but the new timeout still isn't enough (maybe there is a different problem, or maybe the timeout configuration needs some attention to speed up the relocation?)

@andreidan andreidan added >test-failure Triaged test failures from CI :Distributed/Recovery Anything around constructing a new shard, either from a local or a remote source. labels Dec 27, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (:Distributed/Recovery)

@original-brownbear original-brownbear self-assigned this Dec 27, 2019
@original-brownbear
Member

I was able to reproduce this with the seed from the failure here, running the test in a loop with concurrent stress -m load on the machine.

=> failure log with DEBUG level logging (trying to interpret that now):

fail.log

@original-brownbear
Member

I can fairly reliably reproduce this on 7.x but not on master using stress. I don't have the knowledge to easily track down what fixed this in master (@dnhatn maybe you know what recent change could be behind this?), so I'm unassigning myself here.

@original-brownbear original-brownbear removed their assignment Dec 30, 2019
@dnhatn dnhatn self-assigned this Dec 30, 2019
@dnhatn
Member

dnhatn commented Dec 30, 2019

Thanks @original-brownbear :). I will take a look.

@ywelsch
Contributor

ywelsch commented Jan 6, 2020

I think I know what's causing the test failure here. An interesting observation though is that with the change to not refresh under the engine lock anymore, refreshes (in particular IndexWriter.publishFrozenUpdates) can cause the IndexWriter to register an AlreadyClosedException as a tragic exception when the IndexWriter is concurrently closed.

[2019-12-30T18:05:01,763][DEBUG][o.e.c.s.ClusterApplierService] [node_t0] apply cluster state with version 15
[2019-12-30T18:05:01,764][DEBUG][o.e.i.c.IndicesClusterStateService] [node_t0] [test][1] removing shard (stale allocation id, stale [test][1], node[eKhxth_dTCqGhhxTsupxBg], relocating [ad8ncChdTQWIfLP05f2aHQ], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=u2EniyReSeiasEwzL47gtQ, rId=Qc9uZ7Q6QLCMtn70qeyDDw], new [test][1], node[eKhxth_dTCqGhhxTsupxBg], relocating [ad8ncChdTQWIfLP05f2aHQ], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=jvoF3ohjSY2M3cC6TUuuoQ, rId=Qc9uZ7Q6QLCMtn70qeyDDw])
[2019-12-30T18:05:01,764][DEBUG][o.e.i.IndexService       ] [node_t0] [test] [1] closing... (reason: [removing shard (stale copy)])
[2019-12-30T18:05:01,764][DEBUG][o.e.i.r.RecoveryTarget   ] [node_t0] [test][1] recovery canceled (reason: [shard closed])
[2019-12-30T18:05:01,764][DEBUG][o.e.i.s.IndexShard       ] [node_t0] [test][1] state: [RECOVERING]->[CLOSED], reason [removing shard (stale copy)]
[2019-12-30T18:05:01,764][DEBUG][o.e.i.e.Engine           ] [node_t0] [test][1] close now acquiring writeLock
[2019-12-30T18:05:01,764][DEBUG][o.e.i.e.Engine           ] [node_t0] [test][1] close acquired writeLock
[2019-12-30T18:05:01,775][DEBUG][o.e.i.t.Translog         ] [node_t0] [test][1] translog closed
[2019-12-30T18:05:01,784][DEBUG][o.e.i.e.Engine           ] [node_t0] [test][1] engine closed [api]
[2019-12-30T18:05:01,784][DEBUG][o.e.i.s.Store            ] [node_t0] [test][1] store reference count on close: 2
[2019-12-30T18:05:01,784][DEBUG][o.e.i.IndexService       ] [node_t0] [test] [1] closed (reason: [removing shard (stale copy)])
[2019-12-30T18:05:01,785][DEBUG][o.e.i.c.IndicesClusterStateService] [node_t0] [test][1] creating shard with primary term [1]
[2019-12-30T18:05:01,790][WARN ][o.e.i.e.Engine           ] [node_t0] [test][1] failed engine [already closed by tragic event on the index writer]
org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed
	at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:681) ~[lucene-core-8.4.0-snapshot-08b8d116f8f.jar:8.4.0-snapshot-08b8d116f8f 08b8d116f8ffacf35a6b05ff4d37f2263b712347 - ivera - 2019-12-12 10:52:52]
	at org.apache.lucene.index.IndexWriter.getPooledInstance(IndexWriter.java:5233) ~[lucene-core-8.4.0-snapshot-08b8d116f8f.jar:8.4.0-snapshot-08b8d116f8f 08b8d116f8ffacf35a6b05ff4d37f2263b712347 - ivera - 2019-12-12 10:52:52]
	at org.apache.lucene.index.FrozenBufferedUpdates.openSegmentStates(FrozenBufferedUpdates.java:328) ~[lucene-core-8.4.0-snapshot-08b8d116f8f.jar:8.4.0-snapshot-08b8d116f8f 08b8d116f8ffacf35a6b05ff4d37f2263b712347 - ivera - 2019-12-12 10:52:52]
	at org.apache.lucene.index.FrozenBufferedUpdates.forceApply(FrozenBufferedUpdates.java:221) ~[lucene-core-8.4.0-snapshot-08b8d116f8f.jar:8.4.0-snapshot-08b8d116f8f 08b8d116f8ffacf35a6b05ff4d37f2263b712347 - ivera - 2019-12-12 10:52:52]
	at org.apache.lucene.index.FrozenBufferedUpdates.tryApply(FrozenBufferedUpdates.java:159) ~[lucene-core-8.4.0-snapshot-08b8d116f8f.jar:8.4.0-snapshot-08b8d116f8f 08b8d116f8ffacf35a6b05ff4d37f2263b712347 - ivera - 2019-12-12 10:52:52]
	at org.apache.lucene.index.IndexWriter.lambda$publishFrozenUpdates$3(IndexWriter.java:2592) ~[lucene-core-8.4.0-snapshot-08b8d116f8f.jar:8.4.0-snapshot-08b8d116f8f 08b8d116f8ffacf35a6b05ff4d37f2263b712347 - ivera - 2019-12-12 10:52:52]
	at org.apache.lucene.index.IndexWriter.processEvents(IndexWriter.java:5116) ~[lucene-core-8.4.0-snapshot-08b8d116f8f.jar:8.4.0-snapshot-08b8d116f8f 08b8d116f8ffacf35a6b05ff4d37f2263b712347 - ivera - 2019-12-12 10:52:52]
	at org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:507) ~[lucene-core-8.4.0-snapshot-08b8d116f8f.jar:8.4.0-snapshot-08b8d116f8f 08b8d116f8ffacf35a6b05ff4d37f2263b712347 - ivera - 2019-12-12 10:52:52]
	at org.apache.lucene.index.StandardDirectoryReader.doOpenFromWriter(StandardDirectoryReader.java:297) ~[lucene-core-8.4.0-snapshot-08b8d116f8f.jar:8.4.0-snapshot-08b8d116f8f 08b8d116f8ffacf35a6b05ff4d37f2263b712347 - ivera - 2019-12-12 10:52:52]
	at org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:272) ~[lucene-core-8.4.0-snapshot-08b8d116f8f.jar:8.4.0-snapshot-08b8d116f8f 08b8d116f8ffacf35a6b05ff4d37f2263b712347 - ivera - 2019-12-12 10:52:52]
	at org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:262) ~[lucene-core-8.4.0-snapshot-08b8d116f8f.jar:8.4.0-snapshot-08b8d116f8f 08b8d116f8ffacf35a6b05ff4d37f2263b712347 - ivera - 2019-12-12 10:52:52]
	at org.apache.lucene.index.FilterDirectoryReader.doOpenIfChanged(FilterDirectoryReader.java:112) ~[lucene-core-8.4.0-snapshot-08b8d116f8f.jar:8.4.0-snapshot-08b8d116f8f 08b8d116f8ffacf35a6b05ff4d37f2263b712347 - ivera - 2019-12-12 10:52:52]
	at org.apache.lucene.index.DirectoryReader.openIfChanged(DirectoryReader.java:165) ~[lucene-core-8.4.0-snapshot-08b8d116f8f.jar:8.4.0-snapshot-08b8d116f8f 08b8d116f8ffacf35a6b05ff4d37f2263b712347 - ivera - 2019-12-12 10:52:52]
	at org.elasticsearch.index.engine.ElasticsearchReaderManager.refreshIfNeeded(ElasticsearchReaderManager.java:66) ~[main/:?]
	at org.elasticsearch.index.engine.ElasticsearchReaderManager.refreshIfNeeded(ElasticsearchReaderManager.java:40) ~[main/:?]
	at org.apache.lucene.search.ReferenceManager.doMaybeRefresh(ReferenceManager.java:176) ~[lucene-core-8.4.0-snapshot-08b8d116f8f.jar:8.4.0-snapshot-08b8d116f8f 08b8d116f8ffacf35a6b05ff4d37f2263b712347 - ivera - 2019-12-12 10:52:52]
	at org.apache.lucene.search.ReferenceManager.maybeRefreshBlocking(ReferenceManager.java:253) ~[lucene-core-8.4.0-snapshot-08b8d116f8f.jar:8.4.0-snapshot-08b8d116f8f 08b8d116f8ffacf35a6b05ff4d37f2263b712347 - ivera - 2019-12-12 10:52:52]
	at org.elasticsearch.index.engine.InternalEngine$ExternalReaderManager.refreshIfNeeded(InternalEngine.java:341) ~[main/:?]
	at org.elasticsearch.index.engine.InternalEngine$ExternalReaderManager.refreshIfNeeded(InternalEngine.java:323) ~[main/:?]
	at org.apache.lucene.search.ReferenceManager.doMaybeRefresh(ReferenceManager.java:176) ~[lucene-core-8.4.0-snapshot-08b8d116f8f.jar:8.4.0-snapshot-08b8d116f8f 08b8d116f8ffacf35a6b05ff4d37f2263b712347 - ivera - 2019-12-12 10:52:52]
	at org.apache.lucene.search.ReferenceManager.maybeRefreshBlocking(ReferenceManager.java:253) ~[lucene-core-8.4.0-snapshot-08b8d116f8f.jar:8.4.0-snapshot-08b8d116f8f 08b8d116f8ffacf35a6b05ff4d37f2263b712347 - ivera - 2019-12-12 10:52:52]
	at org.elasticsearch.index.engine.InternalEngine.refresh(InternalEngine.java:1605) [main/:?]
	at org.elasticsearch.index.engine.InternalEngine.refresh(InternalEngine.java:1582) [main/:?]
	at org.elasticsearch.index.shard.IndexShard.finalizeRecovery(IndexShard.java:1705) [main/:?]
	at org.elasticsearch.indices.recovery.RecoveryTarget.lambda$finalizeRecovery$1(RecoveryTarget.java:313) [main/:?]
	at org.elasticsearch.action.ActionListener.completeWith(ActionListener.java:285) [main/:?]
	at org.elasticsearch.indices.recovery.RecoveryTarget.finalizeRecovery(RecoveryTarget.java:294) [main/:?]
	at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$FinalizeRecoveryRequestHandler.messageReceived(PeerRecoveryTargetService.java:395) [main/:?]
	at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$FinalizeRecoveryRequestHandler.messageReceived(PeerRecoveryTargetService.java:389) [main/:?]
	at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:63) [main/:?]
	at org.elasticsearch.transport.InboundHandler$RequestHandler.doRun(InboundHandler.java:264) [main/:?]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:692) [main/:?]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [main/:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
	at java.lang.Thread.run(Thread.java:835) [?:?]
[2019-12-30T18:05:01,805][DEBUG][o.e.i.IndexService       ] [node_t0] [test] [test][1] loaded data path [/tmp/org.elasticsearch.recovery.RelocationIT_4E1BE2A47B1AF8FD-026/tempDir-002/node_t0-shared/rqjdpeSeGW/0/Li0OFzARRZKiZEcEi5bJFg/1], state path [/tmp/org.elasticsearch.recovery.RelocationIT_4E1BE2A47B1AF8FD-026/tempDir-002/node_t0/nodes/0/indices/Li0OFzARRZKiZEcEi5bJFg/1]
[2019-12-30T18:05:01,805][DEBUG][o.e.i.IndexService       ] [node_t0] [test] [test][1] creating using an existing path [ShardPath{path=/tmp/org.elasticsearch.recovery.RelocationIT_4E1BE2A47B1AF8FD-026/tempDir-002/node_t0-shared/rqjdpeSeGW/0/Li0OFzARRZKiZEcEi5bJFg/1, shard=[test][1]}]
[2019-12-30T18:05:01,805][DEBUG][o.e.i.IndexService       ] [node_t0] [test] creating shard_id [test][1]
[2019-12-30T18:05:01,807][DEBUG][o.e.i.s.Store            ] [node_t0] [test][1] store stats are refreshed with refresh_interval [10s]
[2019-12-30T18:05:01,808][DEBUG][o.e.i.s.IndexShard       ] [node_t0] [test][1] state: [CREATED]
[2019-12-30T18:05:01,813][DEBUG][o.e.i.s.IndexShard       ] [node_t0] [test][1] state: [CREATED]->[RECOVERING], reason [from {node_t1}{jR37Lc8RQoin4NKVawanrg}{8LSM6L-ZRuyoUHOwe6onzw}{127.0.0.1}{127.0.0.1:41933}{dim}{color=blue}]
[2019-12-30T18:05:01,830][DEBUG][o.e.i.IndexService       ] [node_t0] [test] [1] closing... (reason: [shard failure, reason [already closed by tragic event on the index writer]])
[2019-12-30T18:05:01,831][DEBUG][o.e.i.r.RecoveryTarget   ] [node_t0] [test][1] recovery canceled (reason: [shard closed])
[2019-12-30T18:05:01,831][DEBUG][o.e.i.s.IndexShard       ] [node_t0] [test][1] state: [RECOVERING]->[CLOSED], reason [shard failure, reason [already closed by tragic event on the index writer]]
[2019-12-30T18:05:01,831][DEBUG][o.e.i.s.Store            ] [node_t0] [test][1] store reference count on close: 1
[2019-12-30T18:05:01,831][DEBUG][o.e.i.IndexService       ] [node_t0] [test] [1] closed (reason: [shard failure, reason [already closed by tragic event on the index writer]])

Maybe something to adapt in Lucene so that concurrent closing is not treated as a fatal exception...
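
For illustration, here is a minimal standalone sketch against the plain Lucene API (not the Elasticsearch code path; the class and variable names are my own) showing the shape of the problem: a refresh that has to consult an IndexWriter which was concurrently closed surfaces as an AlreadyClosedException. Whether the writer additionally records this as a tragic exception is specific to the path through publishFrozenUpdates above and is not reproduced here.

import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.AlreadyClosedException;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class ConcurrentCloseRefreshSketch {
    public static void main(String[] args) throws Exception {
        try (Directory dir = new ByteBuffersDirectory()) {
            IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig());
            writer.addDocument(new Document());
            DirectoryReader reader = DirectoryReader.open(writer);
            // make the reader stale so that a refresh has to go back to the writer
            writer.addDocument(new Document());

            // close the writer on another thread, as the shard-close path does
            Thread closer = new Thread(() -> {
                try {
                    writer.close();
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            });
            closer.start();
            closer.join();

            try {
                // refreshing against the now-closed writer fails
                DirectoryReader.openIfChanged(reader);
            } catch (AlreadyClosedException e) {
                System.out.println("refresh hit: " + e);
            }
            reader.close();
        }
    }
}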

Now on to the real issue here:

It looks like a shard is reallocated to a node that was previously hosting a recovering copy of the shard. IndicesClusterStateService shuts down the prior shard copy and then reallocates the new shard copy. The prior shard copy was recovering and got failed with a tragedy due to the above issue (i.e. treating an AlreadyClosedException as a tragedy). When this got (asynchronously) bubbled up to IndicesClusterStateService.handleRecoveryFailure, it removed the newly allocating shard copy (which has a different allocation id), but notified the master only about failing the old copy. The master receives this message and goes:

[2019-12-30T18:05:01,923][DEBUG][o.e.c.a.s.ShardStateAction] [node_t2] [test][1] ignoring shard failed task [shard id [[test][1]], allocation id [u2EniyReSeiasEwzL47gtQ], primary term [0], message [shard failure, reason [already closed by tragic event on the index writer]], failure [AlreadyClosedException[this IndexWriter is closed]], markAsStale [true]] (shard does not exist anymore)

i.e. thinks nothing needs to be done. This means we ultimately end up in a situation where the master thinks the new shard copy is still recovering, but it has been removed on the node.

I will open a fix for this.
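
To make the race concrete, here is a much-simplified, hypothetical sketch; the class and method names (ShardCopy, applyClusterState, handleRecoveryFailureRacy, notifyMasterShardFailed) are illustrative only and are not the actual Elasticsearch classes. The "checked" variant shows one way to avoid removing the wrong copy, by comparing allocation ids before acting on a failure; it is not necessarily how the actual fix is implemented.

import java.util.Map;
import java.util.Objects;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the race described above; not the actual Elasticsearch code.
class RecoveryFailureRaceSketch {

    static final class ShardCopy {
        final String allocationId;
        ShardCopy(String allocationId) {
            this.allocationId = allocationId;
        }
    }

    // at most one shard copy per shard id on this node
    private final Map<String, ShardCopy> localShards = new ConcurrentHashMap<>();

    // Applying the new cluster state: remove the stale copy (whose close may still
    // fail asynchronously with the tragic AlreadyClosedException above) and create
    // a new copy of the same shard under a different allocation id.
    void applyClusterState(String shardId, String newAllocationId) {
        localShards.remove(shardId);
        localShards.put(shardId, new ShardCopy(newAllocationId));
    }

    // Racy behaviour: the delayed failure of the old copy removes whatever copy is
    // currently registered under the shard id, i.e. the new one, while the master is
    // only told about the old allocation id, replies "shard does not exist anymore",
    // and keeps the new copy marked as INITIALIZING.
    void handleRecoveryFailureRacy(String shardId, String failedAllocationId) {
        localShards.remove(shardId);
        notifyMasterShardFailed(shardId, failedAllocationId);
    }

    // One way to avoid the race: only act if the registered copy really is the one
    // that failed (allocation ids match); otherwise the failure is stale and there
    // is nothing to remove locally.
    void handleRecoveryFailureChecked(String shardId, String failedAllocationId) {
        ShardCopy current = localShards.get(shardId);
        if (current != null && Objects.equals(current.allocationId, failedAllocationId)) {
            localShards.remove(shardId);
            notifyMasterShardFailed(shardId, failedAllocationId);
        }
    }

    private void notifyMasterShardFailed(String shardId, String allocationId) {
        System.out.println("shard-failed -> master: " + shardId + " / allocation " + allocationId);
    }
}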

@ywelsch ywelsch assigned ywelsch and unassigned dnhatn Jan 6, 2020
ywelsch added a commit that referenced this issue Jan 7, 2020
A failure of a recovering shard can race with a new allocation of the shard and cause the new
allocation to be failed as well. This can result in a shard being marked as initializing in the
cluster state but no longer existing on the node.

Closes #50508
ywelsch added a commit that referenced this issue Jan 7, 2020
A failure of a recovering shard can race with a new allocation of the shard and cause the new
allocation to be failed as well. This can result in a shard being marked as initializing in the
cluster state but no longer existing on the node.

Closes #50508
ywelsch added a commit that referenced this issue Jan 7, 2020
A failure of a recovering shard can race with a new allocation of the shard and cause the new
allocation to be failed as well. This can result in a shard being marked as initializing in the
cluster state but no longer existing on the node.

Closes #50508
SivagurunathanV pushed a commit to SivagurunathanV/elasticsearch that referenced this issue Jan 23, 2020
A failure of a recovering shard can race with a new allocation of the shard and cause the new
allocation to be failed as well. This can result in a shard being marked as initializing in the
cluster state but no longer existing on the node.

Closes elastic#50508