[CI] RelocationIT.testIndexAndRelocateConcurrently timeouts #50508

Closed
andreidan opened this issue Dec 27, 2019 · 6 comments · Fixed by #50656
Assignees
Labels
:Distributed/Recovery Anything around constructing a new shard, either from a local or a remote source. >test-failure Triaged test failures from CI

Comments

@andreidan
Contributor

Times out with "timed out waiting for green state"

Reproduce line

./gradlew ':server:integTest' --tests "org.elasticsearch.recovery.RelocationIT.testIndexAndRelocateConcurrently" \
  -Dtests.seed=4E1BE2A47B1AF8FD \
  -Dtests.security.manager=true \
  -Dtests.locale=et \
  -Dtests.timezone=America/Nassau \
  -Dcompiler.java=13 \
  -Druntime.java=8

Build scans

[7.5.2] https://gradle-enterprise.elastic.co/s/spwlu6566rbue
[6.8.7] https://gradle-enterprise.elastic.co/s/evfxzrqeohj5q
[7.x] https://gradle-enterprise.elastic.co/s/ezlcf2b36tiew

This latest occurrence was found on 7.x - https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.x+multijob-unix-compatibility/os=ubuntu-16.04&&immutable/471/console - but this test times out every few days across the 7.x branches and master

The timeout was increased in #46554, which was merged on 11 September 2019, but the new timeout still isn't enough (maybe there is a different problem, or maybe the timeout configuration needs some attention to speed up the relocation?)

@andreidan andreidan added >test-failure Triaged test failures from CI :Distributed/Recovery Anything around constructing a new shard, either from a local or a remote source. labels Dec 27, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (:Distributed/Recovery)

@original-brownbear original-brownbear self-assigned this Dec 27, 2019
@original-brownbear
Member

I was able to reproduce this with the seed from the failure here, running the test in a loop with concurrent stress -m load on the machine.

=> failure log with DEBUG level logging (trying to interpret that now):

fail.log

@original-brownbear
Member

I can fairly reliably reproduce this on 7.x but not on master using stress. I don't have the knowledge to easily track down what fixed this in master (@dnhatn maybe you know what recent change could be behind this?), so I'm unassigning myself here.

@original-brownbear original-brownbear removed their assignment Dec 30, 2019
@dnhatn dnhatn self-assigned this Dec 30, 2019
@dnhatn
Member

dnhatn commented Dec 30, 2019

Thanks @original-brownbear :). I will take a look.

@ywelsch
Contributor

ywelsch commented Jan 6, 2020

I think I know what's causing the test failure here. An interesting observation though is that with the change to not refresh under the engine lock anymore, refreshes (in particular IndexWriter.publishFrozenUpdates) can cause the IndexWriter to register an AlreadyClosedException as a tragic exception when the IndexWriter is concurrently closed.

[2019-12-30T18:05:01,763][DEBUG][o.e.c.s.ClusterApplierService] [node_t0] apply cluster state with version 15
[2019-12-30T18:05:01,764][DEBUG][o.e.i.c.IndicesClusterStateService] [node_t0] [test][1] removing shard (stale allocation id, stale [test][1], node[eKhxth_dTCqGhhxTsupxBg], relocating [ad8ncChdTQWIfLP05f2aHQ], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=u2EniyReSeiasEwzL47gtQ, rId=Qc9uZ7Q6QLCMtn70qeyDDw], new [test][1], node[eKhxth_dTCqGhhxTsupxBg], relocating [ad8ncChdTQWIfLP05f2aHQ], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=jvoF3ohjSY2M3cC6TUuuoQ, rId=Qc9uZ7Q6QLCMtn70qeyDDw])
[2019-12-30T18:05:01,764][DEBUG][o.e.i.IndexService       ] [node_t0] [test] [1] closing... (reason: [removing shard (stale copy)])
[2019-12-30T18:05:01,764][DEBUG][o.e.i.r.RecoveryTarget   ] [node_t0] [test][1] recovery canceled (reason: [shard closed])
[2019-12-30T18:05:01,764][DEBUG][o.e.i.s.IndexShard       ] [node_t0] [test][1] state: [RECOVERING]->[CLOSED], reason [removing shard (stale copy)]
[2019-12-30T18:05:01,764][DEBUG][o.e.i.e.Engine           ] [node_t0] [test][1] close now acquiring writeLock
[2019-12-30T18:05:01,764][DEBUG][o.e.i.e.Engine           ] [node_t0] [test][1] close acquired writeLock
[2019-12-30T18:05:01,775][DEBUG][o.e.i.t.Translog         ] [node_t0] [test][1] translog closed
[2019-12-30T18:05:01,784][DEBUG][o.e.i.e.Engine           ] [node_t0] [test][1] engine closed [api]
[2019-12-30T18:05:01,784][DEBUG][o.e.i.s.Store            ] [node_t0] [test][1] store reference count on close: 2
[2019-12-30T18:05:01,784][DEBUG][o.e.i.IndexService       ] [node_t0] [test] [1] closed (reason: [removing shard (stale copy)])
[2019-12-30T18:05:01,785][DEBUG][o.e.i.c.IndicesClusterStateService] [node_t0] [test][1] creating shard with primary term [1]
[2019-12-30T18:05:01,790][WARN ][o.e.i.e.Engine           ] [node_t0] [test][1] failed engine [already closed by tragic event on the index writer]
org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed
	at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:681) ~[lucene-core-8.4.0-snapshot-08b8d116f8f.jar:8.4.0-snapshot-08b8d116f8f 08b8d116f8ffacf35a6b05ff4d37f2263b712347 - ivera - 2019-12-12 10:52:52]
	at org.apache.lucene.index.IndexWriter.getPooledInstance(IndexWriter.java:5233) ~[lucene-core-8.4.0-snapshot-08b8d116f8f.jar:8.4.0-snapshot-08b8d116f8f 08b8d116f8ffacf35a6b05ff4d37f2263b712347 - ivera - 2019-12-12 10:52:52]
	at org.apache.lucene.index.FrozenBufferedUpdates.openSegmentStates(FrozenBufferedUpdates.java:328) ~[lucene-core-8.4.0-snapshot-08b8d116f8f.jar:8.4.0-snapshot-08b8d116f8f 08b8d116f8ffacf35a6b05ff4d37f2263b712347 - ivera - 2019-12-12 10:52:52]
	at org.apache.lucene.index.FrozenBufferedUpdates.forceApply(FrozenBufferedUpdates.java:221) ~[lucene-core-8.4.0-snapshot-08b8d116f8f.jar:8.4.0-snapshot-08b8d116f8f 08b8d116f8ffacf35a6b05ff4d37f2263b712347 - ivera - 2019-12-12 10:52:52]
	at org.apache.lucene.index.FrozenBufferedUpdates.tryApply(FrozenBufferedUpdates.java:159) ~[lucene-core-8.4.0-snapshot-08b8d116f8f.jar:8.4.0-snapshot-08b8d116f8f 08b8d116f8ffacf35a6b05ff4d37f2263b712347 - ivera - 2019-12-12 10:52:52]
	at org.apache.lucene.index.IndexWriter.lambda$publishFrozenUpdates$3(IndexWriter.java:2592) ~[lucene-core-8.4.0-snapshot-08b8d116f8f.jar:8.4.0-snapshot-08b8d116f8f 08b8d116f8ffacf35a6b05ff4d37f2263b712347 - ivera - 2019-12-12 10:52:52]
	at org.apache.lucene.index.IndexWriter.processEvents(IndexWriter.java:5116) ~[lucene-core-8.4.0-snapshot-08b8d116f8f.jar:8.4.0-snapshot-08b8d116f8f 08b8d116f8ffacf35a6b05ff4d37f2263b712347 - ivera - 2019-12-12 10:52:52]
	at org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:507) ~[lucene-core-8.4.0-snapshot-08b8d116f8f.jar:8.4.0-snapshot-08b8d116f8f 08b8d116f8ffacf35a6b05ff4d37f2263b712347 - ivera - 2019-12-12 10:52:52]
	at org.apache.lucene.index.StandardDirectoryReader.doOpenFromWriter(StandardDirectoryReader.java:297) ~[lucene-core-8.4.0-snapshot-08b8d116f8f.jar:8.4.0-snapshot-08b8d116f8f 08b8d116f8ffacf35a6b05ff4d37f2263b712347 - ivera - 2019-12-12 10:52:52]
	at org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:272) ~[lucene-core-8.4.0-snapshot-08b8d116f8f.jar:8.4.0-snapshot-08b8d116f8f 08b8d116f8ffacf35a6b05ff4d37f2263b712347 - ivera - 2019-12-12 10:52:52]
	at org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:262) ~[lucene-core-8.4.0-snapshot-08b8d116f8f.jar:8.4.0-snapshot-08b8d116f8f 08b8d116f8ffacf35a6b05ff4d37f2263b712347 - ivera - 2019-12-12 10:52:52]
	at org.apache.lucene.index.FilterDirectoryReader.doOpenIfChanged(FilterDirectoryReader.java:112) ~[lucene-core-8.4.0-snapshot-08b8d116f8f.jar:8.4.0-snapshot-08b8d116f8f 08b8d116f8ffacf35a6b05ff4d37f2263b712347 - ivera - 2019-12-12 10:52:52]
	at org.apache.lucene.index.DirectoryReader.openIfChanged(DirectoryReader.java:165) ~[lucene-core-8.4.0-snapshot-08b8d116f8f.jar:8.4.0-snapshot-08b8d116f8f 08b8d116f8ffacf35a6b05ff4d37f2263b712347 - ivera - 2019-12-12 10:52:52]
	at org.elasticsearch.index.engine.ElasticsearchReaderManager.refreshIfNeeded(ElasticsearchReaderManager.java:66) ~[main/:?]
	at org.elasticsearch.index.engine.ElasticsearchReaderManager.refreshIfNeeded(ElasticsearchReaderManager.java:40) ~[main/:?]
	at org.apache.lucene.search.ReferenceManager.doMaybeRefresh(ReferenceManager.java:176) ~[lucene-core-8.4.0-snapshot-08b8d116f8f.jar:8.4.0-snapshot-08b8d116f8f 08b8d116f8ffacf35a6b05ff4d37f2263b712347 - ivera - 2019-12-12 10:52:52]
	at org.apache.lucene.search.ReferenceManager.maybeRefreshBlocking(ReferenceManager.java:253) ~[lucene-core-8.4.0-snapshot-08b8d116f8f.jar:8.4.0-snapshot-08b8d116f8f 08b8d116f8ffacf35a6b05ff4d37f2263b712347 - ivera - 2019-12-12 10:52:52]
	at org.elasticsearch.index.engine.InternalEngine$ExternalReaderManager.refreshIfNeeded(InternalEngine.java:341) ~[main/:?]
	at org.elasticsearch.index.engine.InternalEngine$ExternalReaderManager.refreshIfNeeded(InternalEngine.java:323) ~[main/:?]
	at org.apache.lucene.search.ReferenceManager.doMaybeRefresh(ReferenceManager.java:176) ~[lucene-core-8.4.0-snapshot-08b8d116f8f.jar:8.4.0-snapshot-08b8d116f8f 08b8d116f8ffacf35a6b05ff4d37f2263b712347 - ivera - 2019-12-12 10:52:52]
	at org.apache.lucene.search.ReferenceManager.maybeRefreshBlocking(ReferenceManager.java:253) ~[lucene-core-8.4.0-snapshot-08b8d116f8f.jar:8.4.0-snapshot-08b8d116f8f 08b8d116f8ffacf35a6b05ff4d37f2263b712347 - ivera - 2019-12-12 10:52:52]
	at org.elasticsearch.index.engine.InternalEngine.refresh(InternalEngine.java:1605) [main/:?]
	at org.elasticsearch.index.engine.InternalEngine.refresh(InternalEngine.java:1582) [main/:?]
	at org.elasticsearch.index.shard.IndexShard.finalizeRecovery(IndexShard.java:1705) [main/:?]
	at org.elasticsearch.indices.recovery.RecoveryTarget.lambda$finalizeRecovery$1(RecoveryTarget.java:313) [main/:?]
	at org.elasticsearch.action.ActionListener.completeWith(ActionListener.java:285) [main/:?]
	at org.elasticsearch.indices.recovery.RecoveryTarget.finalizeRecovery(RecoveryTarget.java:294) [main/:?]
	at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$FinalizeRecoveryRequestHandler.messageReceived(PeerRecoveryTargetService.java:395) [main/:?]
	at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$FinalizeRecoveryRequestHandler.messageReceived(PeerRecoveryTargetService.java:389) [main/:?]
	at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:63) [main/:?]
	at org.elasticsearch.transport.InboundHandler$RequestHandler.doRun(InboundHandler.java:264) [main/:?]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:692) [main/:?]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [main/:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
	at java.lang.Thread.run(Thread.java:835) [?:?]
[2019-12-30T18:05:01,805][DEBUG][o.e.i.IndexService       ] [node_t0] [test] [test][1] loaded data path [/tmp/org.elasticsearch.recovery.RelocationIT_4E1BE2A47B1AF8FD-026/tempDir-002/node_t0-shared/rqjdpeSeGW/0/Li0OFzARRZKiZEcEi5bJFg/1], state path [/tmp/org.elasticsearch.recovery.RelocationIT_4E1BE2A47B1AF8FD-026/tempDir-002/node_t0/nodes/0/indices/Li0OFzARRZKiZEcEi5bJFg/1]
[2019-12-30T18:05:01,805][DEBUG][o.e.i.IndexService       ] [node_t0] [test] [test][1] creating using an existing path [ShardPath{path=/tmp/org.elasticsearch.recovery.RelocationIT_4E1BE2A47B1AF8FD-026/tempDir-002/node_t0-shared/rqjdpeSeGW/0/Li0OFzARRZKiZEcEi5bJFg/1, shard=[test][1]}]
[2019-12-30T18:05:01,805][DEBUG][o.e.i.IndexService       ] [node_t0] [test] creating shard_id [test][1]
[2019-12-30T18:05:01,807][DEBUG][o.e.i.s.Store            ] [node_t0] [test][1] store stats are refreshed with refresh_interval [10s]
[2019-12-30T18:05:01,808][DEBUG][o.e.i.s.IndexShard       ] [node_t0] [test][1] state: [CREATED]
[2019-12-30T18:05:01,813][DEBUG][o.e.i.s.IndexShard       ] [node_t0] [test][1] state: [CREATED]->[RECOVERING], reason [from {node_t1}{jR37Lc8RQoin4NKVawanrg}{8LSM6L-ZRuyoUHOwe6onzw}{127.0.0.1}{127.0.0.1:41933}{dim}{color=blue}]
[2019-12-30T18:05:01,830][DEBUG][o.e.i.IndexService       ] [node_t0] [test] [1] closing... (reason: [shard failure, reason [already closed by tragic event on the index writer]])
[2019-12-30T18:05:01,831][DEBUG][o.e.i.r.RecoveryTarget   ] [node_t0] [test][1] recovery canceled (reason: [shard closed])
[2019-12-30T18:05:01,831][DEBUG][o.e.i.s.IndexShard       ] [node_t0] [test][1] state: [RECOVERING]->[CLOSED], reason [shard failure, reason [already closed by tragic event on the index writer]]
[2019-12-30T18:05:01,831][DEBUG][o.e.i.s.Store            ] [node_t0] [test][1] store reference count on close: 1
[2019-12-30T18:05:01,831][DEBUG][o.e.i.IndexService       ] [node_t0] [test] [1] closed (reason: [shard failure, reason [already closed by tragic event on the index writer]])

Maybe something to adapt in Lucene so that concurrent closing is not treated as a fatal exception...
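
For illustration, here is a minimal standalone sketch against the plain Lucene API (not the Elasticsearch code path; the class and variable names are my own) showing the shape of the problem: a refresh that has to consult an IndexWriter which was concurrently closed surfaces as an AlreadyClosedException. Whether the writer additionally records this as a tragic exception is specific to the path through publishFrozenUpdates above and is not reproduced here.

import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.AlreadyClosedException;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class ConcurrentCloseRefreshSketch {
    public static void main(String[] args) throws Exception {
        try (Directory dir = new ByteBuffersDirectory()) {
            IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig());
            writer.addDocument(new Document());
            DirectoryReader reader = DirectoryReader.open(writer);
            // make the reader stale so that a refresh has to go back to the writer
            writer.addDocument(new Document());

            // close the writer on another thread, as the shard-close path does
            Thread closer = new Thread(() -> {
                try {
                    writer.close();
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            });
            closer.start();
            closer.join();

            try {
                // refreshing against the now-closed writer fails
                DirectoryReader.openIfChanged(reader);
            } catch (AlreadyClosedException e) {
                System.out.println("refresh hit: " + e);
            }
            reader.close();
        }
    }
}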

Now on to the real issue here:

It looks like a shard is reallocated to a node that was previously hosting a recovering copy of the shard. IndicesClusterStateService shuts down the prior shard copy and then reallocates the new shard copy. The prior shard copy was recovering and got failed with a tragedy due to the above issue (i.e. treating an AlreadyClosedException as a tragedy). When this got (asynchronously) bubbled up to IndicesClusterStateService.handleRecoveryFailure, it removed the newly allocating shard copy (which has a different allocation id), but notified the master only about failing the old copy. The master receives this message and goes:

[2019-12-30T18:05:01,923][DEBUG][o.e.c.a.s.ShardStateAction] [node_t2] [test][1] ignoring shard failed task [shard id [[test][1]], allocation id [u2EniyReSeiasEwzL47gtQ], primary term [0], message [shard failure, reason [already closed by tragic event on the index writer]], failure [AlreadyClosedException[this IndexWriter is closed]], markAsStale [true]] (shard does not exist anymore)

i.e. thinks nothing needs to be done. This means we ultimately end up in a situation where the master thinks the new shard copy is still recovering, but it has been removed on the node.

I will open a fix for this.
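
To make the race concrete, here is a much-simplified, hypothetical sketch; the class and method names (ShardCopy, applyClusterState, handleRecoveryFailureRacy, notifyMasterShardFailed) are illustrative only and are not the actual Elasticsearch classes. The "checked" variant shows one way to avoid removing the wrong copy, by comparing allocation ids before acting on a failure; it is not necessarily how the actual fix is implemented.

import java.util.Map;
import java.util.Objects;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the race described above; not the actual Elasticsearch code.
class RecoveryFailureRaceSketch {

    static final class ShardCopy {
        final String allocationId;
        ShardCopy(String allocationId) {
            this.allocationId = allocationId;
        }
    }

    // at most one shard copy per shard id on this node
    private final Map<String, ShardCopy> localShards = new ConcurrentHashMap<>();

    // Applying the new cluster state: remove the stale copy (whose close may still
    // fail asynchronously with the tragic AlreadyClosedException above) and create
    // a new copy of the same shard under a different allocation id.
    void applyClusterState(String shardId, String newAllocationId) {
        localShards.remove(shardId);
        localShards.put(shardId, new ShardCopy(newAllocationId));
    }

    // Racy behaviour: the delayed failure of the old copy removes whatever copy is
    // currently registered under the shard id, i.e. the new one, while the master is
    // only told about the old allocation id, replies "shard does not exist anymore",
    // and keeps the new copy marked as INITIALIZING.
    void handleRecoveryFailureRacy(String shardId, String failedAllocationId) {
        localShards.remove(shardId);
        notifyMasterShardFailed(shardId, failedAllocationId);
    }

    // One way to avoid the race: only act if the registered copy really is the one
    // that failed (allocation ids match); otherwise the failure is stale and there
    // is nothing to remove locally.
    void handleRecoveryFailureChecked(String shardId, String failedAllocationId) {
        ShardCopy current = localShards.get(shardId);
        if (current != null && Objects.equals(current.allocationId, failedAllocationId)) {
            localShards.remove(shardId);
            notifyMasterShardFailed(shardId, failedAllocationId);
        }
    }

    private void notifyMasterShardFailed(String shardId, String allocationId) {
        System.out.println("shard-failed -> master: " + shardId + " / allocation " + allocationId);
    }
}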

@ywelsch ywelsch assigned ywelsch and unassigned dnhatn Jan 6, 2020
ywelsch added a commit that referenced this issue Jan 7, 2020
A failure of a recovering shard can race with a new allocation of the shard and cause the new
allocation to be failed as well. This can result in a shard being marked as initializing in the
cluster state but no longer existing on the node.

Closes #50508
ywelsch added a commit that referenced this issue Jan 7, 2020
A failure of a recovering shard can race with a new allocation of the shard and cause the new
allocation to be failed as well. This can result in a shard being marked as initializing in the
cluster state but no longer existing on the node.

Closes #50508
ywelsch added a commit that referenced this issue Jan 7, 2020
A failure of a recovering shard can race with a new allocation of the shard and cause the new
allocation to be failed as well. This can result in a shard being marked as initializing in the
cluster state but no longer existing on the node.

Closes #50508
SivagurunathanV pushed a commit to SivagurunathanV/elasticsearch that referenced this issue Jan 23, 2020
A failure of a recovering shard can race with a new allocation of the shard and cause the new
allocation to be failed as well. This can result in a shard being marked as initializing in the
cluster state but no longer existing on the node.

Closes elastic#50508