Reduce number of writes needed for metadata updates #48701

DaveCTurner · 2019-10-30T16:32:34Z

Today we split the on-disk cluster metadata across many files: one file for the metadata of each index, plus one file for the global metadata and another for the manifest. Most metadata updates only touch a few of these files, but some must write them all. If a node holds a large number of indices then it's possible its disks are not fast enough to process a complete metadata update before timing out. In severe cases affecting master-eligible nodes this can prevent an election from succeeding.

We plan to change the format of on-disk metadata to reduce the number of writes needed during metadata updates. One option is a monolithic file containing the complete metadata, but this is inefficient in the common case that the metadata is mostly unchanged. Another option is to keep an append-only log of changes, but such a log must be compacted and this introduces quite some complexity. However we already have access to a very good storage mechanism that has the right kinds of properties: Lucene! We will use a dedicated Lucene index on each master-eligible node and replace each individual file with a document in this index. Most metadata updates will need only a few writes, and Lucene's background merging will take care of compaction.
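To illustrate why a document-per-index layout keeps most updates cheap, here is a minimal sketch. It is not the Lucene API and not the planned implementation: `MetadataDocStore` and `applyUpdate` are hypothetical names, and a plain in-memory map stands in for the Lucene index. It only models the write pattern, where an update touches the documents that changed plus any that were removed.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/**
 * Hypothetical in-memory stand-in for a per-node metadata index:
 * one "document" per index, keyed by index UUID.
 * This is NOT Lucene; it only models how many documents an update touches.
 */
final class MetadataDocStore {
    private final Map<String, String> docs = new HashMap<>();

    /**
     * Brings the store in line with the new metadata and returns how many
     * documents had to be written or deleted to do so.
     */
    int applyUpdate(Map<String, String> newMetadata) {
        int writes = 0;
        // Upsert only the documents whose content actually changed.
        for (Map.Entry<String, String> e : newMetadata.entrySet()) {
            if (!e.getValue().equals(docs.get(e.getKey()))) {
                docs.put(e.getKey(), e.getValue());
                writes++;
            }
        }
        // Delete documents for indices that no longer exist.
        Set<String> removed = new HashSet<>(docs.keySet());
        removed.removeAll(newMetadata.keySet());
        for (String uuid : removed) {
            docs.remove(uuid);
            writes++;
        }
        return writes;
    }

    int size() {
        return docs.size();
    }
}
```

With 1000 indices in the store, changing one index and deleting another costs two document operations rather than a rewrite of every file. In a real Lucene index the superseded documents would be reclaimed later by merging.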

On master-ineligible nodes we can keep the existing format and still reduce the writes required, because we can make better use of the fact that master-ineligible nodes only write committed metadata and therefore the version numbers are trustworthy. It may also be possible to avoid writing index metadata during cluster state application entirely and defer it until later.
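The version-trusting idea can be sketched as follows. This is an assumption-laden illustration, not the existing file-based writer: `VersionTrackingWriter` and `writeChanged` are hypothetical names, and the actual disk write is replaced by a map update. Because only committed metadata is written, the sketch treats an unchanged version as proof of unchanged content and skips the write.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical writer that skips index metadata whose version is already on disk. */
final class VersionTrackingWriter {
    // index UUID -> version of the metadata most recently written
    private final Map<String, Long> writtenVersions = new HashMap<>();

    /**
     * Takes the versions from a committed cluster state and returns the
     * indices that had to be rewritten because their version changed.
     */
    List<String> writeChanged(Map<String, Long> committedVersions) {
        List<String> rewritten = new ArrayList<>();
        for (Map.Entry<String, Long> e : committedVersions.entrySet()) {
            Long previous = writtenVersions.get(e.getKey());
            if (previous == null || !previous.equals(e.getValue())) {
                // Stand-in for the real file write.
                writtenVersions.put(e.getKey(), e.getValue());
                rewritten.add(e.getKey());
            }
        }
        // Forget indices that are no longer part of the cluster state.
        writtenVersions.keySet().retainAll(committedVersions.keySet());
        return rewritten;
    }
}
```

Applying the same committed state twice rewrites nothing the second time. This shortcut would not be safe where uncommitted states are also persisted, since the version alone would then not identify the content.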

Later:

* Investigate alternative merging strategies -- is there a benefit in merging in the background after a commit rather than doing it inline while flushing?
* Investigate the performance of duplicated indexing across multiple data paths and contemplate alternatives (ref. Introduce Lucene-based metadata persistence #48733 (comment))
* Optimize file-based metadata storage to trust metadata versions
* Implement rescue tool for when global metadata document is missing or when there are duplicated docs (ref. Introduce Lucene-based metadata persistence #48733 (comment))

elasticmachine · 2019-10-30T16:32:37Z

Pinging @elastic/es-distributed (:Distributed/Cluster Coordination)

This commit introduces `LucenePersistedState` which master-eligible nodes can use to persist the cluster metadata in a Lucene index rather than in many separate files. Relates #48701

Today on master-eligible nodes we maintain per-index metadata files for every index. However, we also keep this metadata in the `LucenePersistedState`, and only use the per-index metadata files for importing dangling indices. However there is no point in importing a dangling index without any shard data, so we do not need to maintain these extra files any more. This commit removes per-index metadata files from nodes which do not hold any shards of those indices. Relates #48701

This moves metadata persistence to Lucene for all node types. It also reenables BWC and adds an interoperability layer for upgrades from prior versions. This commit disables a number of tests related to dangling indices and command-line tools. Those will be addressed in follow-ups. Relates #48701

Loading shard state information during shard allocation sometimes runs into a situation where a data node does not know yet how to look up the shard on disk if custom data paths are used. The current implementation loads the index metadata from disk to determine what the custom data path looks like. This PR removes this dependency, simplifying the lookup. Relates #48701

Adds command-line tool support (unsafe-bootstrap, detach-cluster, repurpose, & shard commands) for the Lucene-based metadata storage. Relates #48701

Earlier PRs for #48701 introduced a separate directory for the cluster state. This is not needed though, and introduces an additional unnecessary cognitive burden to the users. Co-Authored-By: David Turner <david.turner@elastic.co>

Adds support for writing out dangling indices in an asynchronous way. Also provides an option to avoid writing out dangling indices at all. Relates #48701

Moves node metadata to use the new storage mechanism (see #48701) as the authoritative source.

Writes cluster states out asynchronously on data-only nodes. The main reasons for writing out the cluster state at all are so that data-only nodes can snap into a cluster, so that they can do a bit of bootstrap validation, and so that the shard recovery tools work. Cluster states that are written asynchronously have their voting configuration replaced with a non-existent configuration so that these nodes cannot mistakenly become master even if their node role is changed back and forth. Relates #48701

Has the new cluster state storage layer emit warnings in case metadata performance is very slow. Relates #48701
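A slow-write warning of this kind can be sketched as below. This is only an illustration under stated assumptions: `SlowWriteMonitor`, its threshold parameter, and the message wording are hypothetical and are not taken from the Elasticsearch code. The clock is injected so that the behaviour can be exercised without real disk latency.

```java
import java.util.Optional;
import java.util.function.LongSupplier;

/** Hypothetical helper that times a metadata write and reports slow ones. */
final class SlowWriteMonitor {
    private final long thresholdMillis;
    private final LongSupplier clockMillis;

    SlowWriteMonitor(long thresholdMillis, LongSupplier clockMillis) {
        this.thresholdMillis = thresholdMillis;
        this.clockMillis = clockMillis;
    }

    /**
     * Runs the write and returns a warning message if it took longer than
     * the threshold, or an empty Optional if it finished in time.
     */
    Optional<String> timeWrite(Runnable write) {
        long start = clockMillis.getAsLong();
        write.run();
        long elapsed = clockMillis.getAsLong() - start;
        if (elapsed > thresholdMillis) {
            return Optional.of("writing cluster state took [" + elapsed
                + "ms] which is above the warn threshold of [" + thresholdMillis + "ms]");
        }
        return Optional.empty();
    }
}
```

In production code the clock would typically be `System::currentTimeMillis` (or a monotonic source) and the returned message would be passed to the logger at WARN level.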

Adds a command-line tool to remove broken custom metadata from the cluster state. Relates to #48701

Today we split the on-disk cluster metadata across many files: one file for the metadata of each index, plus one file for the global metadata and another for the manifest. Most metadata updates only touch a few of these files, but some must write them all. If a node holds a large number of indices then it's possible its disks are not fast enough to process a complete metadata update before timing out. In severe cases affecting master-eligible nodes this can prevent an election from succeeding.

This commit uses Lucene as a metadata storage for the cluster state, and is a squashed version of the following PRs that were targeting a feature branch:

* Introduce Lucene-based metadata persistence (elastic#48733)
  This commit introduces `LucenePersistedState` which master-eligible nodes can use to persist the cluster metadata in a Lucene index rather than in many separate files. Relates elastic#48701
* Remove per-index metadata without assigned shards (elastic#49234)
  Today on master-eligible nodes we maintain per-index metadata files for every index. However, we also keep this metadata in the `LucenePersistedState`, and only use the per-index metadata files for importing dangling indices. However there is no point in importing a dangling index without any shard data, so we do not need to maintain these extra files any more. This commit removes per-index metadata files from nodes which do not hold any shards of those indices. Relates elastic#48701
* Use Lucene exclusively for metadata storage (elastic#50144)
  This moves metadata persistence to Lucene for all node types. It also reenables BWC and adds an interoperability layer for upgrades from prior versions. This commit disables a number of tests related to dangling indices and command-line tools. Those will be addressed in follow-ups. Relates elastic#48701
* Add command-line tool support for Lucene-based metadata storage (elastic#50179)
  Adds command-line tool support (unsafe-bootstrap, detach-cluster, repurpose, & shard commands) for the Lucene-based metadata storage. Relates elastic#48701
* Use single directory for metadata (elastic#50639)
  Earlier PRs for elastic#48701 introduced a separate directory for the cluster state. This is not needed though, and introduces an additional unnecessary cognitive burden to the users. Co-Authored-By: David Turner <david.turner@elastic.co>
* Add async dangling indices support (elastic#50642)
  Adds support for writing out dangling indices in an asynchronous way. Also provides an option to avoid writing out dangling indices at all. Relates elastic#48701
* Fold node metadata into new node storage (elastic#50741)
  Moves node metadata to use the new storage mechanism (see elastic#48701) as the authoritative source.
* Write CS asynchronously on data-only nodes (elastic#50782)
  Writes cluster states out asynchronously on data-only nodes. The main reasons for writing out the cluster state at all are so that data-only nodes can snap into a cluster, so that they can do a bit of bootstrap validation, and so that the shard recovery tools work. Cluster states that are written asynchronously have their voting configuration replaced with a non-existent configuration so that these nodes cannot mistakenly become master even if their node role is changed back and forth. Relates elastic#48701
* Remove persistent cluster settings tool (elastic#50694)
  Adds the elasticsearch-node remove-settings tool to remove persistent settings from the on-disk cluster state in cases where it contains incompatible settings that prevent the cluster from forming. Relates elastic#48701
* Make cluster state writer resilient to disk issues (elastic#50805)
  Adds handling to make the cluster state writer resilient to disk issues. Relates to elastic#48701
* Omit writing global metadata if no change (elastic#50901)
  Uses the same optimization for the new cluster state storage layer as the old one, writing global metadata only when changed. Avoids writing out the global metadata if none of the persistent fields changed. Speeds up server:integTest by ~10%. Relates elastic#48701
* DanglingIndicesIT should ensure node removed first (elastic#50896)
  These tests occasionally failed because the deletion was submitted before the restarting node was removed from the cluster, causing the deletion not to be fully acked. This commit fixes this by checking the restarting node has been removed from the cluster.

Co-authored-by: David Turner <david.turner@elastic.co>

ywelsch · 2020-02-27T08:41:25Z

Closing this issue as the work here was merged for 7.6.0

Re-enabled and fixed test after we now persist metadata in lucene. Relates #48701

DaveCTurner added >enhancement Meta :Distributed/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. 7x labels Oct 30, 2019

DaveCTurner self-assigned this Oct 30, 2019

DaveCTurner mentioned this issue Oct 31, 2019

Introduce Lucene-based metadata persistence #48733

Merged

DaveCTurner mentioned this issue Nov 18, 2019

Remove per-index metadata without assigned shards #49234

Merged

ywelsch mentioned this issue Dec 12, 2019

Use Lucene exclusively for metadata storage #50144

Merged

polyfractal removed the 7x label Dec 12, 2019

This was referenced Dec 13, 2019

Add command-line tool support for Lucene-based metadata storage #50179

Merged

Omit loading IndexMetaData when inspecting shards #50214

Merged

ywelsch added a commit that referenced this issue Dec 19, 2019

Add command-line tool support for Lucene-based metadata storage (#50179)

c6e1ea8

Adds command-line tool support (unsafe-bootstrap, detach-cluster, repurpose, & shard commands) for the Lucene-based metadata storage. Relates #48701

ywelsch self-assigned this Dec 19, 2019

This was referenced Jan 6, 2020

Use single directory for metadata #50639

Merged

Add async dangling indices support #50642

Merged

Remove persistent cluster settings tool #50694

Merged

ywelsch added a commit that referenced this issue Jan 8, 2020

Add async dangling indices support (#50642)

1a9f88f

Adds support for writing out dangling indices in an asynchronous way. Also provides an option to avoid writing out dangling indices at all. Relates #48701

ywelsch mentioned this issue Jan 8, 2020

Fold node metadata into new node storage #50741

Merged

ywelsch added a commit that referenced this issue Jan 8, 2020

Fold node metadata into new node storage (#50741)

c8bfe3d

Moves node metadata to use the new storage mechanism (see #48701) as the authoritative source.

ywelsch mentioned this issue Jan 9, 2020

Write CS asynchronously on data-only nodes #50782

Merged

ywelsch added a commit that referenced this issue Jan 14, 2020

Warn on slow metadata performance (#50956)

91d7b44

Has the new cluster state storage layer emit warnings in case metadata performance is very slow. Relates #48701

ywelsch added a commit that referenced this issue Jan 14, 2020

Remove custom metadata tool (#50813)

d94b81e

Adds a command-line tool to remove broken custom metadata from the cluster state. Relates to #48701

ywelsch added a commit that referenced this issue Jan 14, 2020

Remove custom metadata tool (#50813)

4b0581f

Adds a command-line tool to remove broken custom metadata from the cluster state. Relates to #48701

polyfractal added v7.7.0 and removed v7.6.0 labels Jan 15, 2020

ywelsch added v7.6.0 and removed v7.7.0 v8.0.0 labels Jan 17, 2020

SivagurunathanV pushed a commit to SivagurunathanV/elasticsearch that referenced this issue Jan 23, 2020

Warn on slow metadata performance (elastic#50956)

11c3dcd

Has the new cluster state storage layer emit warnings in case metadata performance is very slow. Relates elastic#48701

SivagurunathanV pushed a commit to SivagurunathanV/elasticsearch that referenced this issue Jan 23, 2020

Remove custom metadata tool (elastic#50813)

cf35515

Adds a command-line tool to remove broken custom metadata from the cluster state. Relates to elastic#48701

This was referenced Feb 3, 2020

[meta] 7.6 release elastic/elasticsearch-net#4340

Closed

[meta] 7.6 release elastic/elasticsearch-net#4341

Closed

ywelsch added v7.6.1 v7.6.2 v7.6.0 and removed v7.6.0 v7.6.1 v7.6.2 labels Feb 27, 2020

ywelsch closed this as completed Feb 27, 2020

mfussenegger mentioned this issue Mar 24, 2020

ES Backports crate/crate#9796

Closed

37 tasks

henningandersen added a commit to henningandersen/elasticsearch that referenced this issue Jan 15, 2021

Fix peer recovery create lease test

2804dd7

Re-enabled and fixed test after we now persist metadata in lucene. Relates elastic#48701

henningandersen mentioned this issue Jan 15, 2021

Fix peer recovery create lease test #67580

Merged

henningandersen added a commit that referenced this issue Jan 18, 2021

Fix peer recovery create lease test (#67580)

126830f

Re-enabled and fixed test after we now persist metadata in lucene. Relates #48701

henningandersen added a commit that referenced this issue Jan 18, 2021

Fix peer recovery create lease test (#67580)

43a26bb

Re-enabled and fixed test after we now persist metadata in lucene. Relates #48701

henningandersen added a commit that referenced this issue Jan 18, 2021

Fix peer recovery create lease test (#67580)

dffcbbc

Re-enabled and fixed test after we now persist metadata in lucene. Relates #48701

mkleen mentioned this issue Mar 24, 2021

Roadmap backports 4.6 crate/crate#11180

Closed

2 tasks
