
Druid Server Guilds and Guild Aware Replication #10739

Closed
wants to merge 32 commits

Conversation

@capistrant (Contributor) commented on Jan 8, 2021

Fixes #9816

Description

Note: I use the term "Guild" throughout this proposal. It is an arbitrary name that I chose for the Historical grouping construct I am proposing. A Guild is a group of Historical servers. All servers that do not specify a Guild in runtime.properties are assigned to a default Guild.

I am adding the idea of guilds to Druid. A guild is a logical grouping of servers. With this PR, the only use of guilds is replicant distribution across Historical servers. The idea was born out of the desire for HDFS-like rack-aware replication for bare-metal, on-prem deployments. For this use case, Druid Historical services are assigned a guild based on the physical rack they live on. The Coordinator uses SegmentReplicantLookup to build a lookup of segment:guild replicant counts, and it prefers to load replicants across 2 or more guilds. It is important to note that, rather than loading fewer replicants than specified by Druid load rules, the Coordinator will load replicants on the same guild if it must.

This idea can go beyond physical racks in a data center and apply to things such as availability zones or arbitrary Historical server groupings in a virtualized deployment. That is why I came up with the name "guild" instead of saying "rack" explicitly.

Implementation

Configs
runtime.properties:

  • druid.server.guild — The STRING guild name assigned to a server. Defaults to _default_guild if not specified.
  • druid.coordinator.guildReplication.on — The BOOLEAN flag telling the Coordinator whether it should use server guilds as part of its coordination logic. Default value is false.

coordinator dynamic config:

  • guildReplicationMaxPercentOfSegmentsToMove — The percentage of segments moved during the BalanceSegments duty that should be dedicated to moving segments that do not meet the guild replication threshold of 2. The percentage is applied to the budget left over after moving segments off of decommissioning servers, if there are any. Default value is 0.
  • emitGuildReplicationMetrics — A BOOLEAN flag controlling whether to emit metrics for guild replication status. This PR adds one metric: a count of segments not loaded onto more than one guild. An example configuration is shown below.
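
For illustration, a Historical living on rack "a1" might set the new runtime property, and the Coordinator would opt in with its own property. This is only a sketch; the guild name rack-a1 is an example value, not anything the PR prescribes:

```properties
# Historical runtime.properties (rack-a1 is an example guild name)
druid.server.guild=rack-a1

# Coordinator runtime.properties
druid.coordinator.guildReplication.on=true
```

The two dynamic configs above (guildReplicationMaxPercentOfSegmentsToMove and emitGuildReplicationMetrics) are set through the coordinator dynamic configuration, alongside the existing dynamic configs.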

Code Changes

SegmentReplicantLookup

The Coordinator currently builds this object each time the HistoricalManagementDuties run. Prior to this proposal, the object contained two main data structures:

  • Table<SegmentId, String, Integer> segmentsInCluster - replicant count for each segment, by tier, for segments served by the cluster.
  • Table<SegmentId, String, Integer> loadingSegments - replicant count for each segment, by tier, for segments being loaded by the cluster.

My proposal adds a third data structure that is specific to guild replication:

  • Table<SegmentId, String, Integer> historicalGuildDistribution - replicant count for each segment, by guild.

This new structure is only created if guild replication is enabled. It is not worth the resources if we are not going to use it! The structure provides a quick lookup of the guild replication state for a given segment and is used by coordinator duties when deciding whether to load, move, or drop segments.
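
To make the idea concrete, here is a minimal sketch of the kind of lookup this structure enables. It is not the PR's code: it uses plain String segment ids instead of SegmentId, and addReplicant is a hypothetical helper, but it shows the segment -> guild -> replicant count shape and the "is this segment on more than one guild?" query the duties need.

```java
import com.google.common.collect.HashBasedTable;
import com.google.common.collect.Table;

// Sketch only: the real structure lives in SegmentReplicantLookup and is keyed by SegmentId.
class GuildReplicantLookupSketch
{
  // segmentId -> guild -> replicant count
  private final Table<String, String, Integer> guildDistribution = HashBasedTable.create();

  // Record one replicant of a segment being served by a server in the given guild.
  void addReplicant(String segmentId, String guild)
  {
    Integer count = guildDistribution.get(segmentId, guild);
    guildDistribution.put(segmentId, guild, count == null ? 1 : count + 1);
  }

  // The guild replication goal is met once two or more guilds serve the segment.
  boolean isLoadedOnMultipleGuilds(String segmentId)
  {
    return guildDistribution.row(segmentId).size() > 1;
  }
}
```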

LoadRule

When assigning replicas, LoadRule uses the changes in SegmentReplicantLookup to split the ServerHolders into two groups: servers on a guild that is already serving a replicant of the segment, and servers on a guild that is not. If possible, LoadRule loads replicants onto the best-scored server(s) from the guild(s) not yet serving a replicant. However, it always falls back to loading the specified number of replicants even if replication across guilds cannot be achieved.

When picking replicants to drop, we split the ServerHolders in the same way. Segments are dropped from decommissioning servers first, then from servers on guilds holding more than one replicant of the segment, and lastly from the remaining servers serving a replicant.
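
The split described above boils down to something like the following sketch. Plain strings stand in for ServerHolder and SegmentReplicantLookup, and the method name is invented for illustration; the real code operates on the Druid coordinator types.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

class GuildAwareLoadRuleSketch
{
  // Partition candidate servers by whether their guild already serves the segment.
  // Assignment prefers the "unservedGuilds" group; drops prefer the "servedGuilds" group.
  static Map<String, List<String>> splitServersByGuildStatus(
      Map<String, String> guildByServer,   // server name -> guild name
      Set<String> guildsServingSegment     // guilds already holding a replicant of the segment
  )
  {
    List<String> serversOnUnservedGuilds = new ArrayList<>();
    List<String> serversOnServedGuilds = new ArrayList<>();
    for (Map.Entry<String, String> entry : guildByServer.entrySet()) {
      if (guildsServingSegment.contains(entry.getValue())) {
        serversOnServedGuilds.add(entry.getKey());
      } else {
        serversOnUnservedGuilds.add(entry.getKey());
      }
    }
    Map<String, List<String>> split = new LinkedHashMap<>();
    split.put("unservedGuilds", serversOnUnservedGuilds); // preferred assignment targets
    split.put("servedGuilds", serversOnServedGuilds);     // fallback targets / preferred drop candidates
    return split;
  }
}
```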

Balancer Strategy

A method is added to the interface: pickGuildReplicationViolatingSegmentToMove(List<ServerHolder> serverHolders, Set<String> broadcastDataSources, DruidCoordinatorRuntimeParams params).

This new method gives the balancer access to the SegmentReplicantLookup information needed to pick a segment that is not replicated across more than one guild. It is needed because the new dynamic config introduces a balancing phase in BalanceSegments that prioritizes segments that are not properly replicated across guilds, and that phase must only pick segments meeting that criterion. The existing pickSegmentToMove does not suffice.

RandomBalancerStrategy and CostBalancerStrategy add implementations.
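
A rough sketch of the interface addition is shown below. Only the new method is drawn out; the @Nullable return and the BalancerSegmentHolder return type are assumptions mirroring the existing pickSegmentToMove contract, and the import paths reflect the coordinator package layout at the time of the PR.

```java
import java.util.List;
import java.util.Set;
import javax.annotation.Nullable;
import org.apache.druid.server.coordinator.BalancerSegmentHolder;
import org.apache.druid.server.coordinator.DruidCoordinatorRuntimeParams;
import org.apache.druid.server.coordinator.ServerHolder;

public interface BalancerStrategy
{
  // ... existing methods such as pickSegmentToMove(...) are unchanged ...

  // Pick a segment that is not replicated across more than one guild, or return
  // null if no such segment is found. The runtime params carry the
  // SegmentReplicantLookup needed to check guild replication state.
  @Nullable
  BalancerSegmentHolder pickGuildReplicationViolatingSegmentToMove(
      List<ServerHolder> serverHolders,
      Set<String> broadcastDataSources,
      DruidCoordinatorRuntimeParams params
  );
}
```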

ReservoirSegmentSampler

Adds a method for getting a SegmentHolder for a segment that violates the goal of being loaded on more than one guild. This method needs access to the SegmentReplicantLookup so it can quickly look up the replication state of the segments it may select. It returns the first violator it finds, or null if none is found.
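
In spirit, the "first violator or null" behavior looks like the sketch below. Plain collections stand in for the Druid server and segment types, and the method name is invented; the real sampler draws from the servers' loaded segments.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import javax.annotation.Nullable;

class GuildViolatorSamplerSketch
{
  // Return the first sampled segment served by fewer than two guilds, or null if
  // every sampled segment already satisfies the guild replication goal.
  @Nullable
  static String pickGuildReplicationViolator(
      List<String> sampledSegmentIds,
      Map<String, Set<String>> guildsBySegment   // from the guild replicant lookup
  )
  {
    for (String segmentId : sampledSegmentIds) {
      Set<String> guilds = guildsBySegment.get(segmentId);
      if (guilds == null || guilds.size() < 2) {
        return segmentId;   // first violator found
      }
    }
    return null;            // no violator among the sampled segments
  }
}
```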

BalanceSegments

The coordinator balancing duty gets a couple of changes. The first is to the generic balancing that exists today: if guild replication is enabled, we perform the same split of ServerHolders based on guild replication status when looking for a server to move a segment to. Just as in LoadRule, we do our best to move the segment to a server that improves or maintains the number of guilds holding a replicant.

We also add a new balancing phase. There is already a dedicated move for segments off of decommissioning servers; an operator can now also opt into a dedicated move for segments that are not replicated on more than one guild by setting the new coordinator dynamic config. This results in the Coordinator moving a certain number of guild-replication-violating segments each cycle, and it is an optional way to speed up the balancing of segments across guilds (see the sketch below).
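
Assuming the percentage is applied to the budget remaining after decommissioning moves, as described for the dynamic config above, the move budget for the new phase could be computed roughly like this (illustrative arithmetic only; variable names echo the configs described in this proposal):

```java
class GuildBalancingBudgetSketch
{
  static int guildViolatorMoveBudget(
      int maxSegmentsToMove,                          // existing coordinator dynamic config
      int segmentsMovedOffDecommissioningServers,     // dedicated decommissioning moves this cycle
      int guildReplicationMaxPercentOfSegmentsToMove  // new dynamic config, default 0
  )
  {
    int remaining = Math.max(0, maxSegmentsToMove - segmentsMovedOffDecommissioningServers);
    return (remaining * guildReplicationMaxPercentOfSegmentsToMove) / 100;
  }
}
```

For example, with maxSegmentsToMove = 100, 10 segments moved off decommissioning servers, and the percentage set to 20, the Coordinator would dedicate 18 moves to guild violators and leave 72 for general balancing.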

Design Choice Rationale + Alternatives

There has been some feedback that this functionality should be folded into the tier construct that already exists. I tend to disagree with that idea for multiple reasons.

  • Tiers have been around for a long time and have become a cornerstone of operators' deployments. It may not be easy or straightforward to re-purpose tiers to also work as generic replication groups. I fear we'd be creating a single monster that is hard to understand, versus having two separate concepts (guilds vs. tiers) split out into their own things.
  • As written, tiers are explicit: I load N replicants onto Tier A, M onto Tier B, etc. Guild replication, by contrast, is a generic best-effort goal: I try to load the aggregate replicants across all tiers onto 2 or more guilds. We can't make tier loading generic ("load across 2 or more tiers"), because operators expect their load rules to be followed to a T. So we would end up with competing goals that do not play nicely together. This ties back to my point that trying to adapt tiering to meet so many goals could create a confusing beast.
  • My motivating goals of uptime and maintainability may require a separate grouping outside of my existing tiering. Perhaps servers from my different tiers are mixed and matched within the logical groupings I would need to separate out to achieve those goals. If my grouping for uptime and maintainability is based on physical racks, I could have my slow and fast servers racked on the same set of racks, with a few of each form factor on any given rack. In that case I cannot reuse my speed tiers to also express my replication goals; I would need to further split each rack into tiers, which once again gets complex and probably hurts performance by adding far too much isolation versus shared resources.

So, with adapting tiering ruled out as far as I am concerned, what else could we do? I can't think of a configuration surface more generic than my simple guild name assignment plus a boolean telling the Coordinator whether or not to follow the guild replication logic paths. But what about the implementation details behind those configurations? It could certainly be argued that the cost of the new SegmentReplicantLookup data structure is too steep for large clusters. However, I have failed to come up with a better solution. When choosing servers to load segments onto or drop segments from, we need a quick way to look up where that segment is loaded, and it seems logical that this requires the large structure in the replicant lookup.

With all this being said, I am open to improvements to my proposal. At the end of the day my motivating factors just need to be achieved. If there is a better way to achieve the simple goals required, I'm all for it.

Operational impact of deployment

- Is anything going to be deprecated or removed by this change? How will we phase out old behavior?

Nothing is deprecated or removed.

- Is there a migration path that cluster operators need to be aware of?

Going from not using guild replication to using it is a simple migration. Operators need to assign guilds to their Historical servers and restart them, then flip the Coordinator config to turn on guild replication. They may also choose to use the coordinator dynamic config to speed up balancing based on guild replication.

- Will there be any effect on the ability to do a rolling upgrade, or to do a rolling downgrade if an operator wants to switch back to a previous version?

Upgrading to the first version of Druid that has guild replication will not require any special work; upgrading with the default configs keeps guild replication off. If operators choose to use guild replication, they can perform the migration steps after the upgrade. If not, no action is needed. Downgrading requires a pre-step if guild replication has already been turned on: turn it off on the Coordinator by updating the config and restarting it, so there are no issues with split versions across the Coordinator and Historicals during the downgrade.

- Other

This change as proposed results in additional resource utilization by the Coordinator. Extra data structures are created in SegmentReplicantLookup to help make replicant loading decisions based on the guild distribution of segments. Operators may need to account for this when sizing the Coordinator's runtime environment.

The change as proposed introduces a major change to the Coordinator's decision-making process. Enabling guild-aware replication may result in replicant loading decisions that the Coordinator would previously have considered sub-optimal. Loading replicants onto different guilds to gain better uptime and maintainability could result in segment distribution skew that negatively impacts performance. It is a trade-off that an operator will need to consider carefully.


This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

Key changed/added classes in this PR
  • SegmentReplicantLookup
  • CoordinatorDynamicConfig
  • BalanceSegments
  • LoadRule
  • ReservoirSegmentSampler
  • BalancerStrategy interface
  • DruidCoordinatorConfig
  • EmitClusterStatsAndMetrics
  • DruidServer

Diff context for the review comments below:

     * Drop will always happen on default guild to retain guild distribution.
     */
    @Test
    public void testDropMultipleGuilds()
@capistrant (Contributor, Author) commented:
This test fails in CI despite never failing locally. Needs to be looked into.

@capistrant (Contributor, Author) commented on Apr 13, 2021:
After stubbing out a load queue peon method I was able to get this test working.


stale bot commented Apr 19, 2022

This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 4 weeks if no further activity occurs. If you think that's incorrect or this pull request should instead be reviewed, please simply write any comment. Even if closed, you can still revive the PR at any time or discuss it on the dev@druid.apache.org list. Thank you for your contributions.

stale bot added the stale label on Apr 19, 2022
@capistrant (Contributor, Author) commented:

don't close


stale bot commented Apr 19, 2022

This issue is no longer marked as stale.

stale bot removed the stale label on Apr 19, 2022
@capistrant (Contributor, Author) commented:

Closing in favor of a future attempt.

capistrant closed this on Oct 6, 2022