
Proposal: Add configurable Server Guild attribute to enable the coordinator to make replication decisions that spread replicants across operator defined guilds #9816

Open
capistrant opened this issue May 4, 2020 · 21 comments

Comments

@capistrant
Contributor

capistrant commented May 4, 2020

Note: I use the term "Guild" throughout this proposal. It is an arbitrary name that I chose for the Historical grouping construct I am proposing. A Guild would be a group of Historical servers. All servers that do not specify a Guild in runtime.properties would be assigned to a default Guild.

Motivation

As of now, the only guarantee for segment loading is that no two replicants will be loaded onto the same Historical server. You can also use tiering to load a specific number of replicants onto specified tiers for a datasource via load rules. As an operator of a large cluster, I have two goals:

  • Maximize segment availability. I want my cluster to do its best to load replicants onto groups of servers that have the lowest likelihood of going down simultaneously and causing loss of segment availability.
  • Optimize maintainability. I want my cluster to do its best to load replicants onto groups of servers that allow me to perform maintenance (such as service restarts or OS patching) on one group of servers at a time without worrying about losing segment availability while doing so.

Both of these goals can be achieved by grouping Historical servers and making the coordination algorithm aware of the groups. The coordination algorithm can then use the groups as input into its decisions on where to load segments in the cluster, making a best effort to load replicants across 2+ groups. If this is done, I can, as an operator, set up my groupings to meet my goals.

Proposed changes

I have a POC going in my fork of Druid. The code can be viewed here.

Configs
runtime.properties:

  • druid.server.guild The STRING guild name assigned to a server. Default value applied if not specified.
  • druid.coordinator.guildReplication.on The BOOLEAN flag telling the coordinator if it should use server guilds as a part of its coordination logic. Default value is false.

coordinator dynamic config:

  • guildReplicationMaxPercentOfSegmentsToMove What % of segments moved during the BalanceSegments duty should be dedicated to moving segments that do not meet the guild replication threshold of 2? This is applied against what is left over after moving segments off of decommissioning servers, if there are any.
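
To make the moving pieces concrete, here is a minimal sketch of what the combined configuration might look like under this proposal (property names are taken from the lists above; the specific values and the guild name are hypothetical):

    # runtime.properties on a Historical, assigning it to a guild (hypothetical guild name)
    druid.server.guild=rack-a

    # runtime.properties on the Coordinator, enabling guild-aware replication
    druid.coordinator.guildReplication.on=true

    # coordinator dynamic config (e.g. POSTed to /druid/coordinator/v1/config),
    # dedicating a share of balancing moves to guild-violating segments
    {
      "guildReplicationMaxPercentOfSegmentsToMove": 10
    }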

SegmentReplicantLookup

The coordinator currently builds this object each time the HistoricalManagementDuties run. Prior to this proposal, the object would contain two main data structures:

Table<SegmentId, String, Integer> segmentsInCluster - replicant count for each segment by tier for segments served by the cluster.
Table<SegmentId, String, Integer> loadingSegments - replicant count for each segment by tier for segments being loaded by the cluster.

My proposal adds a third data structure that is specific to guild replication:

Table<SegmentId, String, Integer> historicalGuildDistribution - replicant count for each segment by guild.

This new structure is only created if guild replication is enabled. It is not worth the resources if we are not going to use it! The structure is used for quick lookup to get guild replication state for a given segment. It will be used by coordinator duties when making decisions on loading/moving/dropping segments.
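
As a rough illustration (this is not the POC code; it assumes Guava's Table like the two existing structures and uses plain String segment ids to stay self-contained), a coordinator duty could consult the guild-level counts like this:

    import com.google.common.collect.HashBasedTable;
    import com.google.common.collect.Table;

    // Hypothetical helper sketching how the proposed guild-level replicant counts
    // could be recorded and queried by coordinator duties.
    public class GuildReplicantCounts
    {
      // replicant count for each segment by guild (segment id -> guild -> count)
      private final Table<String, String, Integer> historicalGuildDistribution = HashBasedTable.create();

      public void addReplicant(String segmentId, String guild)
      {
        Integer current = historicalGuildDistribution.get(segmentId, guild);
        historicalGuildDistribution.put(segmentId, guild, current == null ? 1 : current + 1);
      }

      // number of distinct guilds serving at least one replicant of the segment
      public int getGuildCount(String segmentId)
      {
        return historicalGuildDistribution.row(segmentId).size();
      }

      // true if the segment fails the best-effort goal of being served from 2+ guilds
      public boolean violatesGuildReplication(String segmentId)
      {
        return getGuildCount(segmentId) < 2;
      }
    }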

LoadRule

When assigning replicas, use the changes in SegmentReplicantLookup to split the ServerHolders into two groups: servers whose guild is already serving a replicant of the segment, and servers whose guild is not. If possible, LoadRule will load replicants onto the best-scored server(s) from the guilds not yet serving a replicant. However, we always fall back to loading the specified number of replicants even if replication across guilds cannot be achieved.

When picking replicants to drop, we split the ServerHolders in the same way. Segments will be dropped from decommissioning servers first, then from servers on guilds holding more than one replicant of the segment, and lastly from the remaining servers serving a replicant.
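
A minimal sketch of that split, assuming the server holder (or a wrapper around it) exposes the proposed guild attribute; the class and method names here are illustrative, not taken from the POC:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Set;
    import java.util.function.Function;

    public class GuildSplitter
    {
      public static class Split<T>
      {
        // preferred targets: servers whose guild does not yet serve the segment
        public final List<T> onUnusedGuilds = new ArrayList<>();
        // fallback targets: servers whose guild already serves the segment
        public final List<T> onUsedGuilds = new ArrayList<>();
      }

      // "guildOf" stands in for the proposed guild accessor on a server holder;
      // "guildsServingSegment" would come from the SegmentReplicantLookup above.
      public static <T> Split<T> split(
          List<T> candidates,
          Function<T, String> guildOf,
          Set<String> guildsServingSegment
      )
      {
        Split<T> result = new Split<>();
        for (T server : candidates) {
          if (guildsServingSegment.contains(guildOf.apply(server))) {
            result.onUsedGuilds.add(server);
          } else {
            result.onUnusedGuilds.add(server);
          }
        }
        return result;
      }
    }

Assignment would then score and pick from onUnusedGuilds first, falling back to onUsedGuilds only when needed; the drop path reverses the preference after decommissioning servers are handled.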

BalancerStrategy

Add a method to the interface:

    pickSegmentToMove(List<ServerHolder> serverHolders, Set<String> broadcastDataSources, DruidCoordinatorRuntimeParams params);

This new method is added so that we have the SegmentReplicantLookup information needed to pick a segment that is not replicated across more than one guild when balancing the cluster. It is needed because we introduce the dynamic config and an associated balancing phase in BalanceSegments that prioritizes balancing segments that are not properly replicated across guilds. We need to pick only segments that meet that requirement; the existing pickSegmentToMove does not suffice.

RandomBalancerStrategy and CostBalancerStrategy each add an implementation.

ReservoirSegmentSampler

Adds a method for getting a SegmentHolder that violates the goal of being replicated across more than one guild. This method needs access to the SegmentReplicantLookup in order to quickly look up the replication state of the segments it is considering. It returns the first violator that it finds, or null if none is found.
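
For illustration only (the class and parameter names are placeholders; the proposal specifies just "return the first violator found, else null"), the selection might look like:

    import java.util.List;
    import java.util.function.Predicate;

    public class GuildViolatorPicker
    {
      // Walk the candidate segments and return the first one that the guild lookup
      // flags as being served from fewer than 2 guilds; null if there is no violator.
      public static <S> S pickFirstViolator(List<S> candidateSegments, Predicate<S> violatesGuildReplication)
      {
        for (S segment : candidateSegments) {
          if (violatesGuildReplication.test(segment)) {
            return segment;
          }
        }
        return null;
      }
    }

In practice the candidates would come from the sampled ServerHolders, and the predicate would delegate to the SegmentReplicantLookup described earlier.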

BalanceSegments

The coordinator balancing duty gets a couple of changes. The first change is to the generic balancing that exists today: if guild replication is enabled, we perform the split of ServerHolders based on their guild replication status when looking for a server to move a segment to. Just as in LoadRule, we do our best to move the segment to a server that improves or maintains the number of guilds that hold a replicant.

We also add a new phase of balancing segments. There is already a dedicated move for segments off of decommissioning servers. An operator can additionally choose to dedicate moves to segments that are not replicated on more than one guild, by setting the coordinator dynamic config this proposal adds. This results in the coordinator moving a certain number of segments that violate guild replication each run. It is an optional way for an operator to speed up the balancing of segments across guilds.
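
A sketch of how the per-run budget for this optional phase might be derived; the method and parameter names are hypothetical, and only the "percentage of what remains after decommissioning moves" behavior comes from the proposal:

    public class GuildBalancingBudget
    {
      // maxSegmentsToMove: overall per-run move budget from coordinator dynamic config
      // movedOffDecommissioning: moves already spent on decommissioning servers this run
      // guildReplicationMaxPercentOfSegmentsToMove: the proposed dynamic config, 0-100
      public static int movesForGuildViolators(
          int maxSegmentsToMove,
          int movedOffDecommissioning,
          double guildReplicationMaxPercentOfSegmentsToMove
      )
      {
        int remaining = Math.max(0, maxSegmentsToMove - movedOffDecommissioning);
        return (int) (remaining * guildReplicationMaxPercentOfSegmentsToMove / 100.0);
      }
    }

Whatever remains of the budget would be spent by the generic, guild-aware balancing described above.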

Rationale

There has been some feedback that this functionality should be folded into the tier construct that already exists. I tend to disagree with that idea for multiple reasons.

  • Tiers have been around for a long time and have become cornerstones of operators' deployments. It may not be easy or straightforward to re-purpose tiers to also work as generic replication groups. I fear we'd be creating a single monster that is hard to understand, versus having two separate concepts split out into their own things (guilds vs. tiers).
  • As written, tiers are explicit: I load N replicants onto Tier A, M onto Tier B, etc. Guild replication, by contrast, is a generic best-effort goal: I try to load the aggregate replicants across all tiers onto 2 or more guilds. We can't change tiers to generically spread loading across 2 or more tiers, because operators expect their load rules to be followed to a T. So we end up with competing goals that do not play nicely together. This ties back to my point that trying to adapt tiering to meet so many goals could create a confusing beast.
  • My motivating goals of uptime and maintainability may require a separate grouping outside of my existing tiering. Perhaps servers from my different tiers are mixed and matched within the logical groupings that I would need to separate out in order to achieve my motivating goals. For example, if my grouping for uptime and maintainability corresponds to physical racks, I could have my slow and fast servers racked on the same set of racks, with a few of each form factor on any given rack. That means my speed tiers cannot double as my replication groups. I would need to further split each rack into tiers, which once again gets complex and probably hurts performance by adding far too much isolation versus shared resources.

So, now that adapting tiering is ruled out as far as I am concerned, what else could we do? I can't think of a configuration more generic than a simple guild name assignment plus a boolean telling the coordinator whether or not to follow the guild replication logic paths. But what about the implementation details using these configurations? It could certainly be argued that the cost of the new SegmentReplicantLookup data structure is too steep for large clusters. However, I fail to come up with a better solution. When we are choosing servers to load segments onto or drop segments from, we need a quick way to look up where that segment is loaded. It seems logical that this facility would require this large structure in the replicant lookup.

With all this being said, I am open to improvements to my proposal. At the end of the day my motivating factors just need to be achieved. If there is a better way to achieve the simple goals required, I'm all for it.

Operational impact

This section should describe how the proposed changes will impact the operation of existing clusters. It should answer questions such as:

- Is anything going to be deprecated or removed by this change? How will we phase out old behavior?

Nothing is deprecated or removed.

- Is there a migration path that cluster operators need to be aware of?

There is a simple migration path from not using guild replication to using it. Operators will need to assign guilds to their Historical servers and restart them. They will then have to flip the coordinator config to turn on guild replication. They may also choose to use the coordinator dynamic config to speed up balancing based on guild replication.

- Will there be any effect on the ability to do a rolling upgrade, or to do a rolling downgrade if an operator wants to switch back to a previous version?

Upgrading to the first version of Druid that has guild replication will not require any special work; upgrading with the default configs keeps guild replication off. If operators choose to use guild replication, they can perform the migration steps after the upgrade. If they choose not to, no action is needed! Downgrading requires a pre-step if guild replication has already been turned on: turn it off on the coordinator by updating the config and restarting it. That way there will not be any issues with split versions across coordinator/historical during the downgrade.

- Other

This change as proposed results in additional resource utilization by the Coordinator. Extra data structures are created in SegmentReplicantLookup to help make replicant loading decisions based on the guild distribution of segments. This may require operators to revisit how they size the Coordinator's runtime environment.

The change as proposed introduces a major change to the Coordinator's decision-making process. Enabling guild-aware replication may result in replicant loading decisions that the coordinator would previously have considered sub-optimal. Loading replicants onto different guilds to gain better uptime and maintainability could result in segment distribution skew that negatively impacts performance. It is a trade-off that an operator will need to consider carefully.

TODO (for completeness of this proposal)

  • I believe I still need to add logic for the case where only 1 total replicant is expected for a segment across all tiers. If that is the case, there is really no reason to make guild-related loading/balancing decisions for it, since there is only ever 1 replicant. This one is tough to deal with; we'd have to dive into the load rules for each segment we are evaluating for moving to take this into account during BalanceSegments.
  • Evaluate need for integration tests
  • Evaluate possibility for simulation tool. Could we take a cluster snapshot and produce a simulated output of the cluster state once it is balanced for guild replication?

Test plan (optional)

Future work (optional)

@himanshug
Contributor

Isn't this doable using multiple historical tiers ... e.g.
some historicals with druid.tier=A
some historicals with druid.tier=B

and then using tieredReplicants in the loadRule e.g.

  "tieredReplicants": {
      "A": 1,
      "B" : 1
  }

@capistrant
Contributor Author

For a case where the number of groups or racks equals the number of replicas desired, sure. But a large enterprise cluster will have tens of racks and will not want a replica on every rack due to cost concerns. And trying to automatically generate load rules that properly balance datasources across those groups by selecting tiers would be burdensome and likely hard to do well.

@himanshug
Contributor

himanshug commented Oct 26, 2020

for a case where number of groups or racks = number of replicas desired, sure. But a large enterprise cluster will have 10s of racks and will not want a replica on every rack due to cost concerns.

that should be possible I think.

say, you have 4 racks/groups/tiers ... A, B, C, D

with config...

  "tieredReplicants": {
      "A": 1,
      "B" : 2
  }

number of replicants != number of tiers, and not every replica is on each tier.

or maybe I misunderstood above statement.

And trying to automatically generate load rules that properly balance datasources across those groups by selecting tiers would be burdensome and likely hard to do well.

Yeah, things could possibly be simplified, but the use of tiers does let you achieve the end goal. I just wanted to point out the existing alternative and also, if possible, it would be nice if tier-based grouping could be enhanced to take care of additional use cases or simplification instead of introducing another way of grouping historicals.

@ArvinZheng
Contributor

ArvinZheng commented Oct 26, 2020

@himanshug ,
Yes, using multiple tiers can be an alternative for some cases, but it doesn't work when you want to keep the same number of active replicants regardless of how many racks/zones you have in the cluster. For example, there will be only one active replicant if rack B goes down in your example. And if both A and B go down, the cluster/coordinator won't automatically utilize C and D to serve the data from A and B.

@himanshug
Contributor

That is true, and I am suggesting that, if possible, the enhancement could be made to the tier concept itself rather than introducing another construct to do the grouping.

Just for illustration, the config could be something like:

"numReplicas": 3
"tiers": [ "A", "B", "C" ]     //optional and defaults to all/any available tiers

@ArvinZheng
Contributor

I see your point, but people (like me) may have gotten used to leveraging tiers to split data into cold and hot groups and allocating different resources to the cold and hot tiers. Rack awareness is a different concept, where we want to make a tier more tolerant of rack/zone failures. I vote for using different configs/terms when introducing this feature.

@dougbyrne

In the case of cloud hosting, I generally think of "rack" as corresponding to "availability zone." My tiers and AZs will overlap in ways that don't make sense in the context of physical racks.

We are in the process of adding on-demand and spot tiers. I can specify the number of replicas for each tier to mitigate against the possibility of spot instances becoming unavailable. I also need to mitigate against AZ failure. Each AZ can contain a mix of on-demand and spot nodes. I would like the distribution of replicas to also be aware of AZ, even across the tiers.

I'm not sure that having a tier for each combination of AZ and pricing class will satisfy both conditions.

AWS Jargon: On-demand instances are the "normal" type. Spot instances are lower cost, but can be pre-empted by requests for on-demand instances.

@ArvinZheng
Contributor

@dougbyrne ,
"rack" doesn't have to be a physical rack here, in your case the value of druid.rack will be AZ.

The goal of this proposal is to make sure Druid distributes replicants across racks/AZs within a tier, so to achieve this, you should make sure every tier in your Druid cluster has Historical nodes running in multiple AZs.
@capistrant correct me if I am wrong.

I would like the distribution of replicas to also be aware of AZ, even across the tiers.

@dougbyrne

Yes, that's what I said: in the context of cloud hosting, a "rack" is an AZ. I'm trying to illustrate why tiers alone do not satisfy my distribution requirements.

Let's say I have tier 1 in racks A, B, and C, and tier 2 also in racks A, B, and C. Then I configure tier 1 to have 1 replica, and tier 2 to have 2 replicas. If my tier 1 replica were in rack A, and my tier 2 replicas were in racks A and B, that would not be acceptable to me. I should not have multiple replicas in the same rack when it can be avoided.

I'm not sure extending tiers will address this, as tiers and rack are different dimensions and can overlap.

@ArvinZheng
Contributor

hmm, sorry @dougbyrne , I'm not sure if I understood your concern.

In the example that you posted, tier 1 wouldn't benefit from the rack awareness proposal since you have only 1 replica in that tier, so once you lose a rack, a portion of the data will be unavailable until it is redistributed to another rack. But the data in tier 2 should be able to tolerate at least one zone failure under this proposal, since segment distribution will prioritize rack first; and if you configure tier 2 to have 3 replicas, then we would expect tier 2 to tolerate 2 zones being unavailable.

@dougbyrne

In my example, I'm loading the same data in both Tier 1 and Tier 2. I have a total of 3 replicas. The rack awareness needs to cross tier boundaries.

@ArvinZheng
Contributor

Ah, I see. May I ask one more question about your use case? What's the purpose of setting up 2 tiers to serve the same data?

@dougbyrne

I have one tier running on On-Demand instances, another tier running on spot (preemptable) instances. I want to ensure that there is at least one replica for our hottest data that will not be preempted.

@ArvinZheng
Contributor

I see. Would it be possible/acceptable to serve the data from the same tier and use different AZs for on-demand and spot instances? That way you don't have to use multiple tiers to serve the same data, and you let Druid replicate the segments to different AZs so the cluster can tolerate AZ failures (assuming we have this rack awareness feature implemented and enabled).

@dougbyrne

That would not be ideal. The availability of spot instances differs per AZ, so we want to have as many AZs as possible available to our spot tier. We also need multiple AZs available in the on-demand tier to mitigate the possibility of an AZ failure.

I think you raise a good point that tiers are not the best way to solve this problem. Maybe instead of having just rack diversity, it should be more generic. Rack and Zone and Price Class are all dimensions you might want to diversify across.

Also, our load rules are a bit more complicated. The most recent data is in both the on-demand tier and the hot-spot tier. The next most recent data is only in that same hot-spot tier. The remaining data is loaded on a cold-spot tier. We also want only one replica in the on-demand tier, while we might have more in the spot tier.

@capistrant
Contributor Author

@dougbyrne @ArvinZheng @himanshug

Hi all. Sorry for not being active in the discussion last week. I have revisited my initial proposal and come up with a new POC branch that I think is better than the first attempt. code link here

I walked away from the use of the term "rack". For now I have used the term "guild", but naming is not the issue at hand here; we can easily change that. At the end of the day, a guild is just a group. If an operator so chooses, they can assign every historical to a guild and then tell the coordinator that guild replication is on. This means that the coordinator should make a best effort to load replicants onto guilds that are not already used whenever possible. However, if a replicant has to go onto an already-used guild, the coordinator will do so, because having N replicants on one guild is better than having N-1 replicants in the cluster when there are no unused guilds left to place a segment on. A guild can be anything. For my team, it will be a physical rack. For Doug, it may be an AZ or an AZ:price class combination.

I am not a very big fan of layering this into tiering. Tiering as it stands seems much more like a query routing/prioritization/isolation feature. I want my router/broker to make query choices based on my tiers. I do not want them to care about my guilds. A segment being on guild A vs guild Z means nothing in terms of which should be queried. Guilds are just a way to help operators improve operational simplicity and uptime reliability. If we were going to roll this into tiering, I fear that we would create one very confusing beast rather than two separate and generally simple ideas.

The POC code linked above still needs heavy testing and iteration to work out the kinks (I'm sure there are many), but I think it is a good anchor point for future discussion. I can take the time to update my base proposal with the details of this new implementation if people think that is a good idea. Please let me know! In the meantime I will continue working on testing/validation. I really want to look into the balancer profiler, see what that is all about, and see if I can turn it into some type of simulation of a cluster at scale turning on guild awareness.

@himanshug
Contributor

I haven't seen the code, but from a quick read of the comments above, I see that the proposed new configuration would cover multiple segment-distribution-related use cases, so +1 on that.

@capistrant capistrant changed the title Historical rack aware (or group aware) data replication Proposal: Add configurable Server Guild attribute to enable the coordinator to make replication decisions that spread replicants across operator defined guilds Nov 23, 2020
@capistrant
Contributor Author

I've updated the proposal to reflect the latest implementation I have used for my POC. Will be adding some more changes for testing this week I hope.

@ngaugler

ngaugler commented Sep 7, 2023

Has any progress been made on anything over the years or is it still impossible to appropriately redundantly balance data?

@abhishekagarwal87
Contributor

Is the tiering not sufficient for the use case you have? It won't solve for all scenarios but is good enough for many of them.

@capistrant
Contributor Author

Has any progress been made on anything over the years or is it still impossible to appropriately redundantly balance data?

This has remained stalled, but I plan on picking up the effort again sometime this fall if it remains open.
