
Proposal: Separate Primary Replicant loading from the rest of HistoricalManagementDuties #10606

Open

capistrant opened this issue Nov 25, 2020 · 1 comment

Motivation

Loading primary replicants for Druid segments is one of the most important things the Coordinator does: without a primary replicant available on the cluster, a segment cannot be queried. Today the Coordinator performs primary replicant loading within the set of Coordinator duties related to Historical Management. Because of this grouping, the Coordinator can spend a lot of time on other work, such as loading non-primary replicants and balancing segments, before it gets back to loading primary replicants. A side effect of waiting on these other Coordinator jobs is that data stays unavailable longer than it otherwise would, which is a poor end-user experience. Breaking primary replicant loading out into its own scheduled runnable group can guarantee that primary replicants are loaded more regularly.

Proposed changes

POC Code Link

I am proposing an optional new DutiesRunnable in the DruidCoordinator. Operators can choose whether to break primary replicant loading out into its own DutiesRunnable. If they leave dedicated primary replicant loading disabled, their Coordinator functions just as it always has. If they enable it, their Coordinator adds a scheduled DutiesRunnable dedicated to executing the matching LoadRule for each segment, performing only the primary replicant load when the rule is run. The HistoricalManagement DutiesRunnable will continue all other HistoricalManagement duties, including non-primary replicant loading and replicant dropping, while executing a matched LoadRule for a segment.

My POC implementation for the proposal exposes two new Coordinator runtime configurations for operators: druid.coordinator.loadPrimaryReplicantSeparately and druid.coordinator.period.primaryReplicantLoaderPeriod. If the first is enabled, a scheduled executor with a configurable period is set up for loading primary replicants.
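For example, an operator opting in might add something like the following to the Coordinator's runtime.properties (the property names are the ones proposed above; the period value is illustrative):

```properties
# Opt in to a dedicated primary replicant loading runnable
druid.coordinator.loadPrimaryReplicantSeparately=true
# How often the dedicated primary replicant loader runs (illustrative value)
druid.coordinator.period.primaryReplicantLoaderPeriod=PT60S
```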

The new DutiesRunnable would consist of two duties: UpdateCoordinatorStateAndPrepareCluster and RunRules.

  • There is an open TODO to analyze the negative effects of having two DutiesRunnables that both run UpdateCoordinatorStateAndPrepareCluster. It is possible that only one of the two should execute the full duty while the other runs a scaled-down version.
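As a rough sketch of the opt-in wiring (the class and method names below are hypothetical illustrations, not actual DruidCoordinator code), the flag would simply add a second duty group with its own, typically shorter, period:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not actual Druid code. Models the proposal's opt-in
// second DutiesRunnable alongside the existing HistoricalManagement group.
public class CoordinatorSchedulingSketch {
    static final class DutyGroup {
        final String name;
        final long periodMillis;

        DutyGroup(String name, long periodMillis) {
            this.name = name;
            this.periodMillis = periodMillis;
        }
    }

    /**
     * Builds the duty groups the coordinator would schedule. When the
     * loadPrimaryReplicantSeparately flag is set, an extra group running
     * UpdateCoordinatorStateAndPrepareCluster + RunRules (primary-only mode)
     * is added with its own period.
     */
    static List<DutyGroup> buildDutyGroups(
            boolean loadPrimaryReplicantSeparately,
            long historicalPeriodMillis,
            long primaryLoaderPeriodMillis
    ) {
        List<DutyGroup> groups = new ArrayList<>();
        groups.add(new DutyGroup("HistoricalManagementDuties", historicalPeriodMillis));
        if (loadPrimaryReplicantSeparately) {
            groups.add(new DutyGroup("PrimaryReplicantLoaderDuties", primaryLoaderPeriodMillis));
        }
        return groups;
    }

    public static void main(String[] args) {
        // Disabled: coordinator behaves exactly as before (one group).
        System.out.println(buildDutyGroups(false, 60_000, 30_000).size()); // prints 1
        // Enabled: a dedicated primary replicant loader group is scheduled too.
        System.out.println(buildDutyGroups(true, 60_000, 30_000).size());  // prints 2
    }
}
```

The point of the sketch is that the disabled path builds exactly the groups the Coordinator builds today, so the feature is strictly additive.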

RunRules and LoadRule will need a mode associated with them. RunRules would execute in one of two modes: one executes only the LoadRule rules that match, the other runs every matched Rule. LoadRule is similar and needs three modes: one that loads only a primary replicant (for the dedicated primary replicant load), one that skips the primary replicant load, and one that runs all of LoadRule without regard to replicant type.
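The mode split described above might look like the following enum-based sketch (names such as RuleExecutionMode are hypothetical and do not exist in Druid; the counting logic is a simplified illustration of the proposed behavior):

```java
// Hypothetical sketch of the proposed LoadRule execution modes;
// these names are illustrative only, not existing Druid classes.
public class LoadRuleModeSketch {
    enum RuleExecutionMode {
        PRIMARY_ONLY, // dedicated loader: ensure a primary replicant exists, nothing more
        SKIP_PRIMARY, // historical management: load everything except the primary replicant
        ALL           // current behavior: load all replicants the rule calls for
    }

    /**
     * Number of replicants a LoadRule run should assign for one segment,
     * given the rule's target, the current count, and the execution mode.
     */
    static int replicantsToLoad(int target, int current, RuleExecutionMode mode) {
        switch (mode) {
            case PRIMARY_ONLY:
                // Only make the segment queryable; leave the remaining
                // replicants to the HistoricalManagement runnable.
                return current == 0 ? 1 : 0;
            case SKIP_PRIMARY: {
                // Leave the primary replicant (the first copy) to the
                // dedicated loader when none exists yet.
                int deficit = Math.max(0, target - current);
                return current == 0 ? Math.max(0, deficit - 1) : deficit;
            }
            default: // ALL
                return Math.max(0, target - current);
        }
    }

    public static void main(String[] args) {
        // An unavailable segment with a target of 3 replicants:
        System.out.println(replicantsToLoad(3, 0, RuleExecutionMode.PRIMARY_ONLY)); // prints 1
        System.out.println(replicantsToLoad(3, 0, RuleExecutionMode.SKIP_PRIMARY)); // prints 2
        System.out.println(replicantsToLoad(3, 0, RuleExecutionMode.ALL));          // prints 3
    }
}
```

Together, PRIMARY_ONLY and SKIP_PRIMARY cover the same work as ALL, so running both runnables yields the same end state as today, just with the primary copy loaded sooner.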

Rationale

I think the biggest benefit here is more control for the operator to ensure that primary replicant loading runs as often as needed. On large clusters that do a lot of balancing and non-primary replicant loading as servers come in and out of the cluster, primary replicant loading can get blocked often enough that users ask why their new segments aren't becoming available in a timely manner after batch indexing finishes.

As for alternative approaches, I have not yet thought of another way to achieve this elevated priority for primary replicant loading, but I am definitely open to suggestions.

Operational impact

This section describes how the proposed changes will impact the operation of existing clusters:

  • Is anything going to be deprecated or removed by this change? How will we phase out old behavior?
    • N/A
  • Is there a migration path that cluster operators need to be aware of?
    • Enabling this requires coordinator config changes and a restart.
  • Will there be any effect on the ability to do a rolling upgrade, or to do a rolling downgrade if an operator wants to switch back to a previous version?
    • A rolling upgrade to the first version that includes this change would not require any action; leaving the new configs unset leaves the Coordinator as-is. An operator can enable the feature after upgrading if they so choose.
    • Downgrading should not have any impact. The configs, even if specified by the operator, would be ignored, and the Coordinator would go back to how it operated before there was a dedicated primary replicant loader.

Test plan (optional)

TBD

Future work (optional)

TBD

@OurNewestMember

+1

If this feature allows an admin to set a configuration property on the (dynamic) Coordinator config to control primary replicant loading specifically, it could be extremely valuable. It could improve realtime task publishing time (which in turn can prevent pending or even failed tasks, and therefore ingest lag, etc.) and also prevent segments in limbo that have been published but are not yet queryable (which may relate to query failure rates as well).

This is such a critical aspect of the system, and it deserves as much of an interface as non-primary replicants currently have in the coordinator dynamic config. +1
