Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for selective loading of lookups in the task layer #16328

Merged
merged 17 commits into from
Apr 29, 2024

Conversation

Akshat-Jain
Copy link
Contributor

@Akshat-Jain Akshat-Jain commented Apr 24, 2024

Description

This PR adds support for loading a specified subset of lookups in the task layer via a new class named LookupLoadingSpec. It updates the LookupReferencesManager to support 3 modes:

  1. Load all the lookups if LookupLoadingSpec.mode is ALL.
  2. Don't load any lookups if LookupLoadingSpec.mode is NONE.
  3. Load only the specified lookups if LookupLoadingSpec.mode is ONLY_REQUIRED. The lookups to load are consumed using LookupLoadingSpec#getLookupsToLoad().

This PR also updates the KillUnusedSegmentsTask to return an empty set of lookups to load, as this task has no dependency on any lookups. Future work would involve implementing this functionality for other tasks, depending on their dependency on lookups.

Test plan

  1. I did some manual testing which involved running KillUnusedSegmentsTask with different implementations of getLookupsToLoad() method, to test the behavior of all the 3 aforementioned modes.
  2. Added or updated unit tests to validate the changed functionality.

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@kfaraz
Copy link
Contributor

kfaraz commented Apr 24, 2024

There is another PR #16266 which also addresses the same problem, but takes a slightly different approach.

cc: @cryptoe , @Akshat-Jain

@kfaraz
Copy link
Contributor

kfaraz commented Apr 24, 2024

As I see it, pros of this approach:

  • It doesn't require changing the way a CliPeon is launched.
  • Allows specific task implementations such as MsqControllerTask to give dynamic implementations of their getLookupsToLoad() methods, thus allowing them to load any list of lookups. Even with the other approach, this method would be needed at some point down the line.
  • Doesn't require a new enum.

Cons of this approach:

  • We have attached a special meaning to null. In other words, null here means Load Everything, which is not too great. This can always be remedied though by returning a wrapper object instead of just a List<String>. The wrapper object can have a boolean or enum flag to indicate the tristate: ALL, NONE, REQUIRED.
  • Based on this comment (Do not Load lookups on tasks that will not use them #13324 (comment)), it seems that MSQ workers know about the lookups they need to load only after they start running (or maybe I am reading it wrong). The approach in this PR injects the list of lookups to load at start-up itself. Is the list of lookups required for a worker included in its task payload?

cc: @cryptoe

@Akshat-Jain
Copy link
Contributor Author

Is the list of lookups required for a worker included in its task payload?

@kfaraz IIUC, it's not. This PR only intends to add support for selective loading of lookups in the task layer, with task changes only for a sample task (KillUnusedSegmentsTask). The work for extracting the list of lookups and passing them for tasks that need lookups is pending, and I was thinking of it as my next step as a separate PR.

With that said, this should be applicable for both approaches.

@IgorBerman
Copy link
Contributor

@Akshat-Jain Hi
do you mind to mark Compaction task not to load lookups as well please? same as you did for Kill task ?

@Akshat-Jain
Copy link
Contributor Author

@IgorBerman Sure, I can do that! :)
Will add it first thing tomorrow!

Copy link
Contributor

@kfaraz kfaraz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the changes, @Akshat-Jain ! Overall, the changes look good. I have left some minor suggestions.

Copy link
Contributor

@kfaraz kfaraz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing up the tests!

@Akshat-Jain
Copy link
Contributor Author

@IgorBerman Have updated CompactionTask to not load lookups as well.
cc: @kfaraz

@Provides
@LazySingleton
@Named(DataSourceTaskIdHolder.LOOKUPS_TO_LOAD_FOR_TASK)
public List<String> getLookupsToLoad(final Task task)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does it work for indexers and mm less task system. Does that code go via cliPeon ? Maybe you can test locally and see how it goes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indexer is a single process, so it should always load all lookups, right? IIUC, the changes here would only bind the lookupsToLoad for peons, which is what we want.

Not too sure about MM-less system. The other PR had tweaked K8sTaskAdapter to handle it there.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried using Indexers. All lookups are loaded at the very beginning for Indexers. Changes of this PR don't affect the Indexer flow in any way.

For MM-less system, I'm not sure yet. @cryptoe Based on our earlier discussion, I have to setup the Kubernetes thing locally and try MM-less ingestion, and see what happens.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mmless runs CliPeon as the entrypoint so it should work as intended

Copy link
Contributor

@cryptoe cryptoe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes LGTM. Looking forward to results from MM less ingestion testing.

Copy link
Contributor

@kfaraz kfaraz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@kfaraz kfaraz merged commit 9d2cae4 into apache:master Apr 29, 2024
87 checks passed
@Akshat-Jain
Copy link
Contributor Author

Thanks @kfaraz and @cryptoe for the exhaustive review! 🙌

@cryptoe cryptoe added this to the 30.0.0 milestone Apr 29, 2024
adarshsanjeev pushed a commit to adarshsanjeev/druid that referenced this pull request May 6, 2024
…e#16328)

Changes:
- Add `LookupLoadingSpec` to support 3 modes of lookup loading: ALL, NONE, ONLY_REQUIRED
- Add method `Task.getLookupLoadingSpec()`
- Do not load any lookups for `KillUnusedSegmentsTask`
kfaraz pushed a commit that referenced this pull request May 7, 2024
…ayer (#16328) (#16394)

Changes:
- Add `LookupLoadingSpec` to support 3 modes of lookup loading: ALL, NONE, ONLY_REQUIRED
- Add method `Task.getLookupLoadingSpec()`
- Do not load any lookups for `KillUnusedSegmentsTask`

Co-authored-by: Akshat Jain <akjn11@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants