Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add MSQ query context maxNumSegments #16637

Merged
merged 20 commits into from
Jun 26, 2024

Conversation

abhishekrb19
Copy link
Contributor

@abhishekrb19 abhishekrb19 commented Jun 21, 2024

This patch:

  • Adds an MSQ query context parameter maxNumSegments that acts as a guard rail on the maximum number of segments that can be generated by an ingestion task per time chunk. A similar parameter exists in the split hint spec supported by the native engine.
  • The default value for this context is null -- i.e., no restriction on the number of segments per time chunk. I considered Integer.MAX_INT, but null feels more natural.
  • If set or overridden by a user, and a time chunk contains more segments than the number specified in the query context, the MSQ ingestion task will fail with a TooManySegmentsInTimeChunkFault .

Most users won't need to set this config, but this will be useful in the following scenarios:

  1. Testing/verification that only a specific # of segments gets created per time chunk
  2. Implementing any guard rails in the ingestion flow

I intentionally left the context parameter and the fault undocumented. We can document it once we have a wider use case. Similarly, no release note is required for this patch.

This PR has:

  • been self-reviewed.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • been tested in a test Druid cluster.

- Default is MAX_INT (unbounded).
- When set and if a time chunk contains more number of segments than set in the
  query context, the MSQ task will fail with TooManySegments fault.
@abhishekrb19 abhishekrb19 marked this pull request as draft June 21, 2024 22:02
@github-actions github-actions bot added Area - Batch Ingestion Area - Querying Area - MSQ For multi stage queries - https://github.com/apache/druid/issues/12262 labels Jun 21, 2024
@abhishekrb19 abhishekrb19 marked this pull request as ready for review June 26, 2024 04:23
Copy link
Contributor

@cryptoe cryptoe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some comments.
Overall lgtm.

Copy link
Contributor

@cryptoe cryptoe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes LGTM!!

@abhishekrb19
Copy link
Contributor Author

Thanks for the review, @cryptoe. I renamed the validation function and also included granularity info in the error message.

@abhishekrb19 abhishekrb19 merged commit 82117e8 into apache:master Jun 26, 2024
58 checks passed
@abhishekrb19 abhishekrb19 deleted the msq_max_num_segments branch June 26, 2024 16:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area - Batch Ingestion Area - MSQ For multi stage queries - https://github.com/apache/druid/issues/12262 Area - Querying
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants