[PROPOSAL] Add segment limit for native and streaming index tasks #7238
Comments
@justinborromeo thank you for the proposal! It looks good to me, but I have one suggestion. We currently have `maxTotalRows` [...]. So, I would suggest replacing [...].
@jihoonson I think [...]
As an example, if we use the [...]. Do you agree?
After taking a first crack at this, I've realized that publishing segments once the number of segments in the Appenderator exceeds some limit results in the limit-crossing row being stored in its own one-row segment. This is because the proposed implementation checks the number of segments after the row has been stored. A better behaviour would be to check whether a row will put the number of segments over the segment limit before the row is stored, preventing the creation of one-row segments.
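A minimal sketch of the check-before-add behaviour described in the comment above. None of these names (`SegmentLimitCheckSketch`, `wouldExceedSegmentLimit`, `recordAdd`) are Druid's actual API; they are hypothetical and only illustrate the ordering of the check relative to the add:

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch, not Druid code: tracks open segments and checks the
// limit *before* a row is stored rather than after.
public class SegmentLimitCheckSketch
{
  private final int maxTotalSegments;
  private final Set<String> openSegmentIds = new HashSet<>();

  public SegmentLimitCheckSketch(int maxTotalSegments)
  {
    this.maxTotalSegments = maxTotalSegments;
  }

  // Returns true if storing a row destined for segmentId would open a new
  // segment and push the open-segment count past the limit. Checking before
  // the add lets the task push and clear the existing segments first, so the
  // limit-crossing row never ends up alone in a one-row segment.
  public boolean wouldExceedSegmentLimit(String segmentId)
  {
    return !openSegmentIds.contains(segmentId)
           && openSegmentIds.size() + 1 > maxTotalSegments;
  }

  public void recordAdd(String segmentId)
  {
    openSegmentIds.add(segmentId);
  }
}
```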
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Commenting to keep the issue open
Motivation
To prevent IndexTasks and Kafka/KinesisIndexTasks from consuming excessive memory, it would be safer if there were a configurable limit on the total number of segments that can be opened by a single task. An example of when a situation like this might occur is an indexing task with a fine `segmentGranularity` (e.g. an hour) ingesting a backfilled/late data stream that has a small number of events per hour spread over a large number of hours. Without any safeguards, the task would try to open an excessive number of segments and OOM due to the overhead of opening files.
Proposed changes
I plan on adding a `maxTotalSegments` field to `AppenderatorConfig`, allowing users to set it in the ingestion spec. If the number of mutable segments exceeds `maxTotalSegments` during indexing, the segments will be pushed to deep storage (`BatchAppenderatorDriver#pushAllAndClear()`). This behaviour is similar to what occurs if the number of rows exceeds `maxTotalRows`. For this to work, `AppenderatorDriverAddResult#isPushRequired()` should be modified to also check whether the number of segments is above the limit. This is feasible since the number of sinks (which have a 1:1 relation to mutable segments) is accessible from `AppenderatorImpl#add()` and can be returned as a new field in `AppenderatorDriverAddResult`, as sketched below.
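A hedged sketch of the extended push check. The `numSegments` field and the two-argument `isPushRequired` signature are assumptions drawn from this proposal, not Druid's current `AppenderatorDriverAddResult` API:

```java
// Sketch only: an add-result carrying a segment count, per this proposal.
// Field names and the method signature are assumptions, not Druid code.
public class AddResultSketch
{
  private final long numRowsInSegment;
  private final int numSegments; // proposed: sink count reported by AppenderatorImpl#add()

  public AddResultSketch(long numRowsInSegment, int numSegments)
  {
    this.numRowsInSegment = numRowsInSegment;
    this.numSegments = numSegments;
  }

  // Push when either the existing row-based limit or the proposed
  // segment-based limit has been reached.
  public boolean isPushRequired(long maxTotalRows, int maxTotalSegments)
  {
    return numRowsInSegment >= maxTotalRows || numSegments >= maxTotalSegments;
  }
}
```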
Rationale
I don't think there are any other options for capping the number of segments opened by an indexing task/supervisor.
Operational impact
Adding a field with a default value to the ingestion spec shouldn't have an operational impact.