
[PROPOSAL] Add segment limit for native and streaming index tasks #7238

Open
justinborromeo opened this issue Mar 12, 2019 · 5 comments

@justinborromeo
Contributor

Motivation

To prevent IndexTasks and Kafka/KinesisIndexTasks from consuming excessive memory, it would be safer if there were a configurable limit on the total number of segments that can be opened by a single task. A situation like this can occur when an indexing task has a fine segmentGranularity (e.g. an hour) and ingests a backfilled/late data stream with a small number of events per hour spread over a large number of hours. Without any safeguards, the task would try to open an excessive number of segments and OOM due to the overhead of opening files.

Proposed changes

I plan to add a maxTotalSegments field to AppenderatorConfig, allowing users to set it in the ingestion spec.

| Property | Description | Default | Required? |
|---|---|---|---|
| maxTotalSegments | The maximum number of mutable segments that an indexing task is allowed to have open at one time. | 1000 | No |

If the number of mutable segments exceeds maxTotalSegments during indexing, the segments will be pushed to deep storage (BatchAppenderatorDriver#pushAllAndClear()). This behaviour is similar to what occurs if the number of rows exceeds maxTotalRows. For this to work, AppenderatorDriverAddResult#isPushRequired() should be modified to also check whether the number of segments is above the limit. This is feasible since the number of sinks (which have a 1:1 relation to mutable segments) is accessible from AppenderatorImpl#add() and can be returned as a new field in AppenderatorDriverAddResult.
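
As a rough illustration of the intended push decision (the class and method below are only a sketch with illustrative names, not the actual Druid code), a push would become required when any of the three limits is exceeded:

```java
// Rough sketch only: illustrative names, not the actual Druid classes.
// A push becomes required when any of the three limits is exceeded:
// per-segment rows, total rows, or the proposed total segment count.
public class PushDecisionSketch
{
  public static boolean isPushRequired(
      int numRowsInSegment,      // rows in the segment the last row went into
      long totalNumRows,         // rows across all mutable segments in the task
      int totalNumSegments,      // mutable segments (sinks) currently open
      Integer maxRowsPerSegment,
      Long maxTotalRows,
      Integer maxTotalSegments   // the proposed new limit
  )
  {
    final boolean overRowsPerSegment =
        maxRowsPerSegment != null && numRowsInSegment >= maxRowsPerSegment;
    final boolean overTotalRows =
        maxTotalRows != null && totalNumRows >= maxTotalRows;
    final boolean overTotalSegments =
        maxTotalSegments != null && totalNumSegments > maxTotalSegments;
    return overRowsPerSegment || overTotalRows || overTotalSegments;
  }
}
```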

Rationale

I don't think there are any other options for capping the number of segments opened by an indexing task/supervisor.

Operational impact

Adding a field with a default value to the ingestion spec shouldn't have any operational impact.

@jihoonson
Contributor

@justinborromeo thank you for the proposal!

It looks good to me, but I have one suggestion. We currently have maxTotalRows to cap the total number of rows across all segments being generated at the same time. This sometimes works well, but I feel it's not easy to control because it's hard to know exactly what its value means (I'm still not sure what a good default value for it would be). However, maxTotalSegments is quite intuitive, and effectively maxTotalRows = maxTotalSegments * maxRowsPerSegment, so maybe we can control maxTotalRows by setting maxTotalSegments and maxRowsPerSegment appropriately.

So, I would suggest replacing maxTotalRows with maxTotalSegments rather than having both. I think they would be used in similar use cases, and fewer configurations would make tuning simpler. What do you think?

@justinborromeo
Contributor Author

justinborromeo commented Apr 8, 2019

@jihoonson I think maxRowsPerSegment, maxTotalRows, and maxTotalSegments are all necessary since they protect against different cases.

  • maxRowsPerSegment exists so that a single segment doesn't contain too much data
  • maxTotalRows exists for the case where there is a moderately high number of segments, each with a large amount of data
  • maxTotalSegments exists for the case where there is an extremely high number of segments, each with a small amount of data

As an example, if we use the maxTotalRows = maxTotalSegments * maxRowsPerSegment calculation with the existing 5M maxRowsPerSegment default and the proposed 1K maxTotalSegments default, the calculated maxTotalRows is 5B, which seems a bit high.

Do you agree?

@justinborromeo
Contributor Author

justinborromeo commented Apr 17, 2019

After taking a first crack at this, I've realized that publishing segments once the number of segments in the Appenderator exceeds some limit results in the row that triggers the limit being stored alone in a new one-row segment. This is because the proposed implementation checks the number of segments after the row has been stored.

A better behaviour would be to check whether a row will push the number of segments over the limit before the row is stored, which prevents the creation of one-row segments.
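
Here's a rough sketch of what I mean (AppenderatorLike and its methods are hypothetical stand-ins, not the real Appenderator API; this just illustrates the check-before-store ordering):

```java
import java.io.IOException;

// Hypothetical sketch only: AppenderatorLike and its methods are stand-ins, not
// the real Druid Appenderator API. The point is the ordering: decide whether the
// row would open a new segment and whether that would exceed maxTotalSegments,
// and if so, push the existing segments *before* storing the row.
class SegmentLimitedAdder
{
  interface AppenderatorLike
  {
    boolean wouldCreateNewSegment(String segmentId); // would this row open a new sink?
    int openSegmentCount();                          // mutable segments currently open
    void pushAllAndClear() throws IOException;       // push everything to deep storage
    void add(String segmentId, Object row) throws IOException;
  }

  private final AppenderatorLike appenderator;
  private final int maxTotalSegments;

  SegmentLimitedAdder(AppenderatorLike appenderator, int maxTotalSegments)
  {
    this.appenderator = appenderator;
    this.maxTotalSegments = maxTotalSegments;
  }

  void add(String segmentId, Object row) throws IOException
  {
    // Check the limit before the row is stored, not after, so the row that trips
    // the limit does not end up alone in a brand-new one-row segment.
    if (appenderator.wouldCreateNewSegment(segmentId)
        && appenderator.openSegmentCount() >= maxTotalSegments) {
      appenderator.pushAllAndClear();
    }
    appenderator.add(segmentId, row);
  }
}
```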

@stale

stale bot commented Jun 20, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Jun 20, 2019
@justinborromeo
Contributor Author

Commenting to keep the issue open

@stale stale bot removed the stale label Jun 20, 2019