
[PROPOSAL] Add segment limit for native and streaming index tasks #7238

Open
justinborromeo opened this issue Mar 12, 2019 · 5 comments

@justinborromeo
Contributor

Motivation

To prevent IndexTasks and Kafka/KinesisIndexTasks from consuming excessive memory, it would be safer if there were a configurable limit on the total number of segments that can be opened by a single task. A situation like this can occur when an indexing task has a fine segmentGranularity (e.g. an hour) and ingests a backfilled/late data stream with a small number of events per hour spread over a large number of hours. Without any safeguards, the task would try to open an excessive number of segments and OOM due to the overhead of opening files.

Proposed changes

I plan to add a maxTotalSegments field to AppenderatorConfig, allowing users to set it in the ingestion spec.

| Property | Description | Default | Required? |
|---|---|---|---|
| maxTotalSegments | The maximum number of mutable segments that an indexing task is allowed to have open at one time. | 1000 | No |

If the number of mutable segments exceeds maxTotalSegments during indexing, the segments will be pushed to deep storage (BatchAppenderatorDriver#pushAllAndClear()). This behaviour is similar to what occurs if the number of rows exceeds maxTotalRows. For this to work, AppenderatorDriverAddResult#isPushRequired() should be modified to also check whether the number of segments is above the limit. This is feasible since the number of sinks (which have a 1:1 relation to mutable segments) is accessible from AppenderatorImpl#add() and can be returned as a new field in AppenderatorDriverAddResult.
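
As a rough illustration of the intended push decision (the class and method below are only a sketch with illustrative names, not the actual Druid code), a push would become required when any of the three limits is exceeded:

```java
// Rough sketch only: illustrative names, not the actual Druid classes.
// A push becomes required when any of the three limits is exceeded:
// per-segment rows, total rows, or the proposed total segment count.
public class PushDecisionSketch
{
  public static boolean isPushRequired(
      int numRowsInSegment,      // rows in the segment the last row went into
      long totalNumRows,         // rows across all mutable segments in the task
      int totalNumSegments,      // mutable segments (sinks) currently open
      Integer maxRowsPerSegment,
      Long maxTotalRows,
      Integer maxTotalSegments   // the proposed new limit
  )
  {
    final boolean overRowsPerSegment =
        maxRowsPerSegment != null && numRowsInSegment >= maxRowsPerSegment;
    final boolean overTotalRows =
        maxTotalRows != null && totalNumRows >= maxTotalRows;
    final boolean overTotalSegments =
        maxTotalSegments != null && totalNumSegments > maxTotalSegments;
    return overRowsPerSegment || overTotalRows || overTotalSegments;
  }
}
```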

Rationale

I don't think there are any other options for capping the number of segments opened by an indexing task/supervisor.

Operational impact

Adding a field with a default value to the ingestion spec shouldn't have any operational impact.

@jihoonson
Contributor

@justinborromeo thank you for the proposal!

It looks good to me, but I have one suggestion. We currently have maxTotalRows to cap the total number of rows across all segments being generated at the same time. This sometimes works well, but I feel it's not easy to control because it's hard to know exactly what its value means (I'm still not sure what a good default value for it would be). However, maxTotalSegments is quite intuitive, and effectively maxTotalRows = maxTotalSegments * maxRowsPerSegment, so maybe we can control maxTotalRows by setting maxTotalSegments and maxRowsPerSegment appropriately.

So, I would suggest replacing maxTotalRows with maxTotalSegments rather than having both. I think they would be used in similar use cases, and fewer configurations would make tuning simpler. What do you think?

@justinborromeo
Contributor Author

justinborromeo commented Apr 8, 2019

@jihoonson I think maxRowsPerSegment, maxTotalRows, and maxTotalSegments are all necessary since they protect against different cases.

  • maxRowsPerSegment exists so that a single segment doesn't contain too much data
  • maxTotalRows exists for the case where there is a moderately high number of segments, each with a large amount of data
  • maxTotalSegments exists for the case where there is an extremely high number of segments, each with a small amount of data

As an example, if we use the maxTotalRows = maxTotalSegments * maxRowsPerSegment calculation with the existing 5M maxRowsPerSegment default and the proposed 1K maxTotalSegments default, the calculated maxTotalRows is 5B, which seems a bit high.

Do you agree?

@justinborromeo
Contributor Author

justinborromeo commented Apr 17, 2019

After taking a first crack at this, I've realized that publishing segments once the number of segments in the Appenderator exceeds some limit results in the row that triggers the limit being stored alone in a new one-row segment. This is because the proposed implementation checks the number of segments after the row has been stored.

A better behaviour would be to check whether a row will push the number of segments over the limit before the row is stored, which prevents the creation of one-row segments.
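
Here's a rough sketch of what I mean (AppenderatorLike and its methods are hypothetical stand-ins, not the real Appenderator API; this just illustrates the check-before-store ordering):

```java
import java.io.IOException;

// Hypothetical sketch only: AppenderatorLike and its methods are stand-ins, not
// the real Druid Appenderator API. The point is the ordering: decide whether the
// row would open a new segment and whether that would exceed maxTotalSegments,
// and if so, push the existing segments *before* storing the row.
class SegmentLimitedAdder
{
  interface AppenderatorLike
  {
    boolean wouldCreateNewSegment(String segmentId); // would this row open a new sink?
    int openSegmentCount();                          // mutable segments currently open
    void pushAllAndClear() throws IOException;       // push everything to deep storage
    void add(String segmentId, Object row) throws IOException;
  }

  private final AppenderatorLike appenderator;
  private final int maxTotalSegments;

  SegmentLimitedAdder(AppenderatorLike appenderator, int maxTotalSegments)
  {
    this.appenderator = appenderator;
    this.maxTotalSegments = maxTotalSegments;
  }

  void add(String segmentId, Object row) throws IOException
  {
    // Check the limit before the row is stored, not after, so the row that trips
    // the limit does not end up alone in a brand-new one-row segment.
    if (appenderator.wouldCreateNewSegment(segmentId)
        && appenderator.openSegmentCount() >= maxTotalSegments) {
      appenderator.pushAllAndClear();
    }
    appenderator.add(segmentId, row);
  }
}
```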

@stale

stale bot commented Jun 20, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Jun 20, 2019
@justinborromeo
Contributor Author

Commenting to keep the issue open

@stale stale bot removed the stale label Jun 20, 2019