decouple column serializer compression closers from SegmentWriteOutMedium #16076
Description
This PR modifies column serializers that use compression to split out the `Closer`, so that we don't have to use the one from `SegmentWriteOutMedium.getCloser()`. This allows serializers to optionally release the direct memory that most compression strategies allocate much earlier than when segment serialization is completed.

Most callers still use `SegmentWriteOutMedium.getCloser()` because it doesn't matter too much, but this change allows nested column serialization to use a separate closer associated with an individual `GlobalDictionaryEncodedFieldColumnWriter` (one of these for each field), releasing the buffers as soon as the nested field is written instead of holding on to them until the entire segment is finished. The main situation this will impact is when there are a very large number of nested fields, as the relatively small amount of direct memory used by each individual buffer (usually 64 KB or so) can quickly add up when there are thousands of paths.

I wasn't able to add any direct tests to confirm the buffers are freed prior to the `SegmentWriteOutMedium` itself being closed, because all of this is tucked pretty far away and would need some gross interface changes to push in an observable closer, but I did at least confirm in the debugger that this is in fact happening. (The same is true of the 'temp' writeout medium these nested columns use to also release the temp files that are no longer needed after the field is serialized; this new closer is closed in the same place.) This code path is hit during tests, so existing passing tests should indicate that this change had no negative effect on serialization.

Release note
Nested column serialization now releases nested field compression buffers as soon as the nested field serialization is completed, which should require significantly less direct memory during segment serialization when many nested fields are present.
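
To illustrate why the scope of the closer matters, here is a minimal, self-contained simulation. It is not the Druid code: `SimpleCloser`, `FakeBuffer`, and `MemoryTracker` are hypothetical stand-ins (the real change works with Druid's `Closer` and direct memory buffers), and the 64 KB buffer size is simply the approximate figure quoted in the description. The sketch compares peak outstanding buffer memory when every field registers its buffer with one segment-wide closer versus when each field has its own closer that is closed as soon as the field is written.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Deque;

public class CloserScopeSketch {
  // Assumed per-field buffer size, taken from the "usually 64 KB or so" estimate.
  static final int BUFFER_BYTES = 64 * 1024;

  /** Tracks current and peak simulated buffer bytes. */
  static final class MemoryTracker {
    long current;
    long peak;

    void allocate(long bytes) {
      current += bytes;
      peak = Math.max(peak, current);
    }

    void release(long bytes) {
      current -= bytes;
    }
  }

  /** Hypothetical minimal closer: closes registered resources in reverse order. */
  static final class SimpleCloser implements Closeable {
    private final Deque<Closeable> resources = new ArrayDeque<>();

    <T extends Closeable> T register(T resource) {
      resources.push(resource);
      return resource;
    }

    @Override
    public void close() throws IOException {
      while (!resources.isEmpty()) {
        resources.pop().close();
      }
    }
  }

  /** Hypothetical stand-in for a compression buffer; only does bookkeeping. */
  static final class FakeBuffer implements Closeable {
    private final MemoryTracker tracker;

    FakeBuffer(MemoryTracker tracker) {
      this.tracker = tracker;
      tracker.allocate(BUFFER_BYTES);
    }

    @Override
    public void close() {
      tracker.release(BUFFER_BYTES);
    }
  }

  /** All fields share one closer that is closed only when the segment is done. */
  public static long peakWithSharedCloser(int fields) throws IOException {
    MemoryTracker tracker = new MemoryTracker();
    try (SimpleCloser segmentCloser = new SimpleCloser()) {
      for (int i = 0; i < fields; i++) {
        segmentCloser.register(new FakeBuffer(tracker));
        // ... field i would be written here; its buffer stays allocated ...
      }
    }
    return tracker.peak;
  }

  /** Each field gets its own closer, closed right after the field is written. */
  public static long peakWithPerFieldCloser(int fields) throws IOException {
    MemoryTracker tracker = new MemoryTracker();
    for (int i = 0; i < fields; i++) {
      try (SimpleCloser fieldCloser = new SimpleCloser()) {
        fieldCloser.register(new FakeBuffer(tracker));
        // ... field i would be written here ...
      } // the buffer is released before the next field starts
    }
    return tracker.peak;
  }

  public static void main(String[] args) throws IOException {
    System.out.println("shared closer peak bytes:    " + peakWithSharedCloser(1000));
    System.out.println("per-field closer peak bytes: " + peakWithPerFieldCloser(1000));
  }
}
```

In this simulation the shared closer peaks at one buffer per field (1000 buffers for 1000 fields), while the per-field closer never holds more than a single buffer at a time. Real serializers may hold several buffers per field, so the actual savings depend on the column type and compression strategy.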
This PR has: