Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

decouple column serializer compression closers from SegmentWriteOutMedium #16076

Conversation

clintropolis
Copy link
Member

@clintropolis clintropolis commented Mar 8, 2024

Description

This PR modifies column serializers that use compression to split out the Closer so that we don't have to use the one from SegmentWriteoutMedium.getCloser(), allowing serializers to optionally release the direct memory that most compression strategies allocateSegmentWriteOutMediummuch earlier than when the segment serialization is completed.

Most callers still use SegmentWriteOutMedium.getCloser() because it doesn't matter too much, but this change allows nested column serialization to use a separate closer associated with an individual GlobalDictionaryEncodedFieldColumnWriter (one of these for each field) to release as soon as the nested field is written instead of holding on to them until the entire segment is finished. The main situation this will impact is when there are a very large number of nested fields, as the relatively small amount of direct memory used by each individual buffer (usually 64kb or so) could quickly add up when there are thousands of paths.

I wasn't able to add any direct tests to confirm the buffers are freed prior to the SegmentWriteOutMedium itself being closed because all of this stuff is tucked pretty far away and would have to have some gross interface changes to push in an observable closer, but, did at least confirm in the debugger that this is in fact happening. (The same is true of the 'temp' writeout medium these nested columns use to also release the temp files that are no longer needed after the field is serialized, this new closer is closed in the same place). This code path is hit during tests, so existing passing tests should indicate that this change had no negative effect on serialization.

Release note

Nested column serialization now releases nested field compression buffers as soon as the nested field serialization is completed, which should require significantly less direct memory during segment serialization when many nested fields are present.


This PR has:

  • been self-reviewed.
  • a release note entry in the PR description.
  • been tested in a test Druid cluster.

…dium to optionally allow serializers to release direct memory allocated for compression earlier than when segment is completed
@clintropolis clintropolis force-pushed the nested-release-buffers-during-serialize branch from 0c49bda to b053645 Compare March 8, 2024 00:21
@clintropolis clintropolis merged commit 313da98 into apache:master Mar 11, 2024
83 checks passed
@clintropolis clintropolis deleted the nested-release-buffers-during-serialize branch March 11, 2024 19:28
@adarshsanjeev adarshsanjeev added this to the 30.0.0 milestone May 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants