Augment statistics to note how many bytes are in duplicate lines due to replicas #9400
Conversation
Overall LGTM! Added a question.
Nit: can you please give your PR a more descriptive title (and remove your username?)
pkg/iter/entry_iterator.go
```diff
@@ -144,6 +145,7 @@ func (i *mergeEntryIterator) fillBuffer() {
 	for _, t := range previous {
 		if t.Entry.Line == entry.Line {
 			i.stats.AddDuplicates(1)
+			i.stats.AddDuplicateBytes(int64(len(entry.Line)) + 2*binary.MaxVarintLen64)
```
What's the significance of `2*binary.MaxVarintLen64` here?
I see this repeated in a couple places; maybe a helper with an explanatory comment would be useful.
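For illustration, a helper along the lines the reviewer suggests might look like this; the name `duplicateEntrySize` is hypothetical, not from the PR:

```go
package iter

import "encoding/binary"

// duplicateEntrySize is a hypothetical helper of the kind suggested
// above. It approximates the stored size of an entry the same way
// memchunk.go does: the raw line bytes, plus the worst-case varint
// sizes of the timestamp and the line length (binary.MaxVarintLen64
// is 10, hence ~20 extra bytes per entry).
func duplicateEntrySize(line string) int64 {
	return int64(len(line)) + 2*binary.MaxVarintLen64
}
```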
I swiped the `2*binary.MaxVarintLen64` from memchunk.go. When adding the stats for decompressed bytes, that addition of ~20 bytes appears to account for the timestamp.
That being said, if it's not appropriate here, it's easy enough to remove.
I have no idea TBH. It just triggered my wtf reflex 😛
I'd check in with the author of that line to see if we need it here.
It depends what we want to compute, but the bytes-processed stat was accounting for the line length and the timestamp of each line, since those are encoded and read for memory chunks. The encoding used is varint, and I wanted to avoid re-encoding just to learn the size while reading, so I went with the maximum number of bytes a varint-encoded 64-bit number can take.
After a few years, I think we might want to be more precise and estimate the size of those two better. For instance, we know the maximum line length (most likely below 10k), and timestamps are usually big (2535372481000000000 in year 2050).
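To put numbers on that precision point, here is a small self-contained sketch (not from the PR) showing how many bytes a varint actually needs for a 2050-era timestamp and a 10k line length, versus the `binary.MaxVarintLen64` ceiling:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

func main() {
	buf := make([]byte, binary.MaxVarintLen64)

	// A nanosecond timestamp around year 2050, as in the comment above.
	n := binary.PutVarint(buf, 2535372481000000000)
	fmt.Println(n) // 9: one byte under the 10-byte ceiling

	// A generous 10k line length.
	n = binary.PutVarint(buf, 10000)
	fmt.Println(n) // 3: far below the ceiling

	fmt.Println(binary.MaxVarintLen64) // 10
}
```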
With that said, I think I'll remove that computation for now. Thank you for the input!
…an look into making one more exact as opposed to an approximation of the timestamp length
LGTM, one nit
Clean up function naming to be more consistent
Co-authored-by: Danny Kopping <danny.kopping@grafana.com>
What this PR does / why we need it:
This PR counts the number of bytes in log lines that were marked as duplicates. This will be used to collect better statistics.
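As a rough illustration of the accounting this PR adds (a minimal sketch assuming an atomically updated stats context; field names and the `Context` type are guesses, and the actual Loki implementation may differ):

```go
package stats

import "sync/atomic"

// Context holds per-query statistics. Only the duplicate-related
// fields sketched here relate to this PR.
type Context struct {
	duplicates     int64 // entries discarded as replica duplicates
	duplicateBytes int64 // approximate size of those entries
}

// AddDuplicates counts entries discarded as replica duplicates.
func (c *Context) AddDuplicates(n int64) {
	atomic.AddInt64(&c.duplicates, n)
}

// AddDuplicateBytes accumulates the (approximate) size of those entries.
func (c *Context) AddDuplicateBytes(n int64) {
	atomic.AddInt64(&c.duplicateBytes, n)
}
```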
Which issue(s) this PR fixes:
We previously were not tracking this data.
Special notes for your reviewer:
Checklist
- `CONTRIBUTING.md` guide (required)
- `CHANGELOG.md` updated
- `docs/sources/upgrading/_index.md`