
Augment statistics to note how many bytes are in duplicate lines due to replicas #9400

Merged: 9 commits merged into main on May 8, 2023

Conversation

paul1r (Collaborator) commented May 4, 2023

What this PR does / why we need it:
This PR counts the number of bytes in log lines that were marked as duplicates. This count will be used to collect better statistics.

Which issue(s) this PR fixes:
We previously were not tracking this data.

Special notes for your reviewer:

Checklist

  • Reviewed the CONTRIBUTING.md guide (required)
  • Documentation added
  • Tests updated
  • CHANGELOG.md updated
  • Changes that require user attention or interaction to upgrade are documented in docs/sources/upgrading/_index.md

@paul1r paul1r requested a review from a team as a code owner May 4, 2023 20:58
dannykopping (Contributor) left a comment


Overall LGTM! Added a question.

Nit: can you please give your PR a more descriptive title (and remove your username?)

@@ -144,6 +145,7 @@ func (i *mergeEntryIterator) fillBuffer() {
	for _, t := range previous {
		if t.Entry.Line == entry.Line {
			i.stats.AddDuplicates(1)
+			i.stats.AddDuplicateBytes(int64(len(entry.Line)) + 2*binary.MaxVarintLen64)
Contributor:

What's the significance of 2*binary.MaxVarintLen64 here?
I see this repeated in a couple of places; maybe a helper with an explanatory comment would be useful.

paul1r (Collaborator, PR author):

I swiped the 2*binary.MaxVarintLen64 from memchunk.go. When adding to the decompressed-bytes statistic, that addition of ~20 bytes appears to account for the timestamp.

That said, if it's not appropriate here, it's easy enough to remove.

Contributor:

I have no idea TBH. It just triggered my wtf reflex 😛
I'd check in with the author of that line to see if we need it here.

Contributor:

It depends on what we want to compute, but the bytes-processed statistic accounted for the line length and the timestamp of each line, since both are encoded and read for memory chunks. The encoding used is varint, and I wanted to avoid encoding each value during reads just to learn its size, so I went with the maximum number of bytes a varint-encoded 64-bit number can take.

A few years on, I think we might want to be more precise and estimate the size of those two values better. For instance, we know the maximum line length (most likely below 10k), and timestamps are usually large (2535372481000000000 in the year 2050).

paul1r (Collaborator, PR author):

With that said, I think I'll remove that computation for now. Thank you for the input!

…an look into making one more exact as opposed to an approximation of the timestamp length
@paul1r paul1r changed the title Paul1r/track duplicate bytes Augment statistics to note how many bytes are in duplicate lines due to replicas May 5, 2023
dannykopping (Contributor) left a comment


LGTM, one nit

pkg/logqlmodel/stats/context.go (review thread outdated, resolved)
paul1r and others added 2 commits May 8, 2023 08:44
Clean up function naming to be more consistent

Co-authored-by: Danny Kopping <danny.kopping@grafana.com>
@dannykopping dannykopping merged commit 1671751 into main May 8, 2023
@dannykopping dannykopping deleted the paul1r/track_duplicate_bytes branch May 8, 2023 12:59
paul1r added a commit that referenced this pull request May 9, 2023
3 participants