
Augment statistics to note how many bytes are in duplicate lines due to replicas #9400

Merged: 9 commits merged into main on May 8, 2023

Conversation

paul1r (Collaborator) commented May 4, 2023

What this PR does / why we need it:
This PR counts the number of bytes in log lines that were marked as duplicates. This count will be used to collect better statistics.

Which issue(s) this PR fixes:
We previously were not tracking this data.

Special notes for your reviewer:

Checklist

  • Reviewed the CONTRIBUTING.md guide (required)
  • Documentation added
  • Tests updated
  • CHANGELOG.md updated
  • Changes that require user attention or interaction to upgrade are documented in docs/sources/upgrading/_index.md

@paul1r paul1r requested a review from a team as a code owner May 4, 2023 20:58
dannykopping (Contributor) left a comment


Overall LGTM! Added a question.

Nit: can you please give your PR a more descriptive title (and remove your username?)

@@ -144,6 +145,7 @@ func (i *mergeEntryIterator) fillBuffer() {
	for _, t := range previous {
		if t.Entry.Line == entry.Line {
			i.stats.AddDuplicates(1)
+			i.stats.AddDuplicateBytes(int64(len(entry.Line)) + 2*binary.MaxVarintLen64)
Contributor:

What's the significance of 2*binary.MaxVarintLen64 here?
I see this repeated in a couple of places; maybe a helper with an explanatory comment would be useful.

paul1r (Collaborator, PR author):

I swiped the 2*binary.MaxVarintLen64 from memchunk.go. When adding to the decompressed-bytes statistic, that addition of ~20 bytes appears to account for the timestamp.

That said, if it's not appropriate here, it's easy enough to remove.

Contributor:

I have no idea TBH. It just triggered my wtf reflex 😛
I'd check in with the author of that line to see if we need it here.

Contributor:

It depends on what we want to compute, but the bytes-processed statistic accounted for the line length and the timestamp of each line, since both are encoded and read for memory chunks. The encoding used is varint, and I wanted to avoid encoding each value during reads just to learn its size, so I went with the maximum number of bytes a varint-encoded 64-bit number can take.

A few years on, I think we might want to be more precise and estimate the size of those two values better. For instance, we know the maximum line length (most likely below 10k), and timestamps are usually large (2535372481000000000 in the year 2050).

paul1r (Collaborator, PR author):

With that said, I think I'll remove that computation for now. Thank you for the input!

…an look into making one more exact as opposed to an approximation of the timestamp length
@paul1r paul1r changed the title Paul1r/track duplicate bytes Augment statistics to note how many bytes are in duplicate lines due to replicas May 5, 2023
dannykopping (Contributor) left a comment


LGTM, one nit

pkg/logqlmodel/stats/context.go (review thread outdated, resolved)
paul1r and others added 2 commits May 8, 2023 08:44
Clean up function naming to be more consistent

Co-authored-by: Danny Kopping <danny.kopping@grafana.com>
@dannykopping dannykopping merged commit 1671751 into main May 8, 2023
@dannykopping dannykopping deleted the paul1r/track_duplicate_bytes branch May 8, 2023 12:59
paul1r added a commit that referenced this pull request May 9, 2023
3 participants