Skip to content
This repository has been archived by the owner on Apr 18, 2024. It is now read-only.

Metrics Refactor and record metrics around L1 Server-Timings #70

Merged
merged 5 commits into from
Mar 31, 2023

Conversation

aarshkshah1992
Copy link
Contributor

@aarshkshah1992 aarshkshah1992 commented Mar 31, 2023

Some refactor/fixes of the metrics code for correctness/grokability/dryness
Add more dimensions/labels to metrics to better capture block/car, cache-hit/cache-miss cases
Record prometheus metrics around important L1 server-timings

pool.go Outdated
@@ -17,6 +17,10 @@ import (
"github.com/serialx/hashring"
)

var (
maxPoolSize = 300
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be a const rather than a var?

pool.go Outdated
goLogger.Infow("trimmed pool list to 200", "first", n[0].url, "first_weight",
n[0].weight, "last", n[199].url, "last_weight", n[199].weight)
n = n[:maxPoolSize]
goLogger.Infow(fmt.Sprintf("trimmed pool size to %d", maxPoolSize), "first", n[0].url, "first_weight",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why the combination of infow and sprintf? can't you have "size", maxPoolSize as another pair of keys directly to infoW?

Help: "Latency observed during failed caboose car fetches from a single peer",
Buckets: durationPerCarHistogram,
fetchDurationBlockFailureMetric = prometheus.NewHistogram(prometheus.HistogramOpts{
Name: prometheus.BuildFQName("ipfs", "caboose", "fetch_duration_block_failure"),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

renaming current metrics will break current dashboards - why is this changing?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@willscott Not all the metrics have been renamed. Only some that were either incorrectly named or some that have taken the place of some repetitive ones.

I'll ensure the Caboose/Bifrost dashboard/metrics encompass the correct names. We're anyways revamping both the dashboards.

fetcher.go Outdated
@@ -284,6 +284,11 @@ func updateSuccessServerTimingMetrics(timingHeaders []string, resourceType strin
fetchDurationPerPeerSuccessCacheMissTotalLassieMetric.WithLabelValues(resourceType).Observe(float64(m.Duration.Milliseconds()))
case "nginx":
fetchDurationPerPeerSuccessTotalL1NodeMetric.WithLabelValues(resourceType, cache_status).Observe(float64(m.Duration.Milliseconds()))
networkTimeMs := totalTimeMs - m.Duration.Milliseconds()
fetchNetworkSpeedPerPeerSuccessMetric.WithLabelValues(resourceType).Observe(float64(recieved) / float64(networkTimeMs))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about when networkTimeMs is 0? - that leads to a division by 0

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed and added more sanity checks.

@willscott willscott merged commit 3c2f141 into main Mar 31, 2023
@willscott willscott deleted the feat/metrics-verification branch March 31, 2023 14:52
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants