This repository has been archived by the owner on Apr 18, 2024. It is now read-only.

Cool down cid failures #52

Merged
merged 4 commits into feat/better-downvoting from feat/cid-failure-cooldown
Mar 1, 2023

Conversation

aarshkshah1992
Contributor

fixes #44

@aarshkshah1992 aarshkshah1992 requested review from willscott and removed request for willscott February 27, 2023 10:52
@aarshkshah1992 aarshkshah1992 changed the title Cool down cid failures [WIP] Cool down cid failures Feb 27, 2023
@aarshkshah1992 aarshkshah1992 changed the title [WIP] Cool down cid failures Cool down cid failures Feb 27, 2023
@@ -66,20 +75,38 @@ const DefaultSaturnGlobalBlockFetchTimeout = 60 * time.Second
const maxBlockSize = 4194305 // 4 MiB + 1 byte
const DefaultOrchestratorEndpoint = "https://orchestrator.strn.pl/nodes/nearby?count=1000"
const DefaultPoolRefreshInterval = 5 * time.Minute
const DefaultMaxCidFailures = 3
const DefaultCidCoolDownDuration = 10 * time.Minute
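For readers following the diff, here is a minimal stdlib-only sketch of what these two constants imply (hypothetical type and method names; the PR itself uses two go-cache instances guarded by a mutex): a CID that fails DefaultMaxCidFailures times enters a cool-down lasting DefaultCidCoolDownDuration, during which fetches for it are rejected.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

const (
	maxCidFailures      = 3                // DefaultMaxCidFailures in the PR
	cidCoolDownDuration = 10 * time.Minute // DefaultCidCoolDownDuration in the PR
)

// cidTracker is a hypothetical stand-in for the PR's two caches: it counts
// fetch failures per CID and marks a CID as cooling down once the failure
// count reaches maxCidFailures.
type cidTracker struct {
	mu            sync.Mutex // guards both maps so they update atomically
	failures      map[string]int
	coolDownUntil map[string]time.Time
}

func newCidTracker() *cidTracker {
	return &cidTracker{
		failures:      make(map[string]int),
		coolDownUntil: make(map[string]time.Time),
	}
}

// recordFailure bumps the failure count; at the threshold the CID enters
// cool-down and its failure counter is reset.
func (t *cidTracker) recordFailure(cid string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.failures[cid]++
	if t.failures[cid] >= maxCidFailures {
		t.coolDownUntil[cid] = time.Now().Add(cidCoolDownDuration)
		delete(t.failures, cid)
	}
}

// isCoolingDown reports whether fetches for cid should currently be rejected.
func (t *cidTracker) isCoolingDown(cid string) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	return time.Now().Before(t.coolDownUntil[cid])
}

func main() {
	t := newCidTracker()
	for i := 0; i < maxCidFailures; i++ {
		t.recordFailure("example-cid")
	}
	fmt.Println(t.isCoolingDown("example-cid")) // true
}
```

Unlike go-cache, this sketch never evicts expired entries; the real implementation relies on the cache's expiry to bound memory.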
Contributor

@lidel may have thoughts on what our negative cache should be here.

Contributor

Currently, about half of the pool seems to fail to fetch some CID(s) every ~10-15m and gets punished for it (from bifrost-gw-staging-metrics):

[Screenshot: bifrost-gw staging metrics, Grafana dashboard, 2023-02-28]

Not sure what the right number is, but it is probably way shorter than 10m. Maybe play it safe and cool down for 1m for now?
We could increase it up to 5m later, but having these spikes and a cooldown longer than DefaultPoolRefreshInterval feels like a recipe for running out of useful nodes before the next refresh.

Contributor Author

Let's continue this discussion in #59

@@ -50,6 +50,10 @@ type pool struct {
refresh chan struct{} // refresh is used to signal the need for doing a refresh of the Saturn endpoints pool.
done chan struct{} // done is used to signal that we're shutting down the Saturn endpoints pool and don't need to refresh it anymore.

cidLk sync.RWMutex
cidFailureCache *cache.Cache // guarded by cidLk
Contributor

go-cache says:

Its major advantage is that, being essentially a thread-safe map[string]interface{}

Can we use its internal thread safety rather than needing an explicit lock for accesses?

Contributor Author

@willscott I want both cidFailureCache and cidCoolDownCache to be updated atomically.
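The atomicity point can be illustrated with a hedged stdlib sketch (not the PR's code): each of two individually thread-safe maps stays consistent on its own, but moving a CID from the failure set to the cool-down set touches both, so an outer lock (the role cidLk plays in the PR) is needed to prevent a concurrent reader from observing the CID in neither map, or in both.

```go
package main

import (
	"fmt"
	"sync"
)

// Two individually thread-safe maps: each Load/Store/Delete is safe on its
// own, but a "move" that touches both is not atomic without an outer lock.
var (
	mu          sync.Mutex // cidLk plays this role in the PR
	failures    sync.Map   // cid -> failure count
	coolingDown sync.Map   // cid -> struct{}
)

// promote moves a cid from the failure map to the cool-down map. Holding mu
// for the whole operation means no concurrent reader (also taking mu) can
// see the intermediate state where the cid is in neither map.
func promote(cid string) {
	mu.Lock()
	defer mu.Unlock()
	failures.Delete(cid)
	coolingDown.Store(cid, struct{}{})
}

func main() {
	failures.Store("example-cid", 3)
	promote("example-cid")
	_, inFailures := failures.Load("example-cid")
	_, inCoolDown := coolingDown.Load("example-cid")
	fmt.Println(inFailures, inCoolDown) // false true
}
```

With the outer mutex in place, plain maps would serve just as well; sync.Map is used here only to mirror go-cache's built-in per-cache thread safety, which (as this sketch shows) does not by itself make cross-cache updates atomic.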

@aarshkshah1992 aarshkshah1992 changed the base branch from main to feat/better-downvoting March 1, 2023 06:35
@aarshkshah1992 aarshkshah1992 merged commit 5d900a9 into feat/better-downvoting Mar 1, 2023
@aarshkshah1992 aarshkshah1992 deleted the feat/cid-failure-cooldown branch March 1, 2023 06:53
Successfully merging this pull request may close these issues.

Cool-down amplification of non-existent CIDs
3 participants