Fix bugs, refactor, document and land the weighted consistent hashing branch #26
Conversation
// trigger early refreshes when pool size drops below this low watermark
PoolLowWatermark int
MaxConcurrency int
This is never used.
PoolFailureDownvoteDebounce time.Duration
PoolMaxSize int
This is never used.
m.Unlock()
nm := NewMember(m.url)
nm.replication = updateWeightF(m.replication)
nm.lastUpdate = time.Now()
We need to update the lastUpdate time on this newly constructed member, not on the old member: the pool will now be updated with this member, and we want to retain the last-updated time across weight updates.
i think change it on both - if a couple of requests time out simultaneously against the previous member, we want only one of them to apply the re-weighting
// it owns and runs. We should probably just forget about the Saturn endpoints that were
// previously in the pool but are no longer being returned by the orchestrator. It's highly
// likely that the Orchestrator has deemed them to be non-functional/malicious.
// Let's just override the old pool with the new endpoints returned here.
The orchestrator periodically tests the L1s to maintain a list of responsive L1s.
I've asked for access to https://github.com/filecoin-saturn/orchestrator to see what exactly is "checked".
I don't think we can trust Orchestrator to give us useful L1s; we need to verify ourselves.
The tests Orchestrator performs are not end-to-end, use hardcoded CIDs, and if the first try fails, they allow a cached result (so nodes that lost retrieval capability, but have the test CIDs in cache, are never penalized):
- https://github.com/filecoin-saturn/orchestrator/blob/04b1f712ef612b340142f3e49e89a995e7e3154e/cron/health-check.js#L27
- https://github.com/filecoin-saturn/orchestrator/blob/04b1f712ef612b340142f3e49e89a995e7e3154e/cron/random-health-check.js#L20-L32
Orchestrator's opinion would be useful only if it ran a real IPFS node, added random data to IPFS, announced it on the DHT, and then attempted to retrieve it via a specific L1.
This means "nearby" has little value.
We need to fetch as many L1s as we can, and then let the weighting logic find the most useful ones.
Currently, we only get 25, but with https://github.com/filecoin-saturn/orchestrator/issues/83 we will be able to ask for more.
@aarshkshah1992 update: we can now use https://orchestrator.strn.pl/nodes/nearby?count=9999 to grab as many L1s as we need.
@lidel This makes sense to me. I think you'll have to make this change in Bifrost GW as caboose uses whatever orchestrator API URL you pass to it. The caboose Config param you'll have to change is called OrchestratorEndpoint.
i think mostly this was: what happens if we get an empty list from the orchestrator because it has a bug / loses its DB?
it doesn't seem to hurt much to keep using known, working nodes.
maybe we ask for some way for the orchestrator to more forcefully tell us to replace?
We could be asking for ?count=9999 and see all of them, and decrease the weight of ones that were removed.
This way, if Orchestrator returns an empty list, or only 4 for some reason, we are not left with a broken client – weights will be applied to all, so no net impact from orchestrator hiccups/misconfigurations.
pool.go
fb = time.Now()
code = resp.StatusCode
proto = resp.Proto
respReq = resp.Request
defer resp.Body.Close()

// TODO: What if the Saturn node is malicious? We should have an upper bound on how many bytes we read here.
rb, err := io.ReadAll(resp.Body)
@willscott @lidel Is this a concern?
@aarshkshah1992 yes, for a block response we don't expect blocks bigger than 4MiB (the bitswap limit). Use io.LimitedReader to cap the amount of data read here, just to be safe (and error if they send anything bigger).
While at it, there is no point in reading the body or calculating a hash for responses other than HTTP 200.
Check if resp.StatusCode != http.StatusOK early, and fail fast, saving CPU and time.
Returning a different error for a timeout than for resp.StatusCode != http.StatusOK will allow us to apply different penalties (a timeout means the node is trying to find content; an HTTP error is just a hard fail).
PR updated with both these changes.
cmd/caboose/main.go
@@ -38,7 +38,7 @@ func main1() int {
 	}
 	out := args.Get(1)

-	oe, _ := url.Parse("https://orchestrator.strn.pl/nodes/nearby")
+	oe, _ := url.Parse("https://orchestrator.strn.pl/nodes/nearby?count=100")
This parameter should be moved inside of caboose as a const, appended to the endpoint.
You don't want to ask bifrost-gateway to change the URL every time you want to adjust the strategy for refreshing L1s :)
I've made the entire OrchestratorEndpoint param an optional value where the default will be this URL with 1000 nodes. So, you can totally skip passing this param from Bifrost.
if received > 0 && received <= maxBlockSize {
	buf = buf[:received]
} else {
	return nil, ErrBackendFailed
This should be a separate error; otherwise we will bang our heads over why things do not work N years from now, if ever the block size policy changes (but fine to do it in #28).
Yeah, will do it in #28.
Co-authored-by: Marcin Rataj <lidel@lidel.org>
@willscott Test is failing because we moved the Saturn client to use https and looks like it does NOT trust the cert given by the httptest TLS server. I'll take a look, but a test against the Saturn network runs fine.
@aarshkshah1992 I think this is why we have https://github.com/ipfs/bifrost-gateway/blob/main/blockstore.go#L82-L84. If you move this hack into caboose, it should fix your cmd and help us clean up the Saturn-related hack from bifrost-gateway.
@aarshkshah1992 I've fixed
233eb8a was able to retrieve bafybeigdyrzt5sfp7udm7hu76uh7y26nf3efuylqabf3oclgtqy55fbzdi, I am switching bifrost-gateway to this commit: ipfs-inactive/bifrost-gateway#39
@willscott ok to merge?
re-find idx in lock
This PR aims to land #19 and make some changes on top of it.
The main changes include:
- Takes care of the "slowly re-fill the reputation of down-weighted backends as they successfully return content" TODO in Use a more advanced weighting / fail-over strategy #19. The way we do this is by bumping up the weights of downvoted nodes that successfully return content by 20 percent on every success (with a debounce) until they hit the default weight of 20 that we assign to all nodes.
- Addresses @lidel's comment at Use a more advanced weighting / fail-over strategy #19 (comment) by ensuring we time out requests to the Saturn L1s with a reasonable default.
- I want to make some changes around how we refresh the pool after getting back a list of Saturn nodes from the Orchestrator and how we read the blocks in memory. Please take a look at Fix bugs, refactor, document and land the weighted consistent hashing branch #26 (comment) and Fix bugs, refactor, document and land the weighted consistent hashing branch #26 (comment) and let me know what you think.
- I think I found and fixed a bug around how we were updating the weights. Please see Fix bugs, refactor, document and land the weighted consistent hashing branch #26 (comment).
- Some refactoring to remove boilerplate, plus added documentation for people to grok what's happening in the code.
- I'll add some comprehensive unit tests in follow-up PRs as there's a lot of code here that needs testing.
Closes #27 Closes #25