Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Tracking issue] http delegated routing used in production #9150

Closed
7 of 9 tasks
Tracked by #709
BigLep opened this issue Jul 25, 2022 · 29 comments
Closed
7 of 9 tasks
Tracked by #709

[Tracking issue] http delegated routing used in production #9150

BigLep opened this issue Jul 25, 2022 · 29 comments
Assignees

Comments

@BigLep
Copy link
Contributor

BigLep commented Jul 25, 2022

Done Criteria

  1. HTTP delegated routing is enabled on the ipfs.io gateway, hitting cid.contact in parallel to querying the DHT to determine providers for a CID.
  2. ipfs.io gateway operators are able to assess delegated routing calls health with metrics.

Notes

This corresponds with "stage 1" in https://www.notion.so/pl-strflt/Indexer-IPFS-Reframe-Q3-Plan-77d0f3fa193b438784d2d85096e315e7

It doesn't matter if STI gives empty results or if the results are unreachable because they're to Filecoin Storage Providers that don't allow data transfer with Bitswap. Enabling retrieval over Bitswap for Filecoin SP's is a separate effort.

Reframe shipped in Kubo 0.14 as part of #8775. That said, it isn't used in production that we know of. The purpose of this tracking issue is to track followups that we have already identified as needing to be done or that come up as a result of attempting production deployment.


Expanded scope for items observed while in production:

Other followup issues that should be done, but aren't directly tied to the done criteria.

@BigLep BigLep changed the title [Tracking issue] reframe support followups (post v1) [Tracking issue] reframe used in production Aug 8, 2022
@BigLep
Copy link
Contributor Author

BigLep commented Aug 19, 2022

2022-08-19 conversation:

Tasks involved in finishing this off:

  1. Polish off the design
  • Answer last remaining questions.
  • We didn't estimate this, but it should be low/quick.
  1. New routers (3 dev days)
  • New routers
    • Sequential router
    • Parallel router
    • Router to filter by method
  • May include refractoring
  • Includes unit tests
  • Code will live in https://github.com/libp2p/go-libp2p-routing-helpers
  • Going to use existing routers if possible so we have less code to maintain but if this turns into a headache with Lotus, then may need new version.
  1. Make this work in Kubo (5 dev days)
  • Config parsing
  • Generate Routing instances from config (simple approach)
  • Using FX
  1. Documentation (1 dev days)
  1. Sharness smoke test (1 dev day)
  • This will be adjusting the existing Sharness test we have.
  1. Operational preparation (1 dev days)
  • Need to answer how much time each of the routers are taking
  • Edelweiss: define metrics we want, RPC metrics are already there
  1. Testing on gateways (2 dev days)

Total estimate: 13 ideal dev days

Assuming a velocity of .7 ideal dev day to actual day, that puts us at 19 calendar days assuming just one person working on this.

Next steps here:

  • @BigLep communicate worst-case calendar date in #reframe
  • Clean up the existing issues
  • Create new issues that PRs can be linked against

@BigLep
Copy link
Contributor Author

BigLep commented Aug 20, 2022

@ajnavarro : I moved over the estimates from #9188 (comment) to here.

Can you help with the issue cleanup here? Specifically:

  1. I think we need a new issue in go-libp2p-routing-helpers for the new routers.
  2. I assume we'll update Routing: conform DHT to common Router interface #9079 to account for the design changes. This is "Make this work in Kubo" above right or are you going to create another issue?
  3. I assume Routing: priority and mandatory Router Params #9083 and Delegated Routing: Allow to configure methods used by router.  #9157 can be closed in favor of the work happening in # 2 above, is that right?

@ajnavarro
Copy link
Member

@BigLep
Copy link
Contributor Author

BigLep commented Aug 22, 2022

@ajnavarro : thanks. I updated "# 3" above. I believe the confusion stemmed from the fact that I wrote "#2" which got translated to issue "#2" in the repo.

@ajnavarro
Copy link
Member

@BigLep yep, we can close #9083

@BigLep
Copy link
Contributor Author

BigLep commented Aug 23, 2022

As communicated in Slack:

Total estimate is 13 ideal dev days
Assuming a velocity of .7 ideal dev day to actual day, that puts us at 19 calendar days assuming just one person working on this.
That feels high to me, but I think Bedrock should plan around that for now given we'll likely experience a jolt when Adin heads out, there is ongoing Kubo maintenance efforts that the team gets pulled into, and I don't think we'll be able to have an extra engineer on this beyond helping with code reviews.

@BigLep
Copy link
Contributor Author

BigLep commented Aug 23, 2022

@ajnavarro : should #9157 combine into #9079 or will the config work be a separate PR. If it will be separate then I agree we should have two issues.

@ajnavarro
Copy link
Member

@BigLep I think we can do them separately.

@ajnavarro
Copy link
Member

  • Config parser with tests: Done
  • Create Router instances from validated config: Done
  • Integrate with fx: In progress

@ajnavarro
Copy link
Member

@lidel
Copy link
Member

lidel commented Oct 10, 2022

I was not sure where "productizing" https://routing.delegate.ipfs.io/reframe should live, so I filled https://github.com/protocol/bifrost-infra/issues/2142

@BigLep
Copy link
Contributor Author

BigLep commented Oct 17, 2022

@guseggert : I added the important area to the done criteria we were missing last week of "ipfs.io gateway operators are able to assess reframe call health with metrics". Feel free to create a separate tracking issue for this work or do it here.

@BigLep BigLep changed the title [Tracking issue] reframe used in production [Tracking issue] http delegated routing used in production Nov 29, 2022
@BigLep
Copy link
Contributor Author

BigLep commented Nov 29, 2022

I've expanded the done checklist to account for the migration from reframe to the updated HTTP delegated routing API: ipfs/go-delegated-routing#63 . I have updated the title as well.

@ajnavarro ajnavarro removed their assignment Feb 5, 2023
@BigLep
Copy link
Contributor Author

BigLep commented Mar 24, 2023

Pending verification that clientside metrics are flowing for ipfs.io gateway and cid.contact in https://github.com/protocol/bifrost-infra/issues/2183

@BigLep
Copy link
Contributor Author

BigLep commented Apr 25, 2023

2023-04-25 maintainer conversation:
we're still trying to confirm that the ipfs.io gateway is calling cid.contact.
cid.contact is not seeing any requests from ipfs.io gateways per discussion in ipni/storetheindex#1597 (comment) and https://filecoinproject.slack.com/archives/C03RQFURZLM/p1682348755904599 .
@guseggert SSH'd to a box and confirmed it is using "autoclient"
@guseggert confirmed that when he uses the same version of Kubo with "autoclient" that cid.contact is being called
@guseggert is leading the investigation here

@willscott
Copy link
Contributor

I suspect https://github.com/ipfs/boxo/blob/main/routing/http/client/transport.go#L24-L28 could be an issue:

@guseggert
Copy link
Contributor

guseggert commented Apr 25, 2023

if the response body is not fully drained, the http client will not be able to re-use the connection. eventually you'll run out of the pool of clients and I think the logic then is to fail silently.

IIUC the body is not fully drained in the case where the limit is reached, which is indeed a bug, but are there actually records in cid.contact with >1 MiB of providers?

@guseggert
Copy link
Contributor

guseggert commented Apr 25, 2023

My other theory is that it is related to the accelerated DHT client. When I run locally with FullRT client, no reqs are made to cid.contact.

@willscott
Copy link
Contributor

  • there may be records with >1MiB of response

bitswap issues a query for 10 providers and the accelerated DHT returns 20 immediately, terminating the query before it even starts on cid.contact.

I don't think this explains things - we have cases of providers that only are publishing to cid.contact.
there wouldn't be 20 peers already in the DHT there, and we don't find them on gateways.

Just because you have some peers from FullRT you don't need to close the channel / cancel the query on the indexer side, right?

(in the same way that when indexers gets requests requesting a cascade to the DHT they don't cancel the dht cascade once we've returned responses from the index.)

@willscott
Copy link
Contributor

~30% of queries to kubo on gateways end up timing out with no peers, so we should still see those at minimum reaching the indexers, right?

@guseggert
Copy link
Contributor

Okay there's a plumbing bug in Kubo's rats nest of content routing plumbing that is causing this issue, when the experimental DHT client is turned on then the cid.contact router is not used. Working on a fix, could take me a few hours to untangle that mess.

@BigLep
Copy link
Contributor Author

BigLep commented Apr 26, 2023

Thanks @guseggert for the update.

As you are working on this, friendly reminder that we plan to move "accelerated DHT" out of experimental in the next release. If there are followups that need to occur, fee free to drop them to #9703

In terms of releasing this, I assume we'll have it part of 0.20? (We'll talk during 2023-04-27 standup. Alternative is that Bifrost uses a custom branch until this come in a future release.)

A challenge here will be code review given team members being out. I think our best case is that @lidel take a look at it Thursday, 2023-04-27.

@willscott
Copy link
Contributor

i would push to get an RC onto bifrost and validate things are working as expected before releasing

@guseggert
Copy link
Contributor

Agreed we should one-box this on the gateways before releasing to make sure it works, otherwise the feedback loop is brutal

@BigLep
Copy link
Contributor Author

BigLep commented Apr 27, 2023

2023-04-27 maintainer conversation: this is still the top priority of @guseggert that he is actively working on. We don't have a PR yet.

@guseggert
Copy link
Contributor

Okay I have working code now and can confirm that the default HTTP routers are being invoked alongside the FullRT client, will open a PR shortly and the plan is for @Jorropo to take a look on Monday

@BigLep
Copy link
Contributor Author

BigLep commented Apr 28, 2023

Thanks @guseggert . Could you please also give directions on what Bifrost would need to test it? I'd like to get their test/verification happening in parallel.

@BigLep
Copy link
Contributor Author

BigLep commented May 2, 2023

2023-05-02 update: The fix has deployed to "one box" on ipfs.io gateway. Integration looks good. These are going to get rolled out to the whole fleet over the next day or two.

@BigLep
Copy link
Contributor Author

BigLep commented May 5, 2023

I'm closing this issue because ipfs.io gateway <> cid.contact integration is rolled out (see https://github.com/protocol/bifrost-infra/issues/2183#issuecomment-1533342956) and confirmed to be working.

@BigLep BigLep closed this as completed May 5, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Archived in project
Development

No branches or pull requests

5 participants