fix: lower 404 ttl to decrease end user failures #1344

Merged
merged 1 commit into from
Mar 7, 2023

lidel (Contributor) commented Mar 1, 2023

TLDR

Amazon CloudFront HTTP caching of false-negative CID lookups is DoS-ing all Saturn L1s using Lassie.

Context for users trying to access content via ipfs.io gateway

Any hiccup in content routing of a CID is cached for 5 minutes: no L1 can retrieve it, and the gateway can't return it.

Context for users (developers) running their own IPFS node / trying IPFS for the first time

  1. The user adds content to IPFS node, then tries to open it in a browser before sharing link with a friend.
    • 👉 This is what every IPFS user was trained to do for the past 7 years, and tools like IPFS Companion and Brave copy shareable links to clipboard, so user can paste and load CID via gateway within seconds.
    • 👉 This is also what every developer, CTO making decisions / evaluating IPFS will do in the first hour of playing with IPFS.
  2. In Rhea gateways backed by Saturn L1s, the IPNI DHT proxy is used for content routing, and that creates a UX papercut:
    • Different Saturn L1s will ask IPNI for the same CID multiple times.
    • If a user tries to load content too fast, and one L1+Lassie fails to find providers, Caboose (Saturn Client) will retry using a different L1, and that will also fail due to cached 404.
    • User will try to refresh the page, and it will fail again because there is no L1 that can find providers until cached 404 expires (300s)
  • 👉 End result? The user is unable to fetch content that is available on the DHT for the next 5 minutes, and will report or form an opinion that the IPFS DHT is slow / flaky, while in reality an HTTP cache TTL on a centralized service was set too high.
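
The failure mode above can be reproduced with a toy model. The sketch below is not Saturn, Lassie, or CloudFront code: it assumes a single shared negative cache in front of the indexer, that every L1 goes through that same cache, and that providers become discoverable a fixed number of seconds after publishing. All function and variable names are hypothetical.

```shell
#!/bin/sh
# Toy model of a shared negative (404) cache in front of a content-routing
# lookup. Hypothetical names throughout; times are seconds since publish.

# lookup TTL AVAILABLE_AT T1 T2 ...
# Prints one "t=<time> <status>" line per request time.
lookup() {
  ttl=$1; available_at=$2; shift 2
  cached_until=-1            # no 404 cached yet
  for t in "$@"; do
    if [ "$t" -lt "$cached_until" ]; then
      status=404             # served from cache, origin is never asked
    elif [ "$t" -lt "$available_at" ]; then
      status=404             # genuine miss at origin; cached for TTL seconds
      cached_until=$((t + ttl))
    else
      status=200             # origin finds providers
    fi
    echo "t=$t $status"
  done
}

# First request 1s after publish (too early), then retries.
echo "TTL=300"; lookup 300 2 1 10 60 299 301
echo "TTL=5";   lookup 5 2 1 10 60 299 301
```

In this model a single early request at t=1 poisons the cache: with a 300s TTL every retry fails until t=301, while with a 5s TTL the retry at t=10 already returns 200.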

Proposed Changes

Lowering the CloudFront cache TTL for 404 errors to 5s will fix false-negative content routing errors for end users.
It should still protect you from unwanted load spikes, but the end user will be able to refresh the page without waiting 5 minutes to see their content.

Happy to discuss other values, but 5 minutes is way too high:
the majority of users won't wait and retry after 5 minutes, they will just give up on IPFS.
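
To put rough numbers on the trade-off: assuming the cache forwards at most one lookup per missing CID to the origin per TTL window (an assumption about cache behaviour, not a measured property of the cid.contact deployment), the upper bound on origin lookups for one CID is the observation window divided by the TTL. `max_origin_lookups` is a hypothetical helper for this arithmetic.

```shell
#!/bin/sh
# max_origin_lookups TTL_SECONDS WINDOW_SECONDS
# Upper bound on origin lookups for one CID per cache node, assuming one
# forwarded request per TTL window (integer division, rounded down).
max_origin_lookups() {
  echo $(( $2 / $1 ))
}

echo "TTL 300s: $(max_origin_lookups 300 3600) lookups/hour"
echo "TTL 5s:   $(max_origin_lookups 5 3600) lookups/hour"
```

Under that assumption the bound rises from 12 to 720 origin lookups per hour per CID, which is higher but still capped regardless of how many clients retry.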

Tests

Announcing a single block on DHT and then asking indexer for it sometimes produces 404, and that is cached for 5 minutes, artificially breaking content routing resolution for that CID on Rhea.

$ CID=$(date | ipfs block put -f v0) && sleep 5 && time curl -H "Accept: application/json" https://cid.contact/multihash/$CID\?cascade\=ipfs-dht -i
HTTP/2 200
[..] 2.386 total

$ CID=$(date | ipfs block put -f v0) && sleep 5 && time curl -H "Accept: application/json" https://cid.contact/multihash/$CID\?cascade\=ipfs-dht -i
HTTP/2 404
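
To see how long a negative result sticks, the same lookup can be polled until it stops returning 404. A minimal sketch: `fetch_status` wraps the curl call from the test above and uses curl's `-w '%{http_code}'` to print only the status code. `poll_until_found` and its arguments are hypothetical helpers, not part of any IPFS or IPNI tooling.

```shell
#!/bin/sh
# fetch_status CID: print only the HTTP status code of the indexer lookup.
fetch_status() {
  curl -s -o /dev/null -w '%{http_code}' \
    -H "Accept: application/json" \
    "https://cid.contact/multihash/$1?cascade=ipfs-dht"
}

# poll_until_found CID [INTERVAL] [MAX_TRIES] [STATUS_CMD]
# Calls STATUS_CMD (default: fetch_status) until it prints 200.
# Prints the number of attempts needed; returns 1 if MAX_TRIES is exhausted.
poll_until_found() {
  cid=$1; interval=${2:-5}; max_tries=${3:-72}; status_cmd=${4:-fetch_status}
  try=1
  while [ "$try" -le "$max_tries" ]; do
    if [ "$("$status_cmd" "$cid")" = "200" ]; then
      echo "$try"
      return 0
    fi
    sleep "$interval"
    try=$((try + 1))
  done
  return 1
}

# Usage (requires a running IPFS node and network access):
#   CID=$(date | ipfs block put -f v0) && poll_until_found "$CID" 5 72
```

The number of attempts multiplied by the interval gives a rough estimate of how long the 404 stayed visible for that CID.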

Revert Strategy

You can always undo this 1 line change 🤷

The user adds content to IPFS node, then tries to fetch it in a browser before sharing link with a friend.
For gateways backed by Saturn L1s (which use IPNI DHT proxy), we hit a problem. Different Saturn L1s will ask IPNI for the same CID multiple times.
When one L1 fails, Caboose (Saturn Client) will retry using a different L1, and that will also fail due to the cached 404.
The user will try to refresh the page, and it will fail again because there is no L1 that can find providers for the next ~300 seconds.
The user is unable to fetch content which is available on the DHT for the next 5 minutes, and will report that IPFS is slow / flaky, while in reality an HTTP cache TTL on a centralized service is set too high, and that is all.

Proposal:
Lowering this to 5s should still protect you from unwanted load spikes,
but will fix false-negative content routing errors for end users.

masih (Member) commented Mar 1, 2023

Deferred until Lassie moves to ndjson and it is evident that the caching parameters are not a good fit.

lidel added a commit to filecoin-saturn/caboose that referenced this pull request Mar 2, 2023
- human-readable retry-after errors
- adjusted CID cooldown/failure cache to minimize end user impact
  (1m delay is not the best, but way better than 5m which is plain user
  hostile – context: ipni/storetheindex#1344)
- easier to read and reason about durations, prepare for upstream
  support of RetryAfter errors in go-libipfs/gateway

masih (Member) commented Mar 7, 2023

404s reduced significantly after Lassie moved to an ndjson response instead of non-streaming JSON.

image

Regardless, going to experiment with a lower 404 cache TTL and its impact on query amplification at cid.contact.
