fix: lower 404 ttl to decrease end user failures #1344

lidel · 2023-03-01T22:22:00Z

TLDR

Amazon CloudFront HTTP caching of false-negative CID lookups is DoS-ing all Saturn L1s using Lassie.

Context for users trying to access content via ipfs.io gateway

Any hiccup in content routing of a CID is cached for 5 minutes: no L1 can retrieve it, and the gateway can't return it.

Context for users (developers) running their own IPFS node / trying IPFS for the first time

The user adds content to IPFS node, then tries to open it in a browser before sharing link with a friend.
- 👉 This is what every IPFS user was trained to do for the past 7 years, and tools like IPFS Companion and Brave copy shareable links to clipboard, so user can paste and load CID via gateway within seconds.
- 👉 This is also what every developer, CTO making decisions / evaluating IPFS will do in the first hour of playing with IPFS.
In Rhea gateways backed by Saturn L1s, the IPNI DHT proxy is used for content routing, and that creates a UX papercut:
- Different Saturn L1s will ask IPNI for the same CID multiple times.
- If a user tries to load content too fast, and one L1+Lassie fails to find providers, Caboose (Saturn Client) will retry using a different L1, and that will also fail due to cached 404.
- User will try to refresh the page, and it will fail again because there is no L1 that can find providers until cached 404 expires (300s)

👉 End result? User is unable to fetch content which is available on DHT for the next 5 minutes, and will report or form an opinion that IPFS DHT is slow / flakey, while in reality, it was HTTP cache on centralized service set too high.
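
The timing described above can be illustrated with a toy model. This is only a sketch under simplifying assumptions: the first lookup at t=0 misses, providers become findable right after, and the 404 is shared by every L1 until the TTL expires. `lookup_result` is a hypothetical helper and makes no network calls.

```shell
#!/bin/sh
# Toy model of a shared negative cache; no network access involved.
# Assumption: the lookup at t=0 returned 404 and that response is cached
# for TTL seconds, during which every L1 sees the same cached answer.

# lookup_result TTL T: print the status a retry at second T would see.
lookup_result() {
  ttl=$1
  t=$2
  if [ "$t" -lt "$ttl" ]; then
    echo 404   # still served from the cached false negative
  else
    echo 200   # cache entry expired, the indexer is asked again
  fi
}

# Compare the current TTL (300s) with the proposed one (5s).
for ttl in 300 5; do
  for t in 1 10 60 300; do
    echo "ttl=${ttl}s retry_at=${t}s -> $(lookup_result "$ttl" "$t")"
  done
done
```

In this model a 300s TTL turns retries at 1s, 10s and 60s into 404s, while a 5s TTL only fails the retry at 1s.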

Proposed Changes

Lowering the CloudFront cache TTL for 404 errors to 5s will fix false-negative content routing errors for end users.
It should still protect you from unwanted load spikes, but the end user will be able to refresh the page without waiting 5 minutes to see their content.

Happy to discuss other values, but 5 minutes is way too high:
the majority of users won't wait and retry after 5 minutes, they will just give up on IPFS.

Tests

Announcing a single block on DHT and then asking indexer for it sometimes produces 404, and that is cached for 5 minutes, artificially breaking content routing resolution for that CID on Rhea.

$ CID=$(date | ipfs block put -f v0) && sleep 5 && time curl -H "Accept: application/json" https://cid.contact/multihash/$CID\?cascade\=ipfs-dht -i
HTTP/2 200
[..] 2.386 total

$ CID=$(date | ipfs block put -f v0) && sleep 5 && time curl -H "Accept: application/json" https://cid.contact/multihash/$CID\?cascade\=ipfs-dht -i
HTTP/2 404
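
To see how long such a false negative sticks, one could poll the same URL until it stops returning 404. The sketch below assumes `curl` is installed; `probe_status` and `wait_for_200` are hypothetical helper names, not part of any IPNI tooling, and the default interval and retry limit are arbitrary.

```shell
#!/bin/sh
# Sketch: poll a lookup URL until it returns 200.
# probe_status and wait_for_200 are hypothetical helpers for illustration.

# Print only the HTTP status code of a GET request to the given URL.
probe_status() {
  curl -s -o /dev/null -w '%{http_code}' -H 'Accept: application/json' "$1"
}

# wait_for_200 URL [INTERVAL_SECONDS] [MAX_TRIES]
# Prints the number of attempts needed; returns 1 if the limit is reached.
wait_for_200() {
  url=$1
  interval=${2:-5}
  max_tries=${3:-70}
  tries=0
  while [ "$tries" -lt "$max_tries" ]; do
    tries=$((tries + 1))
    if [ "$(probe_status "$url")" = "200" ]; then
      echo "$tries"
      return 0
    fi
    sleep "$interval"
  done
  return 1
}

# Example (requires a running ipfs daemon and network access):
#   CID=$(date | ipfs block put -f v0)
#   wait_for_200 "https://cid.contact/multihash/$CID?cascade=ipfs-dht"
```

Multiplying the printed attempt count by the interval gives a rough estimate of how long the 404 stayed cached for that CID.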

Revert Strategy

You can always undo this 1 line change 🤷

The user adds content to an IPFS node, then tries to fetch it in a browser before sharing the link with a friend. For gateways backed by Saturn L1s (which use the IPNI DHT proxy), we hit a problem. Different Saturn L1s will ask IPNI for the same CID multiple times. When one L1 fails, Caboose (Saturn Client) will retry using a different L1, and that will also fail due to the cached 404. The user will try to refresh the page, and it will fail again because there is no L1 that can find providers for the next ~300 seconds. The user is unable to fetch content which is available on the DHT for the next 5 minutes, and will report that IPFS is slow / flakey, while in reality an HTTP cache on a centralized service is set too high, and that is all. Proposal: lowering this to 5s should still protect you from unwanted load spikes, but will fix false-negative content routing errors for end users.

masih · 2023-03-01T22:30:16Z

Deferred until lassie moves to ndjson and it is evident that caching parameters are not a good fit.

- human-readable retry-after errors
- adjusted CID cooldown/failure cache to minimize end user impact (1m delay is not the best, but way better than 5m which is plain user hostile – context: ipni/storetheindex#1344)
- easier to read and reason about durations, prepare for upstream support of RetryAfter errors in go-libipfs/gateway

masih · 2023-03-07T11:51:00Z

404s reduced significantly after Lassie moved to an ndjson response instead of non-streaming JSON.

Regardless, going to experiment with a lower 404 cache TTL and its impact on query amplification at cid.contact.

lidel mentioned this pull request Mar 2, 2023

Better downvoting and cool down fetches filecoin-saturn/caboose#59

Merged

3 tasks

masih approved these changes Mar 7, 2023

masih merged commit 0095821 into ipni:main Mar 7, 2023

masih mentioned this pull request Mar 7, 2023

Revert "fix: lower 404 ttl to decrease end user failures" #1365

Merged

lidel deleted the patch-1 branch March 7, 2023 21:58

lidel mentioned this pull request Mar 9, 2023

Add Retry-After and Cache-Control headers ipni/indexstar#91

Open

lidel mentioned this pull request Oct 30, 2023

feat(gw): Ipfs-Gateway-Mode: path|trustless ipfs/boxo#495

Closed

3 tasks
