
Add idempotence/retry behavior to ILP-over-HTTP #567

Merged (10 commits, Jul 6, 2021)

Conversation

kincaidoneil (Member)

ILP-over-HTTP has a known issue: if the response containing a Fulfill packet is dropped (e.g. due to a network error), there's no mechanism to retry it, so the balances between the two peers get out of sync and require manual reconciliation. Also, if the request with an ILP Prepare is dropped, it won't be rejected until the packet expires, which is undesirable since payments may take longer to complete.

This proposes a backwards-compatible change to ILP-over-HTTP that separates the Prepare and the Fulfill/Reject into two separate HTTP request/response pairs. Each packet has a corresponding idempotency key, enabling clients to safely retry packets without triggering the same side effect multiple times.

The flow looks roughly like:

  1. Alice sends POST with the ILP Prepare to Bob.
  2. Bob begins processing ILP Prepare and responds with a 200 status. The response is dropped due to a network error.
  3. Alice doesn't receive any response after 5 seconds (for example), and retries sending the Prepare.
  4. Bob ignores the Prepare since the request has an idempotency key he's already seen, and responds again with a 200.
  5. Alice gets the response, and knows no further retries are necessary.
  6. Bob gets the corresponding ILP Fulfill, and sends a POST to Alice with the Fulfill packet and same idempotency key as the ILP Prepare.
  7. Alice gets the request and responds with a 200 status.
  8. Bob receives the 200 response and knows no retry is necessary.
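The sender side of steps 1-5 can be sketched as follows. This is an illustrative sketch only, not part of the spec: `transport` is a hypothetical callable standing in for the HTTP POST, and the retry count and timeout handling are assumptions.

```python
import uuid

def send_prepare_with_retries(transport, prepare, max_attempts=3):
    """Retry an ILP Prepare with one fixed idempotency key until the peer ACKs.

    `transport(key, packet)` is a hypothetical callable that performs the
    HTTP POST and returns the status code, or raises ConnectionError if the
    response is dropped in transit.
    """
    key = str(uuid.uuid4())  # one idempotency key, reused on every retry
    for _ in range(max_attempts):
        try:
            status = transport(key, prepare)
        except ConnectionError:
            continue  # response dropped: retry with the SAME key
        if status == 200:
            return key  # ACKed; the Fulfill/Reject arrives in its own request
    raise TimeoutError("no acknowledgement before the Prepare expired")
```

Because the key never changes across retries, Bob can recognize a duplicate Prepare (step 4) and simply re-send the 200 without processing the packet again.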

@matdehaast

Async Request/Response!

That is how the Mojaloop API handles all REST API calls. I will say it does make development and debugging trickier, but it alleviates all the statefulness of the actual request, which allows for better load balancing, queuing, and generally a more scalable architecture IMO.

I am really in favour of this, because when a packet goes 3-4 hops, you essentially have a long chained synchronous HTTP connection over multiple hops that could be the source of any number of errors.

@kincaidoneil (Member Author) commented Dec 17, 2019

One caveat to the backwards compatibility: AFAIK, there's no safe way to retry sending ILP Prepares to a node supporting the old version of the spec without it processing the Prepare multiple times (right?). If the Prepare is retried as a result of a network error, this is probably fine, since it's no worse than the current version (money could be lost either way)! However, if the client retries sending the Prepare e.g. every 5 seconds as long as it hasn't received any response to the original request, the older node would process the same Prepare multiple times.

I added a note that nodes must know whether their peer supports draft 3 or later before retrying sending Prepares, but how should this be implemented? A per-account flag?

060732f...f4fc002#diff-ccfda9b7df50a9cfd8618bb9c08e360aR116

@sharafian

If we want to smoothly upgrade this, we could add some version negotiation. You could carry an Accept header on the Prepare that informs whether you're OK with an async response, and the response would come back with a content type that communicates whether it's just an empty ACK or whether it's carrying the fulfillment.

I do think we should be careful about retrying packets at the link layer though. Imagine there's a network error preventing your fulfills from getting back to me (like my load balancer isn't accepting traffic from you but yours is OK). My connector is going to be accumulating all of these prepare packets that it continually retries until they time out. And the original sender is probably going to be sending more packets too. Your systems will be trying to send back an increasing number of fulfills. It could lead to a big slowdown where everyone gets mucked up with packets that keep getting rejected.

It might be good to investigate the prevalence of this error before we go about solving it. If you can measure how many packets you never got a fulfill/reject for, it would inform what we should be solving for. If it never happens in practice, then maybe it's not worth adding the memory footprint of tracking idempotence IDs for 100's to 1000's of packets per second. And if it happens a lot, maybe there's some network configuration upgrades we can do instead of a protocol change.

All that criticism is separate from splitting Prepare from Fulfill and Reject. I think an async flow will probably have benefits separate from idempotence (like Matt said, it leaves much fewer hanging connections). Although it would break Coil's setup, because we horizontally scale our connector and having the fulfill go to a different shard than the one the prepare came from will make it stop working. Does the rust implementation have that issue too when sharded?

@sentientwaffle (Contributor)

If no HTTP response is received within a given timeout, clients SHOULD retry sending the packet with the same idempotency key.

When the sender has gotten a 200 from the receiver (but has not received a corresponding Fulfill or Reject), why is it the sender's responsibility to retry? The receiver already has the Prepare, and could keep retrying their fulfill/reject request until getting a 200 back from the original sender.

Aside: as both parties of a payment are simultaneously HTTP clients and servers it would be clearer to refer to them in the spec as "sender"/"receiver" rather than "client"/"server".

kincaidoneil and others added 2 commits December 20, 2019 08:28
Co-Authored-By: David Fuelling <sappenin@gmail.com>
- reorganize for clarity
- clarify async mode is optional, sync mode is still supported
- clarify behavior of sender and receiver for retries
- use 202 Accepted for async response
- retrying ILP Prepares is optional and must be negotiated out of band
- add timeout after Prepare expires before purging old idempotency keys
@kincaidoneil (Member Author)

When the sender has gotten a 200 from the receiver (but has not received a corresponding Fulfill or Reject), why is it the sender's responsibility to retry? The receiver already has the Prepare, and could keep retrying their fulfill/reject request until getting a 200 back from the original sender.

@sentientwaffle Yep! (The behavior you describe was how I intended it to be read, but I agree that "client" and "server" were unclear). I updated the language with "sender of ILP Prepare," "recipient of ILP Fulfill/Reject," etc. to try to better clarify this.

Until there's broad consensus to deprecate Draft 2, I think this RFC should support both sync and async modes; or, if there's disagreement there, perhaps we should define this async behavior in a new specification.

@sappenin I reorganized and clarified the doc to better differentiate between the two modes and that they're both supported.

Although it would break Coil's setup, because we horizontally scale our connector and having the fulfill go to a different shard than the one the prepare came from will make it stop working. Does the rust implementation have that issue too when sharded?

Good points @sharafian. Definitely agree it's worth measuring how prevalent this is, though even a very low failure rate could be costly in the time required for manual reconciliation. Re: horizontal scaling: the async mode does seem to complicate this. There's been a little discussion about adding crash recovery to Rust; if in-flight packets were written to e.g. a Redis store shared between instances, that might make it possible to support.

I think this is ready for re-review when/if anyone gets a chance!

@sharafian left a comment

Looking good, I like the support for synchronous mode to make the upgrade path easier


The sender MUST conclude retrying after receiving a response with a `2xx` status, `4xx` error, or receiving the corresponding ILP Fulfill/Reject packet.

The sender SHOULD ensure there are multiple attempts to deliver the packet to the peer before the ILP Prepare expires.


It seems like if a Prepare fails there's really nothing lost, so it'd be best to fast-fail and let the sender retry if they so choose. That keeps packets faster, rather than having everyone implement their own retry logic that might wait some timeout before retrying the packet, slowing everyone down.

The only reason I see not to fast-fail would be if there's a risk that a 5xx came back but the packet was actually passed on. But if you treated the 5xx as a reject and didn't honor a fulfill that came after it, then that wouldn't really matter.

@kincaidoneil (Member Author)

The sender SHOULD ensure there are multiple attempts to deliver the packet to the peer before the ILP Prepare expires.

It seems like if a Prepare fails there's really nothing lost so it'd be best to fast-fail and let the sender retry if they so choose.

If Bob's Prepare packet is temporarily interrupted in transit and doesn't reach Charlie, retrying at the link layer allows much quicker recovery. To Bob, this would appear as a network error, but they can't safely return a reject right then because it's possible Charlie received the packet and it'll ultimately get fulfilled (if Charlie's response was interrupted in transit, but Bob doesn't know). Currently:

  1. All previous nodes in the path have to wait the entire timeout before the hold on the balance is released, which slows down the payment and prevents re-allocating that liquidity
  2. Afterward, the original sender of the Prepare has to wait for the Reject to percolate all the way back before retrying (e.g. assuming the number of packets in flight is capped), which may increase the time to complete the payment

None of this is imperative, so it's an optional feature that two peers can choose to implement or enable.


Every Interledger packet corresponds to a transaction that may affect financial accounting balances. If a request fails, such as due to a network connection error, retrying ILP requests and responses with [idempotence](https://en.wikipedia.org/wiki/Idempotence) can prevent balance inconsistencies between peers.

When a connector begins processing an incoming packet with an idempotency key it has not already tracked, it MUST persist that key. If a subsequent request of the same type (ILP Prepare, or ILP Fulfill/Reject) is encountered with the same idempotency key, the packet should be ignored, and the connector should respond with the successful status. For safety, the connector MUST persist each idempotency key until some amount of time after the corresponding ILP Prepare expires so it doesn't accidentally process a duplicate ILP Prepare packet.
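The key-tracking rule quoted above could be sketched as a small in-memory cache. This is an illustrative sketch, not the spec's required implementation: a production connector would persist keys durably, and `GRACE_SECS` is an assumed value for the "some amount of time after the Prepare expires" buffer.

```python
import time

class IdempotencyCache:
    """In-memory sketch of idempotency-key tracking for incoming packets."""

    GRACE_SECS = 60  # assumed buffer kept after the Prepare expires

    def __init__(self):
        self._purge_after = {}  # idempotency key -> timestamp to purge at

    def first_sighting(self, key, prepare_expires_at):
        """True if the key is new and was recorded; False for a duplicate,
        which the caller should ignore while re-sending the original
        success status."""
        self._purge(time.time())
        if key in self._purge_after:
            return False
        self._purge_after[key] = prepare_expires_at + self.GRACE_SECS
        return True

    def _purge(self, now):
        # Keys are only safe to drop once the corresponding Prepare can no
        # longer be retried, i.e. some time after its expiry.
        self._purge_after = {
            k: t for k, t in self._purge_after.items() if t > now
        }
```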


If we want to make this nonbreaking, we should clarify that idempotence is a feature of asynchronous mode only. If we want to make it nonbreaking and also add idempotence to synchronous mode, then we should have a response code/header of some kind that makes it clear the idempotency key will be honored.

@kincaidoneil (Member Author)

Since ILP Prepare retries must be negotiated out of band, isn't this still non-breaking? If the peers don't retry ILP Prepares, the recipient of the ILP Prepare doesn't need to handle idempotent requests. Can I clarify that better?

@kincaidoneil (Member Author)

Added an explanation here that hopefully better clarifies this: 09916ea...8b43daf#diff-ccfda9b7df50a9cfd8618bb9c08e360aR126-R132

- clarify why ILP Prepares are retried, and why it needs to be negotiated
- minor edits
- remove 409 retry behavior
- clarify when idempotence should be used
@adrianhopebailie (Collaborator)

This is a neat way to solve the issue, and as @matdehaast points out, this is the pattern we've seen used for Mojaloop.

I do wonder if the async pattern is a requirement for idempotency? It seems a pity to lose the one aspect of HTTP that was especially useful: request/response matching.

What if we simply added the idempotency key and expected a client to terminate the request and retry with the same key if it doesn't get a response?

If you use HTTP/2 this would only cost you a new stream which is very cheap.

@stale

stale bot commented Mar 7, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

If this issue is important, please feel free to bring it up on the next Interledger Community Group Call or in the Gitter chat.

@stale stale bot added the wontfix label Mar 7, 2020
@stale stale bot closed this Mar 14, 2020
@stale
stale bot commented Jun 23, 2020 with the same stale notice

@stale stale bot added the wontfix label Jun 23, 2020
@stale stale bot closed this Jul 3, 2020
@kincaidoneil kincaidoneil reopened this Sep 8, 2020
@kincaidoneil (Member Author)

I made some changes to this proposal:

Added a Callback-Url header to the request

Previously, the idea was that the callback URL for asynchronous replies would be statically configured for each peer. However, that complicates the implementation if the ILP Prepare sender operates multiple nodes behind a load balancer: they must track which node handled the original ILP Prepare in order to direct the ILP reply to the correct node. By including a callback URL in each individual HTTP request, no static configuration is required, and the ILP reply can be sent directly to the node instance that sent the ILP Prepare and has its associated state available (they can still choose to use a load balancer).
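A sketch of how a sending node instance might build these headers. The helper name is hypothetical; the header names (Callback-Url, Request-Id) are from this proposal, and the octet-stream content type is ILP-over-HTTP's existing one.

```python
import uuid

def async_prepare_headers(callback_url):
    """Build the headers for an async ILP Prepare request (illustrative).

    Callback-Url targets the specific node instance that holds this
    Prepare's state, so the reply can bypass the load balancer;
    Request-Id correlates the later Fulfill/Reject with this Prepare.
    """
    return {
        "Content-Type": "application/octet-stream",
        "Callback-Url": callback_url,
        "Request-Id": str(uuid.uuid4()),
    }
```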

Removed idempotent ILP Prepare retries (for now)

Retrying Prepares with idempotence requires the ILP Prepare receiver to operate a cache of idempotence keys, and is a breaking change for existing peering relationships (both peers must enable ILP Prepare retries for it to be safe). Note, however: this doesn't apply to idempotent retries on the ILP reply leg, which are still supported here.

I do still think we should add this feature eventually. The downside of not retrying Prepares is that if a request fails due to a transient error, the sender must wait the full timeout until they can retry it. For retail payments, the payment will take an additional 30 seconds (since holds prevent unsafe amounts in-flight). Also, for example, the pay STREAM implementation would terminate the payment since it took too long -- which I believe is the right behavior.

All ILP Packets MUST be returned with the HTTP status code `200: OK`.
If the request included a `Callback-Url` header and `Request-Id` header, the recipient handling the ILP Prepare SHOULD choose to handle the packet asynchronously and return the corresponding ILP Fulfill/Reject in a separate outgoing HTTP request.

In the asynchronous mode, if the request is semantically valid, the recipient MUST respond immediately that the ILP Prepare is accepted for processing, even if the packet will ultimately be rejected:


Is it a problem that until the response comes back, the sender doesn't know whether asynchronous mode is being used? If it takes a couple seconds, the connector isn't sure whether the ACK is taking a long time or whether it's waiting for a fulfill/reject. I can't think of any concrete case where that's a problem, but it seems slightly odd.

@kincaidoneil (Member Author) commented Sep 21, 2020

If the request is still pending but just taking a long time, no behavior changes since it's unsafe for any sender to retry (until we add idempotent Prepare retries).

Today, if a request is sent but then the connection is terminated, the sender immediately treats the packet as rejected. If the sender supports async, they must wait the full timeout even if the connection is terminated, since they may receive an async response. Note that this only happens when the reply is dropped, so it's the existing failure scenario which gets slightly worse (if the sender supports async but recipient doesn't).

@stale
stale bot commented Oct 24, 2020 with the same stale notice

@stale stale bot added the wontfix label Oct 24, 2020
@stale stale bot closed this Nov 1, 2020
@kincaidoneil kincaidoneil reopened this Jan 19, 2021
@stale
stale bot commented Jun 26, 2021 with the same stale notice

@matdehaast left a comment

Just some general cleanup but otherwise good

@matdehaast left a comment

👍

@kincaidoneil last question. If I receive a request that specifies Prefer: respond-async and I don't support it. Is there a normative response that I must reply with?

@kincaidoneil (Member Author)

@matdehaast For backwards compatibility, if the receiver doesn't support async, they just handle the request synchronously. The sender can identify which mode the receiver chose based on the status code of the response (200 vs 202) or if the body is non-empty.
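The mode detection described here could be sketched as a small helper on the sender side. The function name and return shape are hypothetical, not part of the spec; the logic follows the rule that a 202 (or empty-body) response means the peer accepted async, while a 200 with a body carries the ILP reply directly.

```python
def classify_reply(status, body):
    """Classify a peer's HTTP response to an ILP Prepare (illustrative).

    Returns ("async", None) when the reply will arrive via the callback,
    ("sync", body) when the body is the ILP Fulfill/Reject, and
    ("error", None) otherwise.
    """
    if status == 202 or (status == 200 and not body):
        return ("async", None)  # ACK only; await the callback request
    if status == 200:
        return ("sync", body)   # body is the ILP Fulfill/Reject
    return ("error", None)
```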

Comment on lines 53 to 54
- **Callback URL Header** &mdash; _Optional_. Callback URL of the origin connector to send an asynchronous HTTP request with the ILP Fulfill/Reject.
- **Request Id Header** &mdash; _Optional_. UUIDv4 to uniquely identify this ILP Prepare, and correlate the corresponding ILP Fulfill/Reject.
Contributor

These aren't Optional in async mode, right?

@kincaidoneil (Member Author)

Thanks. Changed Request-Id so it's required, but keeping the Callback-Url as optional so peers can still use other mechanisms to set/exchange the callback URL.

@kincaidoneil kincaidoneil merged commit 322ab73 into master Jul 6, 2021
@huijing huijing deleted the ko-ilp-http-idempotence branch November 1, 2023 02:34