
Add idempotence/retry behavior to ILP-over-HTTP #567

Merged (10 commits, Jul 6, 2021)

Conversation

kincaidoneil (Member)

ILP-over-HTTP has a known issue: if the response containing a Fulfill packet is dropped (e.g. due to a network error), there's no mechanism to retry it, so the balances between the two peers get out of sync and require manual reconciliation. Also, if the request with an ILP Prepare is dropped, it won't be rejected until the packet expires, which is undesirable since payments may take longer to complete.

This proposes a backwards-compatible change to ILP-over-HTTP that separates the Prepare and the Fulfill/Reject into two separate HTTP request/response pairs. Each packet has a corresponding idempotency key, enabling clients to safely retry packets without triggering the same side effect multiple times.

The flow looks roughly like:

  1. Alice sends POST with the ILP Prepare to Bob.
  2. Bob begins processing ILP Prepare and responds with a 200 status. The response is dropped due to a network error.
  3. Alice doesn't receive any response after 5 seconds (for example), and retries sending the Prepare.
  4. Bob ignores the Prepare since the request has an idempotency key he's already seen, and responds again with a 200.
  5. Alice gets the response, and knows no further retries are necessary.
  6. Bob gets the corresponding ILP Fulfill, and sends a POST to Alice with the Fulfill packet and same idempotency key as the ILP Prepare.
  7. Alice gets the request and responds with a 200 status.
  8. Bob receives the 200 response and knows no retry is necessary.
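The sender side of steps 1-5 can be sketched as follows. This is an illustrative sketch only, not part of the spec: `transport` is a hypothetical callable standing in for the HTTP POST, and the retry count and timeout handling are assumptions.

```python
import uuid

def send_prepare_with_retries(transport, prepare, max_attempts=3):
    """Retry an ILP Prepare with one fixed idempotency key until the peer ACKs.

    `transport(key, packet)` is a hypothetical callable that performs the
    HTTP POST and returns the status code, or raises ConnectionError if the
    response is dropped in transit.
    """
    key = str(uuid.uuid4())  # one idempotency key, reused on every retry
    for _ in range(max_attempts):
        try:
            status = transport(key, prepare)
        except ConnectionError:
            continue  # response dropped: retry with the SAME key
        if status == 200:
            return key  # ACKed; the Fulfill/Reject arrives in its own request
    raise TimeoutError("no acknowledgement before the Prepare expired")
```

Because the key never changes across retries, Bob can recognize a duplicate Prepare (step 4) and simply re-send the 200 without processing the packet again.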

@matdehaast

Async Request/Response!

That is how the Mojaloop API handles all REST API calls. I will say it does make development and debugging trickier, but it alleviates all the statefulness of the actual request, which allows for better load balancing, queuing, and generally a more scalable architecture IMO.

I am really in favour of this, because when a packet goes 3-4 hops, you essentially have a long chained synchronous HTTP connection over multiple hops that could be the source of any number of errors.

@kincaidoneil (Member Author) commented Dec 17, 2019

One caveat to the backwards compatibility: AFAIK, there's no safe way to retry sending ILP Prepares to a node supporting the old version of the spec without it processing the Prepare multiple times (right?). If the Prepare is retried as a result of a network error, this is probably fine, since it's no worse than the current version (money could be lost either way)! However, if the client retries sending the Prepare e.g. every 5 seconds as long as it hasn't received any response to the original request, the older node would process the same Prepare multiple times.

I added a note that nodes must know whether their peer supports draft 3 or later before retrying sending Prepares, but how should this be implemented? A per-account flag?

060732f...f4fc002#diff-ccfda9b7df50a9cfd8618bb9c08e360aR116

@sharafian

If we want to smoothly upgrade this, we could add some version negotiation. You could carry an Accept header on the Prepare that informs whether you're OK with an async response, and the response would come back with a content type that communicates whether it's just an empty ACK or whether it's carrying the fulfillment.

I do think we should be careful about retrying packets at the link layer though. Imagine there's a network error preventing your fulfills from getting back to me (like my load balancer isn't accepting traffic from you but yours is OK). My connector is going to be accumulating all of these prepare packets that it continually retries until they time out. And the original sender is probably going to be sending more packets too. Your systems will be trying to send back an increasing number of fulfills. It could lead to a big slowdown where everyone gets mucked up with packets that keep getting rejected.

It might be good to investigate the prevalence of this error before we go about solving it. If you can measure how many packets you never got a fulfill/reject for, it would inform what we should be solving for. If it never happens in practice, then maybe it's not worth adding the memory footprint of tracking idempotence IDs for 100's to 1000's of packets per second. And if it happens a lot, maybe there's some network configuration upgrades we can do instead of a protocol change.

All that criticism is separate from splitting Prepare from Fulfill and Reject. I think an async flow will probably have benefits separate from idempotence (like Matt said, it leaves much fewer hanging connections). Although it would break Coil's setup, because we horizontally scale our connector and having the fulfill go to a different shard than the one the prepare came from will make it stop working. Does the rust implementation have that issue too when sharded?

@sentientwaffle (Contributor)

If no HTTP response is received within a given timeout, clients SHOULD retry sending the packet with the same idempotency key.

When the sender has gotten a 200 from the receiver (but has not received a corresponding Fulfill or Reject), why is it the sender's responsibility to retry? The receiver already has the Prepare, and could keep retrying their fulfill/reject request until getting a 200 back from the original sender.

Aside: as both parties of a payment are simultaneously HTTP clients and servers it would be clearer to refer to them in the spec as "sender"/"receiver" rather than "client"/"server".

kincaidoneil and others added 2 commits December 20, 2019 08:28
Co-Authored-By: David Fuelling <sappenin@gmail.com>
- reorganize for clarity
- clarify async mode is optional, sync mode is still supported
- clarify behavior of sender and receiver for retries
- use 202 Accepted for async response
- retrying ILP Prepares is optional and must be negotiated out of band
- add timeout after Prepare expires before purging old idempotency keys
@kincaidoneil (Member Author)

When the sender has gotten a 200 from the receiver (but has not received a corresponding Fulfill or Reject), why is it the sender's responsibility to retry? The receiver already has the Prepare, and could keep retrying their fulfill/reject request until getting a 200 back from the original sender.

@sentientwaffle Yep! (The behavior you describe was how I intended it to be read, but I agree that "client" and "server" were unclear). I updated the language with "sender of ILP Prepare," "recipient of ILP Fulfill/Reject," etc. to try to better clarify this.

Until there's broad consensus to deprecate Draft 2, I think this RFC should support both sync and async modes; or, if there's disagreement there, perhaps we should define this async behavior in a new specification.

@sappenin I reorganized and clarified the doc to better differentiate between the two modes and that they're both supported.

Although it would break Coil's setup, because we horizontally scale our connector and having the fulfill go to a different shard than the one the prepare came from will make it stop working. Does the rust implementation have that issue too when sharded?

Good points @sharafian. Definitely agree it's worth measuring how prevalent this is, though even a very low failure rate could be costly in the time required for manual reconciliation. Re: horizontal scaling: the async mode does seem to complicate this. There's been a little discussion about adding crash recovery to Rust; if in-flight packets were written to e.g. a Redis store shared between instances, that might make it possible to support.

I think this is ready for re-review when/if anyone gets a chance!

@sharafian left a comment

Looking good, I like the support for synchronous mode to make the upgrade path easier


The sender MUST conclude retrying after receiving a response with a `2xx` status, `4xx` error, or receiving the corresponding ILP Fulfill/Reject packet.

The sender SHOULD ensure there are multiple attempts to deliver the packet to the peer before the ILP Prepare expires.


It seems like if a Prepare fails there's really nothing lost, so it'd be best to fast-fail and let the sender retry if they so choose. That keeps packets faster, rather than having everyone implement their own retry logic that might wait some timeout before retrying the packet, slowing everyone down.

The only reason I see not to fast-fail would be if there's a risk that a 5xx came back but the packet was actually passed on. But if you treated the 5xx as a reject and didn't honor a fulfill that came after it, then that wouldn't really matter.

@kincaidoneil (Member Author)

The sender SHOULD ensure there are multiple attempts to deliver the packet to the peer before the ILP Prepare expires.

It seems like if a Prepare fails there's really nothing lost so it'd be best to fast-fail and let the sender retry if they so choose.

If Bob's Prepare packet is temporarily interrupted in transit and doesn't reach Charlie, retrying at the link layer allows much quicker recovery. To Bob, this would appear as a network error, but they can't safely return a reject right then because it's possible Charlie received the packet and it'll ultimately get fulfilled (if Charlie's response was interrupted in transit, but Bob doesn't know). Currently:

  1. All previous nodes in the path have to wait the entire timeout before the hold on the balance is released, which slows down the payment and prevents re-allocating that liquidity
  2. Afterward, the original sender of the Prepare has to wait for the Reject to percolate all the way back before retrying (e.g. assuming the number of packets in flight is capped), which may increase the time to complete the payment

None of this is imperative, so it's an optional feature that two peers can choose to implement or enable.


Every Interledger packet corresponds to a transaction that may affect financial accounting balances. If a request fails, such as due to a network connection error, retrying ILP requests and responses with [idempotence](https://en.wikipedia.org/wiki/Idempotence) can prevent balance inconsistencies between peers.

When a connector begins processing an incoming packet with an idempotency key it has not already tracked, it MUST persist that key. If a subsequent request of the same type (ILP Prepare, or ILP Fulfill/Reject) is encountered with the same idempotency key, the packet should be ignored, and the connector should respond with the successful status. For safety, the connector MUST persist each idempotency key until some amount of time after the corresponding ILP Prepare expires so it doesn't accidentally process a duplicate ILP Prepare packet.
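The key-tracking rule quoted above could be sketched as a small in-memory cache. This is an illustrative sketch, not the spec's required implementation: a production connector would persist keys durably, and `GRACE_SECS` is an assumed value for the "some amount of time after the Prepare expires" buffer.

```python
import time

class IdempotencyCache:
    """In-memory sketch of idempotency-key tracking for incoming packets."""

    GRACE_SECS = 60  # assumed buffer kept after the Prepare expires

    def __init__(self):
        self._purge_after = {}  # idempotency key -> timestamp to purge at

    def first_sighting(self, key, prepare_expires_at):
        """True if the key is new and was recorded; False for a duplicate,
        which the caller should ignore while re-sending the original
        success status."""
        self._purge(time.time())
        if key in self._purge_after:
            return False
        self._purge_after[key] = prepare_expires_at + self.GRACE_SECS
        return True

    def _purge(self, now):
        # Keys are only safe to drop once the corresponding Prepare can no
        # longer be retried, i.e. some time after its expiry.
        self._purge_after = {
            k: t for k, t in self._purge_after.items() if t > now
        }
```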


If we want to make this nonbreaking, we should clarify that idempotence is a feature of asynchronous mode only. If we want to make it nonbreaking and also add idempotence to synchronous mode, then we should have a response code/header of some kind that makes it clear the idempotency key will be honored.

@kincaidoneil (Member Author)

Since ILP Prepare retries must be negotiated out of band, isn't this still non-breaking? If the peers don't retry ILP Prepares, the recipient of the ILP Prepare doesn't need to handle idempotent requests. Can I clarify that better?

@kincaidoneil (Member Author)

Added an explanation here that hopefully better clarifies this: 09916ea...8b43daf#diff-ccfda9b7df50a9cfd8618bb9c08e360aR126-R132

- clarify why ILP Prepares are retried, and why it needs to be negotiated
- minor edits
- remove 409 retry behavior
- clarify when idempotence should be used
@adrianhopebailie (Collaborator)

This is a neat way to solve the issue, and as @matdehaast points out, this is the pattern we've seen used for Mojaloop.

I do wonder if the async pattern is a requirement for idempotency? It seems a pity to lose the one aspect of HTTP that was especially useful: request/response matching.

What if we simply added the idempotency key and expected a client to terminate the request and retry with the same key if it doesn't get a response?

If you use HTTP/2 this would only cost you a new stream which is very cheap.

@stale

stale bot commented Mar 7, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

If this issue is important, please feel free to bring it up on the next Interledger Community Group Call or in the Gitter chat.

@stale stale bot added the wontfix label Mar 7, 2020
@stale stale bot closed this Mar 14, 2020
@stale
stale bot commented Jun 23, 2020 with the same stale notice

@stale stale bot added the wontfix label Jun 23, 2020
@stale stale bot closed this Jul 3, 2020
@kincaidoneil kincaidoneil reopened this Sep 8, 2020
@kincaidoneil (Member Author)

I made some changes to this proposal:

Added a Callback-Url header to the request

Previously, the idea was that the callback URL for asynchronous replies would be statically configured for each peer. However, that complicates the implementation if the ILP Prepare sender operates multiple nodes behind a load balancer: they must track which node handled the original ILP Prepare in order to direct the ILP reply to the correct node. By including a callback URL in each individual HTTP request, no static configuration is required, and the ILP reply can be sent directly to the node instance that sent the ILP Prepare and has its associated state available (they can still choose to use a load balancer).
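A sketch of how a sending node instance might build these headers. The helper name is hypothetical; the header names (Callback-Url, Request-Id) are from this proposal, and the octet-stream content type is ILP-over-HTTP's existing one.

```python
import uuid

def async_prepare_headers(callback_url):
    """Build the headers for an async ILP Prepare request (illustrative).

    Callback-Url targets the specific node instance that holds this
    Prepare's state, so the reply can bypass the load balancer;
    Request-Id correlates the later Fulfill/Reject with this Prepare.
    """
    return {
        "Content-Type": "application/octet-stream",
        "Callback-Url": callback_url,
        "Request-Id": str(uuid.uuid4()),
    }
```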

Removed idempotent ILP Prepare retries (for now)

Retrying Prepares with idempotence requires the ILP Prepare receiver to operate a cache of idempotence keys, and is a breaking change for existing peering relationships (both peers must enable ILP Prepare retries for it to be safe). Note, however: this doesn't apply to idempotent retries on the ILP reply leg, which are still supported here.

I do still think we should add this feature eventually. The downside of not retrying Prepares is that if a request fails due to a transient error, the sender must wait the full timeout until they can retry it. For retail payments, the payment will take an additional 30 seconds (since holds prevent unsafe amounts in-flight). Also, for example, the pay STREAM implementation would terminate the payment since it took too long -- which I believe is the right behavior.

All ILP Packets MUST be returned with the HTTP status code `200: OK`.
If the request included a `Callback-Url` header and `Request-Id` header, the recipient handling the ILP Prepare SHOULD choose to handle the packet asynchronously and return the corresponding ILP Fulfill/Reject in a separate outgoing HTTP request.

In the asynchronous mode, if the request is semantically valid, the recipient MUST respond immediately that the ILP Prepare is accepted for processing, even if the packet will ultimately be rejected:


Is it a problem that until the response comes back, the sender doesn't know whether asynchronous mode is being used? If it takes a couple seconds, the connector isn't sure whether the ACK is taking a long time or whether it's waiting for a fulfill/reject. I can't think of any concrete case where that's a problem, but it seems slightly odd.

@kincaidoneil (Member Author) commented Sep 21, 2020

If the request is still pending but just taking a long time, no behavior changes since it's unsafe for any sender to retry (until we add idempotent Prepare retries).

Today, if a request is sent but then the connection is terminated, the sender immediately treats the packet as rejected. If the sender supports async, they must wait the full timeout even if the connection is terminated, since they may receive an async response. Note that this only happens when the reply is dropped, so it's the existing failure scenario which gets slightly worse (if the sender supports async but recipient doesn't).

@stale
stale bot commented Oct 24, 2020 with the same stale notice

@stale stale bot added the wontfix label Oct 24, 2020
@stale stale bot closed this Nov 1, 2020
@kincaidoneil kincaidoneil reopened this Jan 19, 2021
@stale
stale bot commented Jun 26, 2021 with the same stale notice

@matdehaast left a comment

Just some general cleanup but otherwise good

@matdehaast left a comment

👍

@kincaidoneil last question. If I receive a request that specifies Prefer: respond-async and I don't support it. Is there a normative response that I must reply with?

@kincaidoneil (Member Author)

@matdehaast For backwards compatibility, if the receiver doesn't support async, they just handle the request synchronously. The sender can identify which mode the receiver chose based on the status code of the response (200 vs 202) or if the body is non-empty.
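The mode detection described here could be sketched as a small helper on the sender side. The function name and return shape are hypothetical, not part of the spec; the logic follows the rule that a 202 (or empty-body) response means the peer accepted async, while a 200 with a body carries the ILP reply directly.

```python
def classify_reply(status, body):
    """Classify a peer's HTTP response to an ILP Prepare (illustrative).

    Returns ("async", None) when the reply will arrive via the callback,
    ("sync", body) when the body is the ILP Fulfill/Reject, and
    ("error", None) otherwise.
    """
    if status == 202 or (status == 200 and not body):
        return ("async", None)  # ACK only; await the callback request
    if status == 200:
        return ("sync", body)   # body is the ILP Fulfill/Reject
    return ("error", None)
```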

Comment on lines 53 to 54
- **Callback URL Header** &mdash; _Optional_. Callback URL of the origin connector to send an asynchronous HTTP request with the ILP Fulfill/Reject.
- **Request Id Header** &mdash; _Optional_. UUIDv4 to uniquely identify this ILP Prepare, and correlate the corresponding ILP Fulfill/Reject.
Contributor

These aren't Optional in async mode, right?

@kincaidoneil (Member Author)

Thanks. Changed Request-Id so it's required, but keeping the Callback-Url as optional so peers can still use other mechanisms to set/exchange the callback URL.

@kincaidoneil kincaidoneil merged commit 322ab73 into master Jul 6, 2021
@huijing huijing deleted the ko-ilp-http-idempotence branch November 1, 2023 02:34