Add idempotence/retry behavior to ILP-over-HTTP #567
Conversation
Async Request/Response! That is how the Mojaloop API handles all REST API calls. I will say it does make life trickier for development and debugging, but it alleviates all the statefulness of the actual request, which allows for better load balancing, queuing, and generally a more scalable architecture IMO. I am really in favour of this, because when a packet goes 3-4 hops, you essentially have this long chained sync HTTP connection over multiple hops that could be the source of any number of errors.
One caveat to the backwards compatibility: AFAIK, there's no safe way to retry sending ILP Prepares to a node supporting the old version of the spec without them processing the Prepare multiple times (right?). If the Prepare is retried as a result of a network error, this is probably fine, since it's no worse than the current version (money could be lost)! However, if the client is retrying sending the Prepare e.g. every 5 seconds as long as it hasn't received any response to the original request, the older node would process the same Prepare multiple times. I added a note that nodes must know whether their peer supports draft 3 or later before retrying sending Prepares, but how should this be implemented? A per-account flag?
If we want to smoothly upgrade this we could add some version negotiation: you can carry an `Accept` header on the Prepare that informs whether you're OK with an async response, and the response will come back with a `Content-Type` which communicates whether it's just an empty ACK or whether it's carrying the fulfillment.

I do think we should be careful about retrying packets at the link layer, though. Imagine there's a network error preventing your Fulfills from getting back to me (like my load balancer isn't accepting traffic from you but yours is OK). My connector is going to be accumulating all of these Prepare packets that it continually retries until they time out. And the original sender is probably going to be sending more packets too. Your systems will be trying to send back an increasing number of Fulfills. It could lead to a big slowdown where everyone gets mucked up with packets that keep getting rejected.

It might be good to investigate the prevalence of this error before we go about solving it. If you can measure how many packets you never got a Fulfill/Reject for, it would inform what we should be solving for. If it never happens in practice, then maybe it's not worth adding the memory footprint of tracking idempotence IDs for hundreds to thousands of packets per second. And if it happens a lot, maybe there are some network configuration upgrades we can do instead of a protocol change.

All that criticism is separate from splitting Prepare from Fulfill and Reject. I think an async flow will probably have benefits separate from idempotence (like Matt said, it leaves far fewer hanging connections). Although it would break Coil's setup, because we horizontally scale our connector, and having the Fulfill go to a different shard than the one the Prepare came from will make it stop working. Does the Rust implementation have that issue too when sharded?
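A rough sketch of the header negotiation proposed above. The `application/ilp-ack` media type and the header values here are invented for illustration; none of them are part of the spec:

```python
# Hypothetical negotiation: the sender advertises async support via Accept,
# and the status/Content-Type of the reply tells the sender which mode the
# receiver chose. The media types here are illustrative assumptions.
def negotiate(accept_header: str, receiver_supports_async: bool):
    """Return (status, content_type) for the response to an ILP Prepare."""
    sender_accepts_ack = "application/ilp-ack" in accept_header
    if sender_accepts_ack and receiver_supports_async:
        # Empty ACK now; the Fulfill/Reject arrives in a later request.
        return 202, "application/ilp-ack"
    # Synchronous fallback: the response body carries the Fulfill/Reject.
    return 200, "application/octet-stream"
```

Either side can lack async support and the exchange silently degrades to the current synchronous behavior.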
Aside: as both parties of a payment are simultaneously HTTP clients and servers, it would be clearer to refer to them in the spec as "sender"/"receiver" rather than "client"/"server".
Co-Authored-By: David Fuelling <sappenin@gmail.com>
- reorganize for clarity
- clarify async mode is optional, sync mode is still supported
- clarify behavior of sender and receiver for retries
- use 202 Accepted for async response
- retrying ILP Prepares is optional and must be negotiated out of band
- add timeout after Prepare expires before purging old idempotency keys
@sentientwaffle Yep! (The behavior you describe was how I intended it to be read, but I agree that "client" and "server" were unclear). I updated the language with "sender of ILP Prepare," "recipient of ILP Fulfill/Reject," etc. to try to better clarify this.
@sappenin I reorganized and clarified the doc to better differentiate between the two modes and that they're both supported.
Good points @sharafian. Definitely agree it's worth measuring how prevalent this is, though even a very low failure rate could be costly in the time required for manual reconciliation.

Re: horizontal scaling: the async mode does seem to complicate this. There's been a little discussion about adding crash recovery to Rust; if in-flight packets were written to e.g. a Redis store shared between instances, that might make it possible to support.

I think this is ready for re-review when/if anyone gets a chance!
Looking good, I like the support for synchronous mode to make the upgrade path easier
> The sender MUST conclude retrying after receiving a response with a `2xx` status, a `4xx` error, or receiving the corresponding ILP Fulfill/Reject packet.
>
> The sender SHOULD ensure there are multiple attempts to deliver the packet to the peer before the ILP Prepare expires.
It seems like if a Prepare fails there's really nothing lost, so it'd be best to fast-fail and let the sender retry if they so choose. That keeps packets faster, rather than having everyone implement their own retry logic that might be waiting some timeout to retry the packet, slowing everyone down.
The only reason I see to not fast fail would be if there's a risk that a 5xx came back but the packet was actually passed on. But if you treated the 5xx as a reject and didn't honor a fulfill that came after it then that wouldn't really matter.
> The sender SHOULD ensure there are multiple attempts to deliver the packet to the peer before the ILP Prepare expires.

> It seems like if a Prepare fails there's really nothing lost so it'd be best to fast-fail and let the sender retry if they so choose.
If Bob's Prepare packet is temporarily interrupted in transit and doesn't reach Charlie, retrying at the link layer allows much quicker recovery. To Bob, this would appear as a network error, but they can't safely return a reject right then because it's possible Charlie received the packet and it'll ultimately get fulfilled (if Charlie's response was interrupted in transit, but Bob doesn't know). Currently:
- All previous nodes in the path have to wait the entire timeout before the hold on the balance is released, which slows down the payment and prevents re-allocating that liquidity
- Afterward, the original sender of the Prepare has to wait for the Reject to percolate all the way back before retrying (e.g. assuming the number of packets in flight is capped), which may increase the time to complete the payment
None of this is imperative, so it's an optional feature that two peers can choose to implement or enable.
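As a concrete sketch of the link-layer retry rules under discussion: the sender keeps retrying on network errors or `5xx` responses, and stops on any `2xx` or `4xx` or once the Prepare expires. The `send` and `clock` callables below are stand-ins for real I/O, not APIs from the spec:

```python
def send_with_retries(send, expires_at, clock):
    """Retry delivering a packet until a conclusive response or Prepare expiry.

    `send` returns an HTTP status code or raises ConnectionError on a network
    failure; `clock` returns the current time (injected so this is testable).
    """
    while clock() < expires_at:
        try:
            status = send()
        except ConnectionError:
            continue  # transient network error: retry before expiry
        if 200 <= status < 300 or 400 <= status < 500:
            return status  # conclusive: the sender MUST stop retrying
        # 5xx is inconclusive; keep retrying until the Prepare expires
    return None  # expired with no conclusive response
```

For example, a `send` that fails twice with a network error and then returns `200` results in exactly three delivery attempts.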
> Every Interledger packet corresponds to a transaction that may affect financial accounting balances. If a request fails, such as due to a network connection error, retrying ILP requests and responses with [idempotence](https://en.wikipedia.org/wiki/Idempotence) can prevent balance inconsistencies between peers.
>
> When a connector begins processing an incoming packet with an idempotency key it has not already tracked, it MUST persist that key. If a subsequent request of the same type (ILP Prepare, or ILP Fulfill/Reject) is encountered with the same idempotency key, the packet should be ignored, and the connector should respond with the successful status. For safety, the connector MUST persist each idempotency key until some amount of time after the corresponding ILP Prepare expires so it doesn't accidentally process a duplicate ILP Prepare packet.
if we want to make this nonbreaking we should clarify that idempotence is a feature of asynchronous mode only. If we want to make it nonbreaking and also add idempotence to synchronous mode then we should have a response code/header of some kind that makes it clear the idempotency key will be honored
Since ILP Prepare retries must be negotiated out of band, isn't this still non-breaking? If the peers don't retry ILP Prepares, the recipient of the ILP Prepare doesn't need to handle idempotent requests. Can I clarify that better?
Added an explanation here that hopefully better clarifies this: 09916ea...8b43daf#diff-ccfda9b7df50a9cfd8618bb9c08e360aR126-R132
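The idempotency-key tracking described in this thread could be sketched as follows. This is an in-memory toy only (a real connector would persist keys durably), and the grace period is an arbitrary assumption:

```python
import time

class IdempotencyKeys:
    """Track idempotency keys until some time after the Prepare expires."""

    def __init__(self, grace=60.0, clock=time.monotonic):
        self._purge_at = {}  # key -> timestamp after which it may be dropped
        self._grace = grace
        self._clock = clock

    def first_use(self, key, prepare_expires_at):
        """True if the key is new; False for a duplicate, which the connector
        should ignore and re-acknowledge with the original successful status."""
        now = self._clock()
        # Drop keys whose retention window (expiry + grace) has passed.
        self._purge_at = {k: t for k, t in self._purge_at.items() if t > now}
        if key in self._purge_at:
            return False
        self._purge_at[key] = prepare_expires_at + self._grace
        return True
```

Keeping the key for a grace period past the Prepare's expiry is what prevents a late-arriving duplicate Prepare from being processed a second time.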
- clarify why ILP Prepares are retried, and why it needs to be negotiated
- minor edits
- remove 409 retry behavior
- clarify when idempotence should be used
This is a neat way to solve the issue, and as @matdehaast points out, this is the pattern we've seen used for Mojaloop. I do wonder if the async pattern is a requirement for idempotency? It seems a pity to lose the one aspect of HTTP that was especially useful: request/response matching. What if we simply added the idempotency key and expected a client to terminate the request and retry with the same key if it doesn't get a response? If you use HTTP/2 this would only cost you a new stream, which is very cheap.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. If this issue is important, please feel free to bring it up on the next Interledger Community Group Call or in the Gitter chat.
I made some changes to this proposal: Added a

> All ILP Packets MUST be returned with the HTTP status code `200: OK`. If the request included a `Callback-Url` header and `Request-Id` header, the recipient handling the ILP Prepare SHOULD choose to handle the packet asynchronously and return the corresponding ILP Fulfill/Reject in a separate outgoing HTTP request.
>
> In the asynchronous mode, if the request is semantically valid, the recipient MUST respond immediately that the ILP Prepare is accepted for processing, even if the packet will ultimately be rejected:
Is it a problem that until the response comes back the sender doesn't know whether asynchronous mode is being used? If it takes a couple seconds the connector isn't sure whether it's taking a long time to ACK or whether it's waiting for a fulfill/reject. I can't think of any concrete case where that's a problem but it seems slightly odd
If the request is still pending but just taking a long time, no behavior changes since it's unsafe for any sender to retry (until we add idempotent Prepare retries).
Today, if a request is sent but then the connection is terminated, the sender immediately treats the packet as rejected. If the sender supports async, they must wait the full timeout even if the connection is terminated, since they may receive an async response. Note that this only happens when the reply is dropped, so it's the existing failure scenario which gets slightly worse (if the sender supports async but recipient doesn't).
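The recipient-side dispatch being discussed might look like the sketch below, assuming the `Callback-Url`/`Request-Id` headers from this proposal. The handler callables and the callback URL are illustrative:

```python
def handle_prepare(headers, process_sync, queue_async):
    """Choose sync vs. async handling of an incoming ILP Prepare.

    Returns (status, body). In async mode the recipient answers 202 Accepted
    immediately and later delivers the Fulfill/Reject in a separate outgoing
    HTTP request to the callback URL.
    """
    callback_url = headers.get("Callback-Url")
    request_id = headers.get("Request-Id")
    if callback_url and request_id:
        queue_async(callback_url, request_id)  # process in the background
        return 202, b""  # accepted, even if it will ultimately be rejected
    # Synchronous mode: the response body carries the ILP Fulfill/Reject.
    return 200, process_sync()
```

A peer that omits the headers (or an old implementation that ignores them) simply gets today's synchronous behavior, which is what keeps the change backwards compatible.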
Just some general cleanup but otherwise good
👍
@kincaidoneil, last question: if I receive a request that specifies `Prefer: respond-async` and I don't support it, is there a normative response that I must reply with?
@matdehaast For backwards compatibility, if the receiver doesn't support async, they just handle the request synchronously. The sender can identify which mode the receiver chose based on the status code of the response (200 vs 202) or if the body is non-empty.
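The sender-side check described above could be sketched like this (treating an empty `200` body as an async ACK is one possible reading of "or if the body is non-empty"):

```python
def response_mode(status, body):
    """Classify the peer's reply to an ILP Prepare."""
    if status == 202 or (status == 200 and not body):
        return "async-ack"   # Fulfill/Reject will arrive in a separate request
    if status == 200:
        return "sync-reply"  # body carries the ILP Fulfill/Reject packet
    return "error"
```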
> - **Callback URL Header** — _Optional_. Callback URL of the origin connector to send an asynchronous HTTP request with the ILP Fulfill/Reject.
> - **Request Id Header** — _Optional_. UUIDv4 to uniquely identify this ILP Prepare, and correlate the corresponding ILP Fulfill/Reject.
These aren't _Optional_ in async mode, right?
Thanks. Changed `Request-Id` so it's required, but kept `Callback-Url` optional so peers can still use other mechanisms to set/exchange the callback URL.
ILP-over-HTTP has a known issue: if the response containing a Fulfill packet is dropped (e.g. due to a network error), there's no mechanism to retry it, so the balances between the two peers will get out of sync and require manual reconciliation. Also, if the request with an ILP Prepare is dropped, it won't be rejected until the packet expires, which is undesirable since payments may take longer.
This proposes a backwards-compatible change to ILP-over-HTTP to separate the Prepare and Fulfill/Reject into two separate HTTP request/response exchanges. Each packet has a corresponding idempotency key, enabling clients to safely retry packets without triggering the same side effect multiple times.
The flow looks roughly like:

1. Alice sends a `POST` with the ILP Prepare to Bob.
2. Bob acknowledges the Prepare with a `200` status. The response is dropped due to a network error.
3. Alice retries the request with the same idempotency key; Bob ignores the duplicate and responds again with `200`.
4. Bob sends a `POST` to Alice with the Fulfill packet and the same idempotency key as the ILP Prepare.
5. Alice responds with a `200` status.
6. Bob receives the `200` response and knows no retry is necessary.
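The flow above can be simulated end to end. This sketch stands in for the real HTTP layer with plain function calls (all names are illustrative), and shows the Prepare's side effect running exactly once despite the retry:

```python
class Receiver:
    """Bob's side: processes each idempotency key at most once and
    re-acknowledges duplicates with the same successful status."""

    def __init__(self):
        self.processed = []

    def receive(self, idempotency_key, packet):
        if idempotency_key not in self.processed:
            self.processed.append(idempotency_key)  # side effect runs once
        return 200

bob = Receiver()
first_ack = bob.receive("key-1", b"prepare")  # this ack is lost in transit
retry_ack = bob.receive("key-1", b"prepare")  # Alice retries with the same key
assert first_ack == retry_ack == 200
assert bob.processed == ["key-1"]  # the Prepare was processed only once
```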