Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix bug failing CS-RAA resend order on pending commitment signatures #3149

Merged

Conversation

alecchendev
Copy link
Contributor

Across disconnects we may end up in a situation where we need to send a commitment_signed and then revoke_and_ack. We need to make sure that if the signer is pending for CS but not RAA, we don't screw up the order by sending the RAA first. We defer sending the RAA by setting the flag signer_pending_revoke_and_ack, which will lead to us generating an RAA upon signer_unblocked.

We test this for both the case where we send messages after a channel reestablish, as well as restoring a channel after persisting a monitor update asynchronously.

@codecov-commenter
Copy link

codecov-commenter commented Jun 30, 2024

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

Attention: Patch coverage is 75.00000% with 9 lines in your changes missing coverage. Please review.

Project coverage is 90.60%. Comparing base (3ccf064) to head (315193d).
Report is 22 commits behind head on main.

Files Patch % Lines
lightning/src/ln/channel.rs 57.89% 5 Missing and 3 partials ⚠️
lightning/src/ln/channelmanager.rs 94.11% 0 Missing and 1 partial ⚠️

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3149      +/-   ##
==========================================
+ Coverage   89.81%   90.60%   +0.79%     
==========================================
  Files         121      121              
  Lines       99314   105075    +5761     
  Branches    99314   105075    +5761     
==========================================
+ Hits        89195    95205    +6010     
+ Misses       7514     7313     -201     
+ Partials     2605     2557      -48     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Collaborator

@TheBlueMatt TheBlueMatt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Good test, too, though I need to go look at the test coverage changes. I assume this was hit in prod? We should get the rest of the async signing stuff upstream so we can spend some cycles fuzzing.

@@ -5475,10 +5476,6 @@ impl<SP: Deref> Channel<SP> where
&self.context.channel_id(), if update_fee.is_some() { " update_fee," } else { "" },
update_add_htlcs.len(), update_fulfill_htlcs.len(), update_fail_htlcs.len(), update_fail_malformed_htlcs.len());
let commitment_signed = if let Ok(update) = self.send_commitment_no_state_update(logger).map(|(cu, _)| cu) {
if self.context.signer_pending_commitment_update {
log_trace!(logger, "Commitment update generated: clearing signer_pending_commitment_update");
self.context.signer_pending_commitment_update = false;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmmm, I'm a bit skeptical of this. get_last_commitment_update_for_send is called in:

  • monitor_updating_restored, which implies to me I think maybe if the user is doing both async monitor updating and async signing they may end up double-sending CS if (a) an update to the channel generates a monitor and signing update, (b) the signer completes but before it calls signer_maybe_unblocked (c) the monitor persistence completes, causing get_last_commitment_update_for_send to return a valid commitment which we send to our counterparty, then (d) the user calls signer_maybe_unblocked and we send it again, possibly causing the peer to FC on us.
  • channel_reestablish, where I think we can hit a similar double-send when reconnecting to our counterparty where we (a) get a new update to sign, (b) the peer disconnects + reconnects, sending us their channel_reestablish, then (c) the user calls signer_maybe_unblocked which causes us to re-send the CS. This one may actually already be broken today, and implies we should be unsetting signer_pending_commitment_update when a peer disconnects.

In any case, we should almost certainly add a test for the two above cases.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

monitor_updating_restored, which implies to me I think maybe if the user is doing both async monitor updating and async signing they may end up double-sending CS if (a) an update to the channel generates a monitor and signing update, (b) the signer completes but before it calls signer_maybe_unblocked (c) the monitor persistence completes, causing get_last_commitment_update_for_send to return a valid commitment which we send to our counterparty, then (d) the user calls signer_maybe_unblocked and we send it again, possibly causing the peer to FC on us.

I added a test for this, and I was expecting it to fail and need to change things, but it worked. I realized, i think it won't double send because before the monitor update completes it never calls the signer method, so the pending flag is never set. Even though it works, I think I'll change it back, I can see how setting the flags whenever the signer method is called is less prone to accidental double sends.

channel_reestablish, where I think we can hit a similar double-send when reconnecting to our counterparty where we (a) get a new update to sign, (b) the peer disconnects + reconnects, sending us their channel_reestablish, then (c) the user calls signer_maybe_unblocked which causes us to re-send the CS. This one may actually already be broken today, and implies we should be unsetting signer_pending_commitment_update when a peer disconnects.

great catch! yea this fails with my change. changing it back!

lightning/src/ln/async_signer_tests.rs Outdated Show resolved Hide resolved
lightning/src/ln/async_signer_tests.rs Outdated Show resolved Hide resolved
Copy link
Contributor Author

@alecchendev alecchendev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume this was hit in prod?

Yep, have had a couple related issues here in the past

We should get the rest of the async signing stuff upstream so we can spend some cycles fuzzing.

SGTM!

@@ -5475,10 +5476,6 @@ impl<SP: Deref> Channel<SP> where
&self.context.channel_id(), if update_fee.is_some() { " update_fee," } else { "" },
update_add_htlcs.len(), update_fulfill_htlcs.len(), update_fail_htlcs.len(), update_fail_malformed_htlcs.len());
let commitment_signed = if let Ok(update) = self.send_commitment_no_state_update(logger).map(|(cu, _)| cu) {
if self.context.signer_pending_commitment_update {
log_trace!(logger, "Commitment update generated: clearing signer_pending_commitment_update");
self.context.signer_pending_commitment_update = false;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

monitor_updating_restored, which implies to me I think maybe if the user is doing both async monitor updating and async signing they may end up double-sending CS if (a) an update to the channel generates a monitor and signing update, (b) the signer completes but before it calls signer_maybe_unblocked (c) the monitor persistence completes, causing get_last_commitment_update_for_send to return a valid commitment which we send to our counterparty, then (d) the user calls signer_maybe_unblocked and we send it again, possibly causing the peer to FC on us.

I added a test for this, and I was expecting it to fail and need to change things, but it worked. I realized, i think it won't double send because before the monitor update completes it never calls the signer method, so the pending flag is never set. Even though it works, I think I'll change it back, I can see how setting the flags whenever the signer method is called is less prone to accidental double sends.

channel_reestablish, where I think we can hit a similar double-send when reconnecting to our counterparty where we (a) get a new update to sign, (b) the peer disconnects + reconnects, sending us their channel_reestablish, then (c) the user calls signer_maybe_unblocked which causes us to re-send the CS. This one may actually already be broken today, and implies we should be unsetting signer_pending_commitment_update when a peer disconnects.

great catch! yea this fails with my change. changing it back!

@alecchendev alecchendev force-pushed the 2024-06-async-sign-cs-raa-order branch from 536b6b4 to 119b703 Compare July 3, 2024 02:16
Comment on lines 360 to 361
fn do_test_async_commitment_signature_ordering(monitor_update_failure: bool,
enable_signer_before_monitor_completion: bool, enable_signer_before_reestablish: bool) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

after the latest additional tests this helper is a bit unwieldy, not sure the best way to address it. I could copy this test 4 times and get rid of the conditionals, but it adds a lot of filler...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yaknow actually i realized for what these are testing it isn't necessary to include them in this test (they don't care about the CS-RAA ordering, just that the CS signer pending flag is cleared at the right time), i'll just move them to their own separate tests tomorrow

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay, fixed this, moved them into the peer disconnection test

@alecchendev alecchendev force-pushed the 2024-06-async-sign-cs-raa-order branch 2 times, most recently from ca37547 to 221ef18 Compare July 3, 2024 21:11
Copy link
Contributor

@optout21 optout21 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

funding_signed,
channel_ready,
order: self.context.resend_order.clone(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: clone() is not needed on order

Suggested change
order: self.context.resend_order.clone(),
order: self.context.resend_order,

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, i think it actually does require clone since RAACommitmentOrder doesn't implement Copy?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. RAACommitmentOrder is a primitive enum, but it indeed does not have Copy trait. I tried it without and it worked, but to be safe for future changes, either the clone is needed or the enum should be marked explicitly Copy. It's fine with the clone.

lightning/src/ln/channel.rs Outdated Show resolved Hide resolved
TheBlueMatt
TheBlueMatt previously approved these changes Jul 8, 2024
Copy link
Collaborator

@TheBlueMatt TheBlueMatt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, feel free to squash the fixups.

let bs_second_commitment_signed = get_htlc_update_msgs!(nodes[1], nodes[0].node.get_our_node_id());
check_added_monitors!(nodes[1], 1);

// The rest of this is boilerplate for resolving the previous state.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At least some of this may be replaceable with one of the inscruitable commitment_signed_dance variants.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i feared this day would come...

I tried my best, but didn't get very far, it seems this is a little different from the normal exchange so i'm not sure how much more time i should spend trying:

nodes0    nodes1
uah ->
cs ->
raa ->
<- raa
<- cs
cs ->
raa ->
<- raa

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the first few messages (after the uah) could be, but if you fought with it a bit I'm not gonna nitpick.

@TheBlueMatt
Copy link
Collaborator

Merging as the only changes since @optout21's "LGTM" are addressing his feedback.

@TheBlueMatt TheBlueMatt merged commit f48a273 into lightningdevkit:main Jul 9, 2024
12 of 17 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants