Fix bug failing CS-RAA resend order on pending commitment signatures #3149

alecchendev · 2024-06-30T21:05:16Z

Across disconnects we may end up in a situation where we need to send a commitment_signed and then revoke_and_ack. We need to make sure that if the signer is pending for CS but not RAA, we don't screw up the order by sending the RAA first. We defer sending the RAA by setting the flag signer_pending_revoke_and_ack, which will lead to us generating an RAA upon signer_unblocked.

We test this for both the case where we send messages after a channel reestablish, as well as restoring a channel after persisting a monitor update asynchronously.

codecov-commenter · 2024-06-30T21:12:20Z

⚠️ Please install the to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

Attention: Patch coverage is 75.00000% with 9 lines in your changes missing coverage. Please review.

Project coverage is 90.60%. Comparing base (3ccf064) to head (315193d).
Report is 22 commits behind head on main.

Files	Patch %	Lines
lightning/src/ln/channel.rs	57.89%	5 Missing and 3 partials ⚠️
lightning/src/ln/channelmanager.rs	94.11%	0 Missing and 1 partial ⚠️

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3149      +/-   ##
==========================================
+ Coverage   89.81%   90.60%   +0.79%     
==========================================
  Files         121      121              
  Lines       99314   105075    +5761     
  Branches    99314   105075    +5761     
==========================================
+ Hits        89195    95205    +6010     
+ Misses       7514     7313     -201     
+ Partials     2605     2557      -48

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

TheBlueMatt

Thanks! Good test, too, though I need to go look at the test coverage changes. I assume this was hit in prod? We should get the rest of the async signing stuff upstream so we can spend some cycles fuzzing.

TheBlueMatt · 2024-07-02T21:51:15Z

lightning/src/ln/channel.rs

@@ -5475,10 +5476,6 @@ impl<SP: Deref> Channel<SP> where
 				&self.context.channel_id(), if update_fee.is_some() { " update_fee," } else { "" },
 				update_add_htlcs.len(), update_fulfill_htlcs.len(), update_fail_htlcs.len(), update_fail_malformed_htlcs.len());
 		let commitment_signed = if let Ok(update) = self.send_commitment_no_state_update(logger).map(|(cu, _)| cu) {
-			if self.context.signer_pending_commitment_update {
-				log_trace!(logger, "Commitment update generated: clearing signer_pending_commitment_update");
-				self.context.signer_pending_commitment_update = false;


Hmmmm, I'm a bit skeptical of this. get_last_commitment_update_for_send is called in:

monitor_updating_restored, which implies to me I think maybe if the user is doing both async monitor updating and async signing they may end up double-sending CS if (a) an update to the channel generates a monitor and signing update, (b) the signer completes but before it calls signer_maybe_unblocked (c) the monitor persistence completes, causing get_last_commitment_update_for_send to return a valid commitment which we send to our counterparty, then (d) the user calls signer_maybe_unblocked and we send it again, possibly causing the peer to FC on us.

channel_reestablish, where I think we can hit a similar double-send when reconnecting to our counterparty where we (a) get a new update to sign, (b) the peer disconnects + reconnects, sending us their channel_reestablish, then (c) the user calls signer_maybe_unblocked which causes us to re-send the CS. This one may actually already be broken today, and implies we should be unsetting signer_pending_commitment_update when a peer disconnects.

In any case, we should almost certainly add a test for the two above cases.

monitor_updating_restored, which implies to me I think maybe if the user is doing both async monitor updating and async signing they may end up double-sending CS if (a) an update to the channel generates a monitor and signing update, (b) the signer completes but before it calls signer_maybe_unblocked (c) the monitor persistence completes, causing get_last_commitment_update_for_send to return a valid commitment which we send to our counterparty, then (d) the user calls signer_maybe_unblocked and we send it again, possibly causing the peer to FC on us.

I added a test for this, and I was expecting it to fail and need to change things, but it worked. I realized, i think it won't double send because before the monitor update completes it never calls the signer method, so the pending flag is never set. Even though it works, I think I'll change it back, I can see how setting the flags whenever the signer method is called is less prone to accidental double sends.

channel_reestablish, where I think we can hit a similar double-send when reconnecting to our counterparty where we (a) get a new update to sign, (b) the peer disconnects + reconnects, sending us their channel_reestablish, then (c) the user calls signer_maybe_unblocked which causes us to re-send the CS. This one may actually already be broken today, and implies we should be unsetting signer_pending_commitment_update when a peer disconnects.

great catch! yea this fails with my change. changing it back!

lightning/src/ln/async_signer_tests.rs

alecchendev

I assume this was hit in prod?

Yep, have had a couple related issues here in the past

We should get the rest of the async signing stuff upstream so we can spend some cycles fuzzing.

SGTM!

alecchendev · 2024-07-03T02:12:47Z

lightning/src/ln/channel.rs

@@ -5475,10 +5476,6 @@ impl<SP: Deref> Channel<SP> where
 				&self.context.channel_id(), if update_fee.is_some() { " update_fee," } else { "" },
 				update_add_htlcs.len(), update_fulfill_htlcs.len(), update_fail_htlcs.len(), update_fail_malformed_htlcs.len());
 		let commitment_signed = if let Ok(update) = self.send_commitment_no_state_update(logger).map(|(cu, _)| cu) {
-			if self.context.signer_pending_commitment_update {
-				log_trace!(logger, "Commitment update generated: clearing signer_pending_commitment_update");
-				self.context.signer_pending_commitment_update = false;


monitor_updating_restored, which implies to me I think maybe if the user is doing both async monitor updating and async signing they may end up double-sending CS if (a) an update to the channel generates a monitor and signing update, (b) the signer completes but before it calls signer_maybe_unblocked (c) the monitor persistence completes, causing get_last_commitment_update_for_send to return a valid commitment which we send to our counterparty, then (d) the user calls signer_maybe_unblocked and we send it again, possibly causing the peer to FC on us.

I added a test for this, and I was expecting it to fail and need to change things, but it worked. I realized, i think it won't double send because before the monitor update completes it never calls the signer method, so the pending flag is never set. Even though it works, I think I'll change it back, I can see how setting the flags whenever the signer method is called is less prone to accidental double sends.

channel_reestablish, where I think we can hit a similar double-send when reconnecting to our counterparty where we (a) get a new update to sign, (b) the peer disconnects + reconnects, sending us their channel_reestablish, then (c) the user calls signer_maybe_unblocked which causes us to re-send the CS. This one may actually already be broken today, and implies we should be unsetting signer_pending_commitment_update when a peer disconnects.

great catch! yea this fails with my change. changing it back!

alecchendev · 2024-07-03T02:19:40Z

lightning/src/ln/async_signer_tests.rs

+fn do_test_async_commitment_signature_ordering(monitor_update_failure: bool,
+	enable_signer_before_monitor_completion: bool, enable_signer_before_reestablish: bool) {


after the latest additional tests this helper is a bit unwieldy, not sure the best way to address it. I could copy this test 4 times and get rid of the conditionals, but it adds a lot of filler...

yaknow actually i realized for what these are testing it isn't necessary to include them in this test (they don't care about the CS-RAA ordering, just that the CS signer pending flag is cleared at the right time), i'll just move them to their own separate tests tomorrow

okay, fixed this, moved them into the peer disconnection test

optout21

LGTM

optout21 · 2024-07-05T13:25:28Z

lightning/src/ln/channel.rs

 			funding_signed,
 			channel_ready,
+			order: self.context.resend_order.clone(),


Nit: clone() is not needed on order

Suggested change

order: self.context.resend_order.clone(),

order: self.context.resend_order,

hmm, i think it actually does require clone since RAACommitmentOrder doesn't implement Copy?

OK. RAACommitmentOrder is a primitive enum, but it indeed does not have Copy trait. I tried it without and it worked, but to be safe for future changes, either the clone is needed or the enum should be marked explicitly Copy. It's fine with the clone.

lightning/src/ln/channel.rs

TheBlueMatt

LGTM, feel free to squash the fixups.

TheBlueMatt · 2024-07-08T17:52:56Z

lightning/src/ln/async_signer_tests.rs

+	let bs_second_commitment_signed = get_htlc_update_msgs!(nodes[1], nodes[0].node.get_our_node_id());
+	check_added_monitors!(nodes[1], 1);
+
+	// The rest of this is boilerplate for resolving the previous state.


At least some of this may be replaceable with one of the inscruitable commitment_signed_dance variants.

i feared this day would come...

I tried my best, but didn't get very far, it seems this is a little different from the normal exchange so i'm not sure how much more time i should spend trying:

nodes0 nodes1 uah -> cs -> raa -> <- raa <- cs cs -> raa -> <- raa

I think the first few messages (after the uah) could be, but if you fought with it a bit I'm not gonna nitpick.

TheBlueMatt · 2024-07-09T13:48:05Z

Merging as the only changes since @optout21's "LGTM" are addressing his feedback.

alecchendev force-pushed the 2024-06-async-sign-cs-raa-order branch from fe7afcf to 536b6b4 Compare June 30, 2024 22:30

This was referenced Jun 30, 2024

Handle fallible per commitment point for RAA #3150

Merged

Handle fallible per commitment point in channel establishment #3109

Open

Handle fallible commitment secret #3152

Merged

TheBlueMatt reviewed Jul 2, 2024

View reviewed changes

Allow sending RAA upon unblocking the signer

3775edd

alecchendev commented Jul 3, 2024

View reviewed changes

alecchendev force-pushed the 2024-06-async-sign-cs-raa-order branch from 536b6b4 to 119b703 Compare July 3, 2024 02:16

alecchendev commented Jul 3, 2024

View reviewed changes

alecchendev force-pushed the 2024-06-async-sign-cs-raa-order branch 2 times, most recently from ca37547 to 221ef18 Compare July 3, 2024 21:11

optout21 reviewed Jul 5, 2024

View reviewed changes

TheBlueMatt previously approved these changes Jul 8, 2024

View reviewed changes

Fix bug failing CS-RAA resend order on pending commitment signatures

315193d

alecchendev dismissed TheBlueMatt’s stale review via 315193d July 9, 2024 00:35

alecchendev force-pushed the 2024-06-async-sign-cs-raa-order branch from 221ef18 to 315193d Compare July 9, 2024 00:35

TheBlueMatt approved these changes Jul 9, 2024

View reviewed changes

TheBlueMatt merged commit f48a273 into lightningdevkit:main Jul 9, 2024
12 of 17 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix bug failing CS-RAA resend order on pending commitment signatures #3149

Fix bug failing CS-RAA resend order on pending commitment signatures #3149

alecchendev commented Jun 30, 2024

codecov-commenter commented Jun 30, 2024 •

edited

Loading

TheBlueMatt left a comment

TheBlueMatt Jul 2, 2024

alecchendev Jul 3, 2024

alecchendev left a comment

alecchendev Jul 3, 2024

alecchendev Jul 3, 2024

alecchendev Jul 3, 2024

alecchendev Jul 3, 2024

optout21 left a comment

optout21 Jul 5, 2024

alecchendev Jul 9, 2024

optout21 Jul 9, 2024

TheBlueMatt left a comment

TheBlueMatt Jul 8, 2024

alecchendev Jul 9, 2024

TheBlueMatt Jul 9, 2024

TheBlueMatt commented Jul 9, 2024

		fn do_test_async_commitment_signature_ordering(monitor_update_failure: bool,
		enable_signer_before_monitor_completion: bool, enable_signer_before_reestablish: bool) {

	order: self.context.resend_order.clone(),
	order: self.context.resend_order,

Fix bug failing CS-RAA resend order on pending commitment signatures #3149

Fix bug failing CS-RAA resend order on pending commitment signatures #3149

Conversation

alecchendev commented Jun 30, 2024

codecov-commenter commented Jun 30, 2024 • edited Loading

Codecov Report

TheBlueMatt left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

alecchendev left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

optout21 left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

TheBlueMatt left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

TheBlueMatt commented Jul 9, 2024

codecov-commenter commented Jun 30, 2024 •

edited

Loading