Block monitor updates to ensure preimages are in each MPP part #3120

Merged
TheBlueMatt merged 5 commits into lightningdevkit:main from 2023-11-mon-claim-bug on Jul 8, 2024

Conversation

TheBlueMatt
Collaborator

If we claim an MPP payment and only persist some of the
ChannelMonitorUpdates which include the preimage prior to
shutting down, we may be in a state where some of our
ChannelMonitors have the preimage for a payment while others do
not.

This, it turns out, is actually mostly safe - on startup the
ChannelManager will detect that it has a payment marked as unclaimed
for which a corresponding payment preimage exists in a ChannelMonitor,
and will go claim the other MPP parts. This works so long as the
ChannelManager has been persisted after the payment has been received
but prior to the PaymentClaimable event being processed (and the claim
itself occurring). This is not always true, and certainly not required
by our API, but our lightning-background-processor today does persist
prior to event handling, so this is generally true, subject to some
race conditions.

In order to address this we need to copy payment preimages across
channels irrespective of the ChannelManager's payment state, but this
introduces another wrinkle - if one channel makes substantial progress
while other channel(s) are still waiting to get the payment preimage
into their ChannelMonitor(s), and the ChannelManager hasn't been
persisted after the payment was received, we may end up without the
preimage on disk.

Here, we address this issue by using the new
RAAMonitorUpdateBlockingAction variant for this specific case. We
block persistence of an RAA ChannelMonitorUpdate which may remove
the preimage from disk until all channels have had the preimage
added to their ChannelMonitor.

We do this only in-memory (and not on disk) as we can recreate this
blocker during the startup re-claim logic.

This will enable us to claim MPP parts without using the
ChannelManager's payment state in later work.
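
To make the mechanism concrete before the review excerpts below, here is a minimal, self-contained sketch of the bookkeeping idea. The types and names (PendingMppClaim, preimage_persisted) are illustrative stand-ins, not LDK's actual API: every MPP part shares one pending-claim record, and the RAA blockers are only released once no part is still missing the preimage in its ChannelMonitor.

// Simplified stand-in for the per-claim bookkeeping; types and names are illustrative.
use std::sync::{Arc, Mutex};

type ChannelId = u64;

struct PendingMppClaim {
	// Channels whose ChannelMonitor has not yet persisted the preimage.
	channels_without_preimage: Vec<ChannelId>,
	// Channels that already have it and are waiting on the others.
	channels_with_preimage: Vec<ChannelId>,
}

// Called when the ChannelMonitorUpdate adding the preimage for `chan` completes.
// Returns the channels whose RAA blockers can now be released; non-empty only
// once the final MPP part has the preimage in its monitor.
fn preimage_persisted(claim: &Arc<Mutex<PendingMppClaim>>, chan: ChannelId) -> Vec<ChannelId> {
	let mut state = claim.lock().unwrap();
	if let Some(pos) = state.channels_without_preimage.iter().position(|&c| c == chan) {
		let done = state.channels_without_preimage.swap_remove(pos);
		state.channels_with_preimage.push(done);
	}
	if state.channels_without_preimage.is_empty() {
		std::mem::take(&mut state.channels_with_preimage)
	} else {
		Vec::new()
	}
}

fn main() {
	// A two-part MPP claim over channels 1 and 2.
	let claim = Arc::new(Mutex::new(PendingMppClaim {
		channels_without_preimage: vec![1, 2],
		channels_with_preimage: Vec::new(),
	}));
	assert!(preimage_persisted(&claim, 1).is_empty()); // channel 2 still lacks the preimage
	assert_eq!(preimage_persisted(&claim, 2), vec![1, 2]); // all monitors have it; free the blockers
}

The real implementation keys channels by (counterparty_node_id, funding_outpoint, channel_id, htlc_id) and stores the shared record behind a PendingMPPClaimPointer, as shown in the diff excerpts discussed below.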

codecov-commenter commented Jun 13, 2024

⚠️ Please install the Codecov GitHub app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

Attention: Patch coverage is 89.68750% with 33 lines in your changes missing coverage. Please review.

Project coverage is 90.54%. Comparing base (88e1b56) to head (2d1306a).
Report is 25 commits behind head on main.

Files Patch % Lines
lightning/src/ln/channelmanager.rs 85.90% 22 Missing and 9 partials ⚠️
lightning/src/ln/chanmon_update_fail_tests.rs 97.95% 2 Missing ⚠️

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3120      +/-   ##
==========================================
+ Coverage   89.83%   90.54%   +0.71%     
==========================================
  Files         121      121              
  Lines       98900   106185    +7285     
  Branches    98900   106185    +7285     
==========================================
+ Hits        88847    96150    +7303     
+ Misses       7457     7419      -38     
- Partials     2596     2616      +20     


@jkczyz jkczyz self-requested a review June 17, 2024 14:37
@valentinewallace valentinewallace self-requested a review June 17, 2024 21:10
Contributor

@jkczyz jkczyz left a comment

Need to take a more thorough look at the last two commits still.

htlc_id: htlc_id.0.unwrap(),
}))
},
// 1 is ClaimedMPPPayment and is handled in the general odd handling below
Contributor

Should we expand impl_writeable_tlv_based_enum_upgradable with an optional section for variants to ignore instead of implementing this by hand?

Collaborator Author

Mmm, that wasn't as hard as I thought.

@valentinewallace
Contributor

Needs rebase + not building for me locally

@TheBlueMatt
Collaborator Author

Rebased and fixed the issue.

Contributor

@jkczyz jkczyz left a comment

Sorry for being slow on this. I'm not as familiar with this code. Still need to look at the test.

Comment on lines +6669 to +6749
if let Some((counterparty_node_id, chan_id, htlc_id, claim_ptr)) = pending_mpp_claim {
	let per_peer_state = self.per_peer_state.read().unwrap();
	per_peer_state.get(&counterparty_node_id).map(|peer_state_mutex| {
		let mut peer_state = peer_state_mutex.lock().unwrap();
		let blockers_entry = peer_state.actions_blocking_raa_monitor_updates.entry(chan_id);
		if let btree_map::Entry::Occupied(mut blockers) = blockers_entry {
			blockers.get_mut().retain(|blocker|
				if let &RAAMonitorUpdateBlockingAction::ClaimedMPPPayment { pending_claim } = &blocker {
					if *pending_claim == claim_ptr {
						let mut pending_claim_state_lock = pending_claim.0.lock().unwrap();
						let pending_claim_state = &mut *pending_claim_state_lock;
						pending_claim_state.channels_without_preimage.retain(|(cp, outp, cid, hid)| {
							if *cp == counterparty_node_id && *cid == chan_id && *hid == htlc_id {
								pending_claim_state.channels_with_preimage.push((*cp, *outp, *cid));
								false
							} else { true }
						});
						if pending_claim_state.channels_without_preimage.is_empty() {
							for (cp, outp, cid) in pending_claim_state.channels_with_preimage.iter() {
								freed_channels.push((*cp, *outp, *cid, blocker.clone()));
							}
						}
						!pending_claim_state.channels_without_preimage.is_empty()
					} else { true }
				} else { true }
			);
			if blockers.get().is_empty() {
				blockers.remove();
			}
		}
	});
}
Contributor

To avoid rustfmt mangling this later, it might be worth turning this into some chained calls. Might flatten it a bit?

Collaborator Author

Hmm, I don't actually see much room to flatten it :/ Most of the ifs are pattern-matching, not just checking and we have more logic than can just be done with a filter :/

Comment on lines +6495 to +6555
let peer_state_mutex = per_peer_state.entry(counterparty_node_id).or_insert_with(||
	Mutex::new(PeerState {
		channel_by_id: new_hash_map(),
		inbound_channel_request_by_id: new_hash_map(),
		latest_features: InitFeatures::empty(),
		pending_msg_events: Vec::new(),
		in_flight_monitor_updates: BTreeMap::new(),
		monitor_update_blocked_actions: BTreeMap::new(),
		actions_blocking_raa_monitor_updates: BTreeMap::new(),
		is_connected: false,
	}));
let mut peer_state = peer_state_mutex.lock().unwrap();
Contributor

For my understanding, in what case would we not have a PeerState for a counterparty?

Collaborator Author

If we're disconnected from the peer and we have no more channels, we may have pruned the PeerState entry.

Contributor

@valentinewallace valentinewallace left a comment

Sorry for the delay

@TheBlueMatt TheBlueMatt force-pushed the 2023-11-mon-claim-bug branch 2 times, most recently from 0045ded to de40c83 on June 25, 2024 17:48
@TheBlueMatt
Collaborator Author

Rebased and addressed comments.

Comment on lines 846 to +851
 // Note that FreeOtherChannelImmediately should never be written - we were supposed to free
 // *immediately*. However, for simplicity we implement read/write here.
 (1, FreeOtherChannelImmediately) => {
 	(0, downstream_counterparty_node_id, required),
 	(2, downstream_funding_outpoint, required),
-	(4, blocking_action, required),
+	(4, blocking_action, upgradable_required),
Contributor

Can we use unread_variants for this now?

Collaborator Author

Hmm, maybe? Note that this isn't an encoding change, anyway - it just can sometimes short-circuit and return None for the whole enum.
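
For readers unfamiliar with the pattern being discussed, here is a toy sketch of what "write as odd type 255 and decline to read" means. The types and functions below are illustrative stand-ins, not LDK's MaybeReadable trait or the macro's real output: the in-memory-only variant is serialized as an empty odd TLV type, and the reader maps it to None instead of erroring, so the whole enum value is simply dropped on deserialization.

// Stand-in for the real enum; only the odd-typed, in-memory-only variant matters here.
#[derive(Debug, PartialEq)]
enum BlockingAction {
	ForwardedPayment { htlc_id: u64 },
	// Written as odd type 255 with no contents, never reconstructed on read.
	ClaimedMppPayment,
}

fn write(action: &BlockingAction, out: &mut Vec<u8>) {
	match action {
		BlockingAction::ForwardedPayment { htlc_id } => {
			out.push(0); // even type: a reader must understand it
			out.extend_from_slice(&htlc_id.to_be_bytes());
		}
		// Odd type, empty contents: readers are allowed to skip it.
		BlockingAction::ClaimedMppPayment => out.push(255),
	}
}

// The MaybeReadable-style read: Ok(None) means "understood enough to skip",
// so deserialization short-circuits rather than failing.
fn maybe_read(bytes: &[u8]) -> Result<Option<BlockingAction>, ()> {
	match bytes.split_first() {
		Some((&0, rest)) => {
			let htlc_id = u64::from_be_bytes(rest.try_into().map_err(|_| ())?);
			Ok(Some(BlockingAction::ForwardedPayment { htlc_id }))
		}
		Some((&t, _)) if t % 2 == 1 => Ok(None), // odd/unread variant: skip, not an error
		_ => Err(()),
	}
}

fn main() {
	let mut buf = Vec::new();
	write(&BlockingAction::ClaimedMppPayment, &mut buf);
	assert_eq!(maybe_read(&buf), Ok(None)); // returns None for the whole enum
}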

Comment on lines 6307 to 6344
let (source_claim_pairs, pending_claim_ptr_opt) = if sources.len() > 1 {
	let mut pending_claims = PendingMPPClaim {
		channels_without_preimage: Vec::new(),
		channels_with_preimage: Vec::new(),
	};
	let (source_claim_pairs, channels_without_preimage) = sources.into_iter().filter_map(|htlc| {
		if let Some(cp_id) = htlc.prev_hop.counterparty_node_id {
			let htlc_id = htlc.prev_hop.htlc_id;
			let chan_id = htlc.prev_hop.channel_id;
			let chan_outpoint = htlc.prev_hop.outpoint;
			Some((
				(htlc, Some((cp_id, chan_id, htlc_id))),
				(cp_id, chan_outpoint, chan_id, htlc_id),
			))
		} else {
			None
		}
	}).unzip::<_, _, Vec<_>, _>();
	pending_claims.channels_without_preimage = channels_without_preimage;
	(source_claim_pairs, Some(Arc::new(Mutex::new(pending_claims))))
} else {
	(sources.into_iter().map(|htlc| (htlc, None)).collect(), None)
};
for (htlc, mpp_claim) in source_claim_pairs {
	let mut pending_mpp_claim = None;
	let pending_claim_ptr = pending_claim_ptr_opt.as_ref().map(|pending_claim| {
		pending_mpp_claim = mpp_claim.map(|(cp_id, chan_id, htlc_id)|
			(cp_id, chan_id, htlc_id, PendingMPPClaimPointer(Arc::clone(pending_claim)))
		);
		RAAMonitorUpdateBlockingAction::ClaimedMPPPayment {
			pending_claim: PendingMPPClaimPointer(Arc::clone(pending_claim)),
		}
	});
	self.claim_funds_from_hop(
		htlc.prev_hop, payment_preimage,
		|_, definitely_duplicate| {
			debug_assert!(!definitely_duplicate, "We shouldn't claim duplicatively from a payment");
			Some(MonitorUpdateCompletionAction::PaymentClaimed { payment_hash })
			(Some(MonitorUpdateCompletionAction::PaymentClaimed { payment_hash, pending_mpp_claim }), pending_claim_ptr)
Contributor

@valentinewallace valentinewallace Jun 27, 2024

I find writing it this way a lot less confusing than using the intermediate source_claim_pairs; it also gets rid of the remaining muts and includes some renames:

			let pending_mpp_claim_opt = if sources.len() > 1 {
				let channels_without_preimage = sources.iter().filter_map(|htlc| {
					htlc.prev_hop.counterparty_node_id.map(|cp_id| {
						(cp_id, htlc.prev_hop.outpoint, htlc.prev_hop.channel_id, htlc.prev_hop.htlc_id)
					})
				}).collect();
				Some(Arc::new(Mutex::new(PendingMPPClaim {
					channels_without_preimage,
					channels_with_preimage: Vec::new()
				})))
			} else { None };
			for htlc in sources.into_iter() {
				let raa_blocker = pending_mpp_claim_opt.as_ref().map(|pending_claim| {
					RAAMonitorUpdateBlockingAction::ClaimedMPPPayment {
						pending_claim: PendingMPPClaimPointer(Arc::clone(pending_claim)),
					}
				});
				let pending_mpp_claim = pending_mpp_claim_opt.as_ref().map(|pending_claim| {
					(htlc.prev_hop.counterparty_node_id.expect("We checked this"), htlc.prev_hop.channel_id,
					 htlc.prev_hop.htlc_id, PendingMPPClaimPointer(Arc::clone(pending_claim)))
				});
				self.claim_funds_from_hop(
					htlc.prev_hop, payment_preimage,
					|_, definitely_duplicate| {
						debug_assert!(!definitely_duplicate, "We shouldn't claim duplicatively from a payment");
						(Some(MonitorUpdateCompletionAction::PaymentClaimed { payment_hash, pending_mpp_claim }), raa_blocker)
					}
				);
			}

Collaborator Author

Thanks, indeed that is much more readable. I further tweaked the variable names slightly.

@valentinewallace
Contributor

LGTM once the outstanding feedback is addressed and Jeff takes another look. Would be good with a squash.

Contributor

@jkczyz jkczyz left a comment

Yeah, feel free to squash. Looks good aside from some minor comments.

blockers.get_mut().retain(|blocker|
	if let &RAAMonitorUpdateBlockingAction::ClaimedMPPPayment { pending_claim } = &blocker {
		if *pending_claim == claim_ptr {
			let mut pending_claim_state_lock = pending_claim.0.lock().unwrap();
Contributor

Are there any lock order concerns? I guess the nesting defines what is allowable.

Collaborator Author

Hmm? I'm not sure I see any lockorder concerns here: we always take the new inner lock last, which is always safe. More generally, our lockorder violation detection is quite good, so as long as we hit a lockorder inversion once in tests we should fail.

@@ -6324,7 +6361,9 @@ where
 		}
 	}
 
-	fn claim_funds_from_hop<ComplFunc: FnOnce(Option<u64>, bool) -> Option<MonitorUpdateCompletionAction>>(
+	fn claim_funds_from_hop<
+		ComplFunc: FnOnce(Option<u64>, bool) -> (Option<MonitorUpdateCompletionAction>, Option<RAAMonitorUpdateBlockingAction>)
Contributor

Looks like (None, Some) is not possible. Should we consider a different nesting or possibly a three-variant enum?

Collaborator Author

Hmm, good catch, not sure it matters though? It complicates the code in the on-chain-claim branch a tiny bit, but more importantly they're logically distinct concepts, and there's nothing wrong with there being an RAA blocker without an event, we just don't have a use for it atm.
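
For concreteness, one possible shape for the three-variant enum floated above. The names here are hypothetical, not what the PR ships; as noted, the tuple return was kept since a blocker without a completion action is conceptually valid, just currently unused.

// Hypothetical names; generic so the sketch stands alone without LDK's types.
enum ClaimCompletion<A, B> {
	// Nothing further to do.
	Nothing,
	// Only a completion action (e.g. the on-chain-claim branch).
	Action(A),
	// A completion action plus an RAA blocker (the MPP-claim branch).
	ActionWithBlocker(A, B),
}

impl<A, B> ClaimCompletion<A, B> {
	// The tuple form the PR uses, with the (None, Some(_)) combination now unrepresentable.
	fn into_tuple(self) -> (Option<A>, Option<B>) {
		match self {
			ClaimCompletion::Nothing => (None, None),
			ClaimCompletion::Action(a) => (Some(a), None),
			ClaimCompletion::ActionWithBlocker(a, b) => (Some(a), Some(b)),
		}
	}
}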

In some cases, we have variants of an enum serialized using
`impl_writeable_tlv_based_enum_upgradable` which we don't want to
write/read. Here we add support for such variants by writing them
using the (odd) type 255 without any contents and using
`MaybeReadable` to decline to read them.

Because we track pending `ChannelMonitorUpdate`s per-peer, we
really need to know what peer an HTLC came from when we go to claim
it on-chain, allowing us to look up any blocked actions in the
`PeerState`. While we could do this by moving the blocked actions
to some global "closed-channel update queue", it's simpler to do it
this way.

While this trades off `ChannelMonitorUpdate` size somewhat (the
`HTLCSource` is included in many of them), which we should be
sensitive to, it will also allow us to (eventually) remove the
SCID -> peer + channel_id lookups we do when claiming or failing
today, which are somewhat annoying to deal with.

If we claim an MPP payment and only persist some of the
`ChannelMonitorUpdate`s which include the preimage prior to
shutting down, we may be in a state where some of our
`ChannelMonitor`s have the preimage for a payment while others do
not.

This, it turns out, is actually mostly safe - on startup the
`ChannelManager` will detect that it has a payment marked as unclaimed
for which a corresponding payment preimage exists in a `ChannelMonitor`,
and will go claim the other MPP parts. This works so long as the
`ChannelManager` has been persisted after the payment has been received
but prior to the `PaymentClaimable` event being processed (and the
claim itself occurring). This is not always true, and certainly not
required by our API, but our `lightning-background-processor` today
does persist prior to event handling, so this is generally true,
subject to some race conditions.

In order to address this race we need to copy payment preimages
across channels irrespective of the `ChannelManager`'s payment
state, but this introduces another wrinkle - if one channel makes
substantial progress while other channel(s) are still waiting to
get the payment preimage into their `ChannelMonitor`(s), and the
`ChannelManager` hasn't been persisted after the payment was
received, we may end up without the preimage on disk.

Here, we address this issue with a new
`RAAMonitorUpdateBlockingAction` variant for this specific case. We
block persistence of an RAA `ChannelMonitorUpdate` which may remove
the preimage from disk until all channels have had the preimage
added to their `ChannelMonitor`.

We do this only in-memory (and not on disk) as we can recreate this
blocker during the startup re-claim logic.

This will enable us to claim MPP parts without using the
`ChannelManager`'s payment state in later work.

If we claim an MPP payment and only persist some of the
`ChannelMonitorUpdate`s which include the preimage prior to
shutting down, we may be in a state where some of our
`ChannelMonitor`s have the preimage for a payment while others do
not.

This, it turns out, is actually mostly safe - on startup the
`ChannelManager` will detect that it has a payment marked as unclaimed
for which a corresponding payment preimage exists in a `ChannelMonitor`,
and will go claim the other MPP parts. This works so long as the
`ChannelManager` has been persisted after the payment has been received
but prior to the `PaymentClaimable` event being processed (and the
claim itself occurring). This is not always true, and certainly not
required by our API, but our `lightning-background-processor` today
does persist prior to event handling, so this is generally true,
subject to some race conditions.

In order to address this we need to copy payment preimages
across channels irrespective of the `ChannelManager`'s payment
state, but this introduces another wrinkle - if one channel makes
substantial progress while other channel(s) are still waiting to
get the payment preimage into their `ChannelMonitor`(s), and the
`ChannelManager` hasn't been persisted after the payment was
received, we may end up without the preimage on disk.

Here, we address this issue by using the new
`RAAMonitorUpdateBlockingAction` variant for this specific case. We
block persistence of an RAA `ChannelMonitorUpdate` which may remove
the preimage from disk until all channels have had the preimage
added to their `ChannelMonitor`.

We do this only in-memory (and not on disk) as we can recreate this
blocker during the startup re-claim logic.

This will enable us to claim MPP parts without using the
`ChannelManager`'s payment state in later work.

@TheBlueMatt
Collaborator Author

Squashed with a trivial cleanup:

$ git diff-tree -U3 bd1a8222 2d1306a3
diff --git a/lightning/src/ln/channelmanager.rs b/lightning/src/ln/channelmanager.rs
index f9bf2b963..67159ef52 100644
--- a/lightning/src/ln/channelmanager.rs
+++ b/lightning/src/ln/channelmanager.rs
@@ -6321,14 +6321,14 @@ where
 				None
 			};
 			for htlc in sources {
-				let this_mpp_claim = pending_mpp_claim_ptr_opt.as_ref().map(|pending_mpp_claim|
+				let this_mpp_claim = pending_mpp_claim_ptr_opt.as_ref().and_then(|pending_mpp_claim|
 					if let Some(cp_id) = htlc.prev_hop.counterparty_node_id {
 						let claim_ptr = PendingMPPClaimPointer(Arc::clone(pending_mpp_claim));
 						Some((cp_id, htlc.prev_hop.channel_id, htlc.prev_hop.htlc_id, claim_ptr))
 					} else {
 						None
 					}
-				).flatten();
+				);
 				let raa_blocker = pending_mpp_claim_ptr_opt.as_ref().map(|pending_claim| {
 					RAAMonitorUpdateBlockingAction::ClaimedMPPPayment {
 						pending_claim: PendingMPPClaimPointer(Arc::clone(pending_claim)),

@TheBlueMatt TheBlueMatt merged commit bcacf85 into lightningdevkit:main Jul 8, 2024
12 of 17 checks passed
4 participants