
Peer storage feature #2943

Status: Open. adi2011 wants to merge 11 commits into main from the PeerStorageFeature branch.

Conversation


adi2011 commented Mar 17, 2024

This would enable nodes to distribute small blobs among their peers, which could be retrieved upon data corruption to resume channel operation or to force-close the channels.

Elaborated in detail here

It's far from complete right now, but please let me know if I am making any mistakes because I am very new to both Rust and LDK :)

adi2011 force-pushed the PeerStorageFeature branch 3 times, most recently from b798c8e to c783fd3, on March 17, 2024 at 20:10
lightning/src/ln/channelmanager.rs (outdated review threads, resolved)
if let Some(funding_txo) = chan.funding_txo {
found_funded_chan = true;
let peer_storage_update = ChannelMonitorUpdate {
update_id: CLOSED_CHANNEL_UPDATE_ID,
Collaborator:
Let's use a real update_id fetched (and incremented) from the channel.

Author:
I can't see any helper in ChannelContext like fetch_and_increment_update_id(); should I create one?
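For illustration, such a helper might look roughly like the sketch below, as it could sit inside ChannelContext's existing impl in channel.rs; it is not part of this PR and assumes the context's latest_monitor_update_id counter.

// Hypothetical helper; assumes ChannelContext's latest_monitor_update_id counter.
pub fn get_and_increment_update_id(&mut self) -> u64 {
    self.latest_monitor_update_id += 1;
    self.latest_monitor_update_id
}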


let mut found_funded_chan = false;
for chan in &sorted_chan_info {
if let Some(funding_txo) = chan.funding_txo {
Collaborator:
Let's also require that the channel have confirmations (or be 0conf).

Author:
I am filtering on ChannelPhase::Funded; I think that would cover this?

lightning/src/ln/channelmanager.rs (multiple outdated review threads, resolved)

adi2011 commented Apr 1, 2024

Thanks for the review @TheBlueMatt, I was AFK last week. I will work on these changes this week.

adi2011 force-pushed the PeerStorageFeature branch 2 times, most recently from b7f4bbc to 6ecf1b8, on April 14, 2024 at 22:16
coderabbitai bot previously approved these changes Apr 14, 2024
@TheBlueMatt left a comment:
Hey, sorry I missed your review request. I generally don't see them as GitHub doesn't generate email notifications for them.

lightning/src/ln/channel.rs (outdated review thread, resolved)
let mut peer_storage = VecWriter(Vec::new());
self.our_peer_storage.read().unwrap().write(&mut peer_storage).unwrap();
let mut encrypted_blob = vec![0;peer_storage.0.len() + 16];
self.inbound_payment_key.encrypt_our_peer_storage(&mut encrypted_blob, 0u64, b"", &peer_storage.0[..]);
Collaborator:
We should define a new key for this, the inbound_payment_key may be shared across multiple nodes.

Author:
Okay, I defined a new method inside NodeManager, get_peer_storage_key. It derives a key from node_secret through the KeysManager.
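For illustration only, a dedicated key could be derived along these lines; this sketch uses the hkdf and sha2 crates rather than LDK's internal helpers, and the label string is an assumption.

use hkdf::Hkdf;
use sha2::Sha256;

// Derive a peer-storage-specific key from the node secret via HKDF-SHA256.
fn derive_peer_storage_key(node_secret: &[u8; 32]) -> [u8; 32] {
    let hk = Hkdf::<Sha256>::new(None, node_secret);
    let mut key = [0u8; 32];
    // The info label is a made-up domain separator for this sketch.
    hk.expand(b"peer_storage_key", &mut key)
        .expect("32 bytes is a valid HKDF-SHA256 output length");
    key
}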

pub fn get_encrypted_our_peer_storage(&self) -> Vec<u8> {
let mut peer_storage = VecWriter(Vec::new());
self.our_peer_storage.read().unwrap().write(&mut peer_storage).unwrap();
let mut encrypted_blob = vec![0;peer_storage.0.len() + 16];
Collaborator:
nit: We shouldn't need a new vec here, we should be able to encrypt in-place.
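To illustrate the in-place pattern (LDK ships its own ChaCha20-Poly1305 implementation; the RustCrypto chacha20poly1305 crate is used here purely as a stand-in):

use chacha20poly1305::{ChaCha20Poly1305, Key, Nonce};
use chacha20poly1305::aead::{AeadInPlace, KeyInit};

// Encrypt the blob in place; the 16-byte tag is appended to the same buffer,
// so no separate output Vec is needed.
fn encrypt_blob_in_place(key: &[u8; 32], mut blob: Vec<u8>) -> Vec<u8> {
    let cipher = ChaCha20Poly1305::new(Key::from_slice(key));
    // A fixed nonce is only safe if the key is never reused for another
    // message; a real implementation must guarantee that.
    let nonce_bytes = [0u8; 12];
    let nonce = Nonce::from_slice(&nonce_bytes);
    cipher.encrypt_in_place(nonce, b"", &mut blob)
        .expect("Vec<u8> can always grow to hold the appended tag");
    blob
}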

@@ -2095,6 +2160,7 @@ where
entropy_source: ES,
node_signer: NS,
signer_provider: SP,
our_peer_storage: FairRwLock<OurPeerStorage>,
Collaborator:
I don't see where this is updated at runtime. Presumably we need to be able to update each channel when it receives a new RAA.

Author:
Now this gets updated every time we send a LatestCounterpartyCommitmentTXInfo.

best_block,
counterparty_node_id: Some(counterparty_node_id),
initial_counterparty_commitment_info: None,
balances_empty_height: None,
})
}

pub(crate) fn new_stub(secp_ctx: Secp256k1<secp256k1::All>, stub_channel: &StubChannel, best_block: BestBlock, keys: Signer, funding_info_scriptbuf: ScriptBuf) -> ChannelMonitor<Signer> {
Collaborator:
I kinda wonder if we shouldn't move StubChannel to the chain module and refactor ChannelMonitor methods to operate on either a full ChannelMonitor or a StubChannel? It's at least something to explore as we go to test this logic and see if it's simpler.

Author:
But we'd need the chain module to detect the funding spend and to sweep funds. Maybe create new methods inside the chain module to handle a StubChannel, e.g. watch_stub?

adi2011 force-pushed the PeerStorageFeature branch 3 times, most recently from 1317384 to d609779, on July 14, 2024 at 17:46
adi2011 marked this pull request as ready for review on July 14, 2024 at 17:48
adi2011 force-pushed the PeerStorageFeature branch 3 times, most recently from 4a428cb to 31b4912, on July 17, 2024 at 09:15
adi2011 requested a review from TheBlueMatt on July 20, 2024 at 22:08
adi2011 force-pushed the PeerStorageFeature branch 2 times, most recently from a7a12b0 to 6c893d6, on July 20, 2024 at 22:20
@TheBlueMatt left a comment:

I find myself wondering what the right UX here is - we generally assume that users will not lose ChannelMonitor data, no matter what, and this seems mostly targeted at addressing the case where a user somehow loses a single (or a few) ChannelMonitors for closed channels (if the channels are open, we'll still fail to load).

I think really this feature is useful in one of two cases -

(a) we can use it to detect that we're actually running with stale state by trusting any one of our peers to tell us that, rather than trusting the peer with whom we have a stale channel to tell us our channel is stale. This requires much less data (just the latest commitment numbers for each channel) and is a quick and cheap check in ChannelManager which we can respond to by simply panic'ing.

(b) Handling the case where we've lost all our (latest) data and we want to recover from just our cryptographic keys (and maybe some stale ChannelMonitors). In this case, we really don't want to start with a ChannelManager at all, we want to have some new RecoveryPeerConnector struct that handles peer connections and processes the backup data we're adding here.

I think both are relatively easy changes to this PR, though.
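For a sense of scale, the per-channel record that option (a) needs might be as small as the following sketch; the struct and helper are hypothetical, not anything in this PR.

// Hypothetical minimal record for approach (a): just enough to notice that
// our local view of a channel is behind what a peer stored for us.
struct PeerStorageChannelEntry {
    channel_id: [u8; 32],
    // Commitment transaction numbers count down, so lower means newer.
    latest_holder_commitment_number: u64,
}

// Compare a received backup against live channels and panic if it proves the
// local state is stale; `local_commitment_number` stands in for a lookup into
// the ChannelManager's per-channel state.
fn check_for_stale_state(
    backup: &[PeerStorageChannelEntry],
    local_commitment_number: impl Fn(&[u8; 32]) -> Option<u64>,
) {
    for entry in backup {
        if let Some(local) = local_commitment_number(&entry.channel_id) {
            if local > entry.latest_holder_commitment_number {
                panic!("Peer storage indicates we are running with stale channel state");
            }
        }
    }
}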

let entry = match stub_monitors.entry(funding_outpoint) {
hash_map::Entry::Occupied(_) => {
log_error!(logger, "Failed to add new channel data: channel monitor for given outpoint is already present");
return Err(());
Collaborator:
We should see if this stub monitor has any new information that the previous one did not and update, no?

Author:
Thanks for pointing this out. Fixed it.
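For reference, the occupied-entry arm might end up shaped like this sketch; latest_commitment_number() is a hypothetical accessor, and the real comparison could just as well use a timestamp carried in the blob.

hash_map::Entry::Occupied(mut entry) => {
    // Keep whichever stub reflects newer channel state instead of erroring.
    // Commitment transaction numbers count down, so lower means newer.
    if stub_channel_monitor.latest_commitment_number() < entry.get().latest_commitment_number() {
        entry.insert(stub_channel_monitor);
    }
},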

Self::from_impl(ChannelMonitorImpl {
latest_update_id: STUB_CHANNEL_UPDATE_IDENTIFIER,
commitment_transaction_number_obscure_factor: stub_channel.obscure_factor,
destination_script: ScriptBuf::new(),
Collaborator:
We absolutely need a destination_script to know where to send money.

funding_redeemscript: ScriptBuf::new(),
channel_value_satoshis: stub_channel.channel_value_stoshis,
their_cur_per_commitment_points: stub_channel.their_cur_per_commitment_points,
on_holder_tx_csv: 1,
Collaborator:
We need this to know when we can claim funds (and I think also build witnesses)?

@@ -2232,7 +2237,8 @@ where
entropy_source: ES,
node_signer: NS,
signer_provider: SP,

our_peer_storage: FairRwLock<OurPeerStorage>,
peer_storage: Mutex<HashMap<PublicKey, Vec<u8>>>,
Collaborator:
Shouldn't this be in a PeerHolder?

Author:
We can shift it to PeerState, but I think handling it here is simpler?

Collaborator:
Putting it here is another thing for us to have to clean up when we no longer have channels. Putting it with the rest of the peer state consolidates logic and simplifies things.

@@ -7432,6 +7449,77 @@ where
}
}

fn internal_peer_storage(&self, counterparty_node_id: &PublicKey, msg: &msgs::PeerStorageMessage) {
let per_peer_state = self.per_peer_state.write().unwrap();
Collaborator:
Why does this need a write lock?

Author:
My bad, thanks for pointing this out :)


let mut res = vec![0; msg.data.len() - 16];

match our_peer_storage.decrypt_our_peer_storage(&mut res, msg.data.as_slice(), self.our_peerstorage_encryption_key) {
Collaborator:
Hmm, there's gotta be a more efficient way to handle this message. We're expected to receive it every time we connect, and peers may well send it more often. The ChainMonitor can be smarter here, which may be the best way to go about it - pass the whole blob over the fence and let it decode them/check if it has corresponding ChannelMonitors for everything before deriving the new keys, etc.

Author:
Yes, that's better I think. So I'll write a function which processes the blob and only create the keys if we need to panic and persist the channel.

Collaborator:
Starting to think instead of doing this we should just compare the decrypted contents with the state of our live channels. Sure, it's not entirely robust, but it should be pretty good (and, notably, means at least our current channels are safe to update), and it also avoids changing the interface with ChainMonitor, which would be really nice.

@@ -9978,6 +10108,16 @@ where
let _ = handle_error!(self, self.internal_funding_signed(counterparty_node_id, msg), *counterparty_node_id);
}

fn handle_peer_storage(&self, counterparty_node_id: &PublicKey, msg: &msgs::PeerStorageMessage) {
let _persistence_guard = PersistenceNotifierGuard::notify_on_drop(self);
Collaborator:
Suggested change
let _persistence_guard = PersistenceNotifierGuard::notify_on_drop(self);
// There's no reason to cause a full persistence just because a peer updated their storage, and it will
// never create new events, so just skip all notification.
let _persistence_guard = PersistenceNotifierGuard::optionally_notify(self, || NotifyOption::SkipPersistNoEvents);

let pending_msg_events = &mut peer_state.pending_msg_events;
let peer_storage = self.peer_storage.lock().unwrap().get(counterparty_node_id).unwrap_or(&Vec::<u8>::new()).clone();

if peer_storage.len() > 0 {
Collaborator:
I don't see why we should only send if the length is > 0; shouldn't we send if there's any value in the map at all?

Author:
Fixed this. Thanks :)
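A minimal sketch of that shape, with send_peer_storage_retrieval standing in for whatever message-send event this PR ends up queueing:

// Send the stored blob back only if we actually hold an entry for this peer.
let stored = self.peer_storage.lock().unwrap().get(counterparty_node_id).cloned();
if let Some(data) = stored {
    send_peer_storage_retrieval(counterparty_node_id, data);
}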

(peer_storage.len() as u64).write(writer)?;
for (node_id, peer_data) in peer_storage.iter() {
node_id.write(writer)?;
peer_data.write(writer)?;
Collaborator:
This changes the encoding of ChannelManager such that we cannot upgrade/downgrade. Instead, new data have to be written as TLVs.
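Sketch of how that could look inside ChannelManager::write's existing write_tlv_fields! call; the type number and field name below are placeholders, not what the PR ultimately uses.

write_tlv_fields!(writer, {
    // ... existing fields ...
    // Odd type number, so older readers that don't understand it ignore it.
    (405, peer_storage_entries, optional_vec),
});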

channel.context.get_commitment_txn_number_obscure_factor(),
new_hash_map(),
None,
channel.context.channel_transaction_parameters.channel_type_features.clone(),
Collaborator:
Hum, why are we creating a stub channel for all of our existing channels on each startup? They should already exist as full ChannelMonitors so this is redundant, right?

Author:
This is used to build our own latest backup which would be distributed to our peers.


/// - [`ChannelMessageHandler`] to handle off-chain channel activity from peers
/// - [`MessageSendEventsProvider`] to similarly send such messages to peers
///
pub struct FundRecoverer<SP: Deref, L:Deref, M: Deref>
Collaborator:
Let's put this in a new module/file.

@@ -305,6 +305,55 @@ pub trait Watch<ChannelSigner: EcdsaChannelSigner> {
/// For details on asynchronous [`ChannelMonitor`] updating and returning
/// [`MonitorEvent::Completed`] here, see [`ChannelMonitorUpdateStatus::InProgress`].
fn release_pending_monitor_events(&self) -> Vec<(OutPoint, ChannelId, Vec<MonitorEvent>, Option<PublicKey>)>;

/// Watches a dummy channel identified by `funding_txo` using `monitor`.
Collaborator:
Now that we're using a different struct from ChannelManager, IMO we should use a different persistence trait.

SP::Target: SignerProvider,
L::Target: Logger
{
fn handle_open_channel(&self, _their_node_id: &PublicKey, _msg: &msgs::OpenChannel) {}
Collaborator:
We should reject open channels (see how IgnoringMessageHandler deals with different messages and ~copy that).



@@ -8272,9 +8565,15 @@ where
} else { false };
let (htlcs_to_fail, monitor_update_opt) = try_chan_phase_entry!(self,
chan.revoke_and_ack(&msg, &self.fee_estimator, &&logger, mon_update_blocked), chan_phase_entry);

let mut our_peer_storage = self.our_peer_storage.write().unwrap();
let _ = our_peer_storage.provide_secret(chan.context.channel_id(), chan.get_cur_counterparty_commitment_transaction_number() + 1, msg.per_commitment_secret);
Collaborator:
This (and all the other update methods) is actually too early. We allow ChannelMonitorUpdates to be persisted asynchronously, which means we could (a) get a new ChannelMonitorUpdate, updating our peer state, (b) send that new peer state to our peers, (c) restart without finishing persisting the ChannelMonitorUpdate (or ChannelManager), then (d) on restart we're in a perfectly fine state (we haven't sent the new channel messages to our peer, only the new state) but we'll still think we're stale and panic.

Instead, we need to process changes to the channel(s) when we complete ChannelMonitorUpdates (removing them from the pending set in Channel via monitor_updating_restored in channelmanager.rs's handle_monitor_update_completion).
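A rough sketch of that reordering, with a hypothetical OurPeerStorage::update_channel_state helper, as it might sit alongside handle_monitor_update_completion:

// Fold the channel's new state into our backup only once the corresponding
// ChannelMonitorUpdate has completed persistence; only then is it safe to
// (re-)send the encrypted blob to peers.
fn on_monitor_update_completed(&self, channel_id: ChannelId) {
    let mut our_peer_storage = self.our_peer_storage.write().unwrap();
    our_peer_storage.update_channel_state(channel_id);
}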

adi2011 force-pushed the PeerStorageFeature branch 2 times, most recently from 97dfdde to 5bb20de, on September 25, 2024 at 08:13
Aditya Sharma added 11 commits September 25, 2024 13:48
…erytime a LatestCounterpartyCommitmentTxn update is sent. It would be encrypted and sent to our peers.
… or decrypt the peer storage and send PeerStorage on every RAA and upon reconnection.
… so that we can just send a BogusChannelReestablish and close all the StubChannelMonitors and sweep the funds from the events.
}

log_trace!(logger, "Received Peer Storage from {}", log_pubkey!(counterparty_node_id));
peer_state.peer_storage = msg.data.clone();
Collaborator:
I don't think we actually should be willing to store a full 64KiB per peer. Sadly, LDK resilvers ChannelManager very often (something we're working on reducing, but we're a ways from being able to change it), and we currently store only ~1KiB per channel. If we allow all our peers to store 64KiB, we may ~50x our ChannelManager size, causing a very substantial increase in I/O. For now, we can just limit the max size we accept (maybe 1KiB?) and ignore larger messages, but in a followup we can look at some other I/O interface for storing these in a separate K-V entry.
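A sketch of the size cap in internal_peer_storage; the 1 KiB figure mirrors the suggestion above and the constant name is made up.

const MAX_PEER_STORAGE_BYTES: usize = 1024;
if msg.data.len() > MAX_PEER_STORAGE_BYTES {
    // Ignore oversized blobs rather than persisting them with ChannelManager.
    log_debug!(logger, "Ignoring oversized peer storage ({} bytes) from {}",
        msg.data.len(), log_pubkey!(counterparty_node_id));
    return;
}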

/// This update ID is used inside [`ChannelMonitorImpl`] to recognise
/// that we're dealing with a [`StubChannelMonitor`], since we require some
/// exceptions while dealing with it.
pub const STUB_CHANNEL_UPDATE_IDENTIFIER: u64 = CLOSED_CHANNEL_UPDATE_ID - 1;
Collaborator:
Let's set a flag separately. I'm working on removing CLOSED_CHANNEL_UPDATE_ID entirely.

/// This includes a timestamp to compare between two given
/// [`OurPeerStorage`] instances, and a version that defines the structure.
#[derive(Clone, PartialEq, Eq)]
pub struct OurPeerStorage {
Collaborator:
Let's move this into its own file.

/// [`ChannelMonitorUpdateStep::LatestCounterpartyCommitmentTXInfo`]
pub(crate) fn update_state_from_monitor_update(&mut self, cid: ChannelId, monitor_update: ChannelMonitorUpdate) -> Result<(),()> {
for update in monitor_update.updates.iter() {
match update {
Collaborator:
Should this logic move into StubChannel instead?

pub(crate) features: ChannelTypeFeatures,
}

impl StubChannel {
Collaborator:
To better match LDK terminology, should this be a StubChannelMonitor rather than a StubChannel?


let persist_res = self
.persister
.persist_new_channel(stub_channel_monitor.get_funding_txo().0, &stub_channel_monitor);
Collaborator:
This is gonna fail if we already have a monitor for this channel, but we should be trying to update that monitor. We may have a somewhat random mix of real (possibly stale) monitors and peer-storage (possibly stale) stubs, and we ideally need to merge all the data we have to be able to recover in the best way possible.

funding_txid_u32.wrapping_add(best_height.unwrap_or_default())
};

let partition_factor = if channel_count < 15 {
Collaborator:
Don't think we need to bother with this optimization in the recovery stuff.

let funding_outpoints = hash_set_from_iter(self.monitors.read().unwrap().keys().cloned());
let channel_count = funding_outpoints.len();
for funding_outpoint in funding_outpoints.iter() {
let monitor_lock = self.monitors.read().unwrap();
Collaborator:
It'd be nice if a lot of the code in this file were DRYd with the code elsewhere.

}

let mut monitors = self.monitors.write().unwrap();
let entry = match monitors.entry(stub_channel_monitor.get_funding_txo().0) {
Collaborator:
Somehow we need to replay blockchain data after we add a new monitor here. The flow needs to be (a) load monitor(s) from peer state, (b) replay any chain parts we need on the new monitor (c) persist it.
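Very roughly, that flow might be outlined as below; replay_chain_since is a stand-in for feeding the monitor the blocks and transactions it missed, and the persist call mirrors the one already in this PR.

// (a) build the stub monitor from the decrypted peer-storage entry
let stub_channel_monitor = ChannelMonitor::new_stub(/* ... */);
// (b) replay the chain data it missed so it can notice the funding spend
replay_chain_since(&stub_channel_monitor, stub_birth_height);
// (c) only then persist it and start watching
self.persister.persist_new_channel(stub_channel_monitor.get_funding_txo().0, &stub_channel_monitor);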

};

let persist_res = self
.persister
Collaborator:
Once we persist a stub monitor here, the ChannelManager should fail to load. This should be pretty doable - just have the ChannelManager check for the stub flag on any ChannelMonitors in read and refuse to load if any are set.
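Sketch of that read-time guard; is_stub() is a hypothetical accessor for whatever flag ends up marking stub monitors.

for (_, monitor) in args.channel_monitors.iter() {
    if monitor.is_stub() {
        // A stub means we recovered from peer storage, so the serialized
        // ChannelManager can no longer be trusted; refuse to load it.
        return Err(DecodeError::InvalidValue);
    }
}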
