
kad: Kademlia::address_failed() periodically using a lot of CPU #3498

Closed
nazar-pc opened this issue Feb 23, 2023 · 6 comments

Comments

@nazar-pc
Contributor

Summary

I don't have a clean reproduction for this, but in short: I'm getting ~100% CPU usage on the libp2p networking thread at times, and every time I pause in a debugger I get a backtrace that looks like this:

#0  __memcmp_avx2_movbe () at ../sysdeps/x86_64/multiarch/memcmp-avx2-movbe.S:84
#1  0x0000559d811ad4a0 in hashbrown::map::HashMap<K,V,S,A>::get_mut ()
#2  0x0000559d81123b41 in libp2p_kad::behaviour::Kademlia<TStore>::address_failed ()
#3  0x0000559d81115d36 in <libp2p_kad::behaviour::Kademlia<TStore> as libp2p_swarm::behaviour::NetworkBehaviour>::on_swarm_event ()
#4  0x0000559d81203cd0 in <subspace_networking::behavior::Behavior<RecordStore> as libp2p_swarm::behaviour::NetworkBehaviour>::on_swarm_event ()
#5  0x0000559d8117a1a3 in futures_util::stream::stream::StreamExt::poll_next_unpin ()
...

Clearly it is spending way too much time in address_failed, but why?
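
For context, the hot path has roughly this shape (a minimal standalone sketch with simplified stand-in types; illustrative only, not the actual libp2p-kad code): every failed address is removed by linearly scanning every active query's address list for that peer.

use std::collections::HashMap;

// Stand-ins for the real libp2p types, only to show the shape of the loop.
type PeerId = [u8; 32];
type Multiaddr = String;

struct Query {
    // Per-query map of peer -> addresses still worth dialing.
    addresses: HashMap<PeerId, Vec<Multiaddr>>,
}

// A dial error that carries many addresses calls this once per address,
// so the total work is roughly addresses x queries x list length.
fn address_failed(queries: &mut [Query], peer_id: PeerId, address: &Multiaddr) {
    for query in queries.iter_mut() {
        if let Some(addrs) = query.addresses.get_mut(&peer_id) {
            addrs.retain(|a| a != address);
        }
    }
}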

I was trying to narrow it down and applied the following patch to the libp2p sources:

--- a/protocols/kad/src/behaviour.rs	(revision 2de61da642888e3c4deac9925be90d56cdef1475)
+++ b/protocols/kad/src/behaviour.rs	(date 1677117421448)
@@ -1786,7 +1786,7 @@
         }
     }
 
-    fn address_failed(&mut self, peer_id: PeerId, address: &Multiaddr) {
+    fn address_failed(&mut self, peer_id: PeerId, address: &Multiaddr) -> usize {
         let key = kbucket::Key::from(peer_id);
 
         if let Some(addrs) = self.kbuckets.entry(&key).value() {
@@ -1817,11 +1817,16 @@
             }
         }
 
+        let mut x = 0;
         for query in self.queries.iter_mut() {
             if let Some(addrs) = query.inner.addresses.get_mut(&peer_id) {
-                addrs.retain(|a| a != address);
+                addrs.retain(|a| {
+                    x += 1;
+                    a != address
+                });
             }
         }
+        x
     }
 
     fn on_connection_established(
@@ -1940,8 +1945,17 @@
             | DialError::Transport(_)
             | DialError::NoAddresses => {
                 if let DialError::Transport(addresses) = error {
+                    let start = Instant::now();
+                    let mut x = 0;
                     for (addr, _) in addresses {
-                        self.address_failed(peer_id, addr)
+                        x += self.address_failed(peer_id, addr);
+                    }
+                    if x > 1_000_000 {
+                        warn!(
+                            "\nFound {x} entries ({} addresses) in {:?}: {error}\n",
+                            addresses.len(),
+                            start.elapsed()
+                        );
                     }
                 }

It turns out it does spend quite a bit of time there, iterating over a non-negligible number of entries over and over again:

way-too-much.txt

It takes quite a few seconds, then goes back to normal, only to repeat a minute or so later.
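
A back-of-envelope estimate shows how this multiplies (hypothetical figures for illustration only; the patch's warning only fires above 1,000,000 retained entries, so the real counts in the attached log are at least that large):

// Hypothetical, illustrative numbers: a single Transport dial error
// carrying 100 addresses, with 50 in-flight queries each retaining
// ~20 addresses for the peer, already costs on the order of
// 100 * 50 * 20 = 100_000 comparisons, and bursts of failing dials
// compound that quickly.
fn main() {
    let failed_addresses: u64 = 100;
    let active_queries: u64 = 50;
    let addrs_per_entry: u64 = 20;
    println!(
        "~{} comparisons per dial error",
        failed_addresses * active_queries * addrs_per_entry
    );
}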

Possible Solution

Not sure yet

Version

https://github.com/subspace/rust-libp2p/tree/subspace-v5, which is a fairly recent master branch with #3468 and #3474 added on top.

Would you like to work on fixing this bug?

Yes

@thomaseizinger
Contributor

Thanks for the report.

How many peers are you connected to?

@nazar-pc
Contributor Author

~30 ± a few

@thomaseizinger
Contributor

That def sounds like a bug then.

@nazar-pc

This comment was marked as off-topic.

@thomaseizinger
Contributor

Any chance you can run a flamegraph and share it here?

@nazar-pc
Contributor Author

I have to close this for now. After changes in the app, the issue was gone. Either compiler inlining made the backtrace misleading and the issue wasn't in libp2p at all, or it was triggered under specific conditions I am no longer able to reproduce.

I'll reopen with more details if/when I can reproduce it again.
