feat(discovery): discover triggered too frequently #3550

guillaumemichel · 2024-07-03T13:10:34Z

If a node hasn't reached its quota of peers, it sends a new discover request every second. There is no guarantee that a discover request can complete within 1 second.

celestia-node/share/p2p/discovery/discovery.go

Lines 225 to 229 in accb058

    
           case <-t.C: 
        
           	if !d.discover(ctx) { 
        
           		// rerun discovery if the number of peers hasn't reached the limit 
        
           		continue 
        
           	}

This interval seems too aggressive and should probably be increased (e.g., to 1 or even 5 minutes?).

renaynay · 2024-07-04T09:50:10Z

@guillaumemichel I agree that the retry interval is a bit too aggressive and could be increased, but the actual deadline for FindPeers is a minute.

celestia-node/share/p2p/discovery/discovery.go

Lines 286 to 293 in accb058

    
           findCtx, findCancel := context.WithTimeout(ctx, findPeersTimeout) 
        
           defer func() { 
        
           	// some workers could still be running, wait them to finish before canceling findCtx 
        
           	wg.Wait() //nolint:errcheck 
        
           	findCancel() 
        
           }() 
        
           peers, err := d.disc.FindPeers(findCtx, d.tag)

guillaumemichel · 2024-07-04T09:59:51Z

So basically there will always be a lookup running until enough peers are discovered.

Wondertan · 2024-07-04T12:15:19Z

> So basically there will always be a lookup running until enough peers are discovered.

IIRC, that was the intention, but as you mentioned, that might be too aggressive.

walldiss · 2024-07-17T19:24:39Z

Sometimes, a node fails to discover any peers on startup. Without discovered peers, the node is unable to perform most of its P2P logic. The downside of a longer cooldown is that it will cause the node application to halt for the cooldown duration in such cases. If some peers are discovered, the node should still aim to find at least 5 (our default) to rely less on the performance and availability of a single peer.

The idea behind aggressive retries is to bootstrap the node into a stable network condition as soon as possible, perhaps at the cost of more resources spent on aggressive discovery. So I think the defaults should stay low.

I think it might be valuable for some users to have the ability to increase retry/timeout values if they are less concerned about the node being connected to the FN network.

github-actions bot added the needs:triage and external labels Jul 3, 2024

ramin self-assigned this Jul 4, 2024

ramin added the kind:misc label Jul 4, 2024

ramin removed the needs:triage label Jul 4, 2024

ramin linked a pull request Jul 11, 2024 that will close this issue

misc(share/p2p): reduce frequency of discovery retries #3561

Open

renaynay added the v0.15.0 label Jul 16, 2024

renaynay removed the v0.15.0 label Aug 5, 2024
