Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: not dial all known peers in parallel on startup #698

Merged
merged 9 commits into from
Jul 14, 2020

Conversation

vasco-santos
Copy link
Member

This PR aims to block large numbers of dials on a peer restart, if this peer already new a large number of peers in the past.

It checks the minPeers as the maybeConnect function used by the discovery event handlers. This was also using the minPeers, but as all the dials were put in parallel, the connectionManager.size would not be updated to avoid further dials.

When we work on connectionManager, we should work on better strategies for this, and likely have them as configurable

src/index.js Outdated
const minPeers = this._options.connectionManager.minPeers || 0

// TODO: this should be ordered by score on PeerStoreV2
this.peerStore.peers.forEach(async (peer) => {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jacobheun I started by doing the dials in serial as the token based dial will already use parallel dials for several multiaddrs the peer we are targeting might have. Also, the discovery services kicked in, and their new discoveries might be more important than the older ones (this is where having scores for the older peers would help a lot too).

Another solution I thought is to do the minPeers parallel dials and then do the rest in serial (to compensate possible offline peers). Let me know what do you think about the pros and cons of both approaches.

@vasco-santos vasco-santos changed the title fix: not dial all known peers on startup fix: not dial all known peers in parallel on startup Jul 9, 2020
src/index.js Outdated
* @returns {void}
*/
_maybeConnectPeerStorePeers () {
if (!this._config.peerDiscovery.autoDial) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this should deserve its own config in the peerStore instead? 🤔

@vasco-santos vasco-santos force-pushed the fix/not-dial-all-know-peers-on-startup branch from e25915f to 695f77c Compare July 9, 2020 09:22
@jacobheun
Copy link
Contributor

I think the more general case that would be helpful here is going to the PeerStore when our connection count is below the low watermark. Right now autoDial is only ever triggered on peer discovery. I don't think we should be mass emitting peer discovered events on boot for persisted peers. Application code can check the peer store.

I think we should add a counterpart of the ConnectionManager_.maybeDisconnectOne that deals with proactive connecting. If our connection count is below our low watermark, we should attempt to dial peers from our peerstore (ideally ranked, but we can do that later) that we are not connected to, until we are at or above the low watermark.
On startup, we can just start this recursive call. We could batch the requests to reduce initial connection time. This should also get called when a peer disconnects and we drop below the low watermark.

@vasco-santos
Copy link
Member Author

I don't think we should be mass emitting peer discovered events on boot for persisted peers. Application code can check the peer store.

Totally agree, but I don't think this is the correct timing for that. We are doing that in 0.27 and 0.28 and this will be a breaking change for who expects this! I agree we include this in 0.30. I can open an issue for it.

I think the more general case that would be helpful here is going to the PeerStore when our connection count is below the low watermark.

This PR does that for boot and I agree with this, but with some other things in consideration. When having a more intelligent PeerStore/ConnectionManager, we should select peers based on their score and the protocols we are running.

I think we should add a counterpart of the ConnectionManager_.maybeDisconnectOne that deals with proactive connecting. If our connection count is below our low watermark, we should attempt to dial peers from our peerstore (ideally ranked, but we can do that later) that we are not connected to, until we are at or above the low watermark.

It seems a good idea to do it on disconnect, but just iterating blindly in the PeerStore and connect will not likely bring much value. But I think we can include it here as it is quite simple.

On startup, we can just start this recursive call. We could batch the requests to reduce initial connection time. This should also get called when a peer disconnects and we drop below the low watermark.

This seems the way to go for me, but this should happen for the 0.28.x line for me.

@jacobheun
Copy link
Contributor

Totally agree, but I don't think this is the correct timing for that. We are doing that in 0.27 and 0.28 and this will be a breaking change for who expects this!

We can talk about this separately, it doesn't need to change for this, but it should change soon because we're going to hit scaling issues. If I know thousands of peers when I boot, that's going to get nasty real quick.

It seems a good idea to do it on disconnect, but just iterating blindly in the PeerStore and connect will not likely bring much value.

Actually, we do have some metrics we could use. Instead of looking at ALL peers in the peer store, for now we could prioritize peers we have keys/protocols for. Both should indicate a previous direct exchange, although keys might have come indirectly (i cant recall which subsystems we might have added this to in JS).

@vasco-santos vasco-santos marked this pull request as ready for review July 9, 2020 13:46
@@ -206,6 +210,7 @@ class ConnectionManager extends EventEmitter {
this.connections.delete(peerId)
this._peerValues.delete(connection.remotePeer.toB58String())
this.emit('peer:disconnect', connection)
this._maybeConnectN()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If a disconnect occurs and the previous call to _maybeConnectN happens we're going to run two calls in parallel, we should avoid this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you mean the _maybeConnectN in start?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can potentially have it running from the start call while trying to dial a bunch of peers and one of them meanwhile disconnected. I can add a verification for wether we are currently running it

Comment on lines 292 to 293
this._isTryingToConnect = true

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can go away

Suggested change
this._isTryingToConnect = true

src/connection-manager/index.js Show resolved Hide resolved
}
}

this._maybeConnectTimeout = setTimeout(this._maybeConnectN, this._options.maybeConnectInterval)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use retimer for this. We're already using it and it will cut down on multiple timers in the background.

},
connectionManager: {
minConnections,
maybeConnectInterval
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be autoDialInterval. Since it's the wrong param, the default should be 10s, it's odd this test is passing. I assume it's actively in the process of connecting when the first check is done.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I debugged it earlier. It was failing sometimes before I add the interval param. I will fix it now

Copy link
Contributor

@jacobheun jacobheun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. We'll look at cleaning up this connection manager flow when we tackle the v2 work for that (tentatively 0.30/0.31).

@jacobheun jacobheun merged commit 9ccab40 into master Jul 14, 2020
@jacobheun jacobheun deleted the fix/not-dial-all-know-peers-on-startup branch July 14, 2020 14:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants