Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

periodically reconnect to bootstrap peers (supervise bootstrap connections) #416

Closed
btc opened this issue Dec 8, 2014 · 13 comments · Fixed by #422
Closed

periodically reconnect to bootstrap peers (supervise bootstrap connections) #416

btc opened this issue Dec 8, 2014 · 13 comments · Fixed by #422
Labels
kind/bug A bug in existing code (including security flaws)

Comments

@btc
Copy link
Contributor

btc commented Dec 8, 2014

context
The network relies on bootstrap peers to help new nodes get acquainted. When bootstrap peers go down (eg. for maintenance), other nodes on the network lose these important connections. When bootstrap peers finally rejoin, other nodes must restart in order to reconnect. It would be great for nodes to supervise their bootstrap connections... periodically reconnecting to bootstraps whose connections have been lost.

concern
Right now, connection initialization is performed with route.Connect(ctx, peer). Is this the preferred method going forward? The alternative is to use the Network's n.DialPeer(ctx, peer) directly.

If route.Connect is preferred and the Ping action/side-effect is important, should we be concerned about misuse of the network.DialPeer method?

@jbenet

@btc btc added the kind/bug A bug in existing code (including security flaws) label Dec 8, 2014
@whyrusleeping
Copy link
Member

The Ping is important to keep the DHTs routing table up to date. But for reconnects, it probably doesnt matter much

@btc
Copy link
Contributor Author

btc commented Dec 8, 2014

The Ping is important to keep the DHTs routing table up to date.

So I suppose it is critical/required/necessary for the first connection to be performed using Routing, yeah?

@whyrusleeping
Copy link
Member

Yeah, although, Im unhappy with the way that all works... if you want to slightly rearrange things, i would be fully supportive.

@jbenet
Copy link
Member

jbenet commented Dec 8, 2014

So I suppose it is critical/required/necessary for the first connection to be performed using Routing, yeah?

In reality, bootstrap is a routing problem, not a network problem. The underlying network (net, swarm, etc) doesn't actually care about having connections open. (it could live with a model of ephemeral connections for example). It's the routing system of choice (in our case, Kademlia routing) that wants bootstrapping. (A different routing system might use a completely static topology known a priori).

So the Bootstrap process is Routing system dependent. But note that it shouldn't be part of the routing system interface per-se. Bootstrapping is implementation dependent. Kademlia routing only needs peer addresses ((peer.ID, ma.Multiaddr) pairs for now, but once Multiaddr supports ipfs addresses, just ma.Multiaddr). This means that the *IpfsDHT should be initialized with its bootstrap info:

dht, err := dht.NewDHT(..., bootstrapPeers)

And, the bootstrap/reconnection alg should be something like this.

// when starting up (in core.go, somewhere)
go dht.Boostrap()
// in dht somewhere
var minConnThreshold = 1 // later set to something like len(bootstrapPeers)/2

// when routing table connections fall below some threshold
// (note, right now we don't get this event exactly. we expel nodes from the table
// when they fail to respond (and we keep pinging for liveness).
func (dht *IpfsDHT) onRoutingTableExpulsion() {
  if len(routingTablesSize()) < minConnThreshold {
    go dht.Bootstrap()
  }
}

func (dht *IpfsDHT) Bootstrap() {
  ps := dht.randomSubsetOfBootstrapPeers()
  for _, p := range ps {
    n.DialPeer(p)
  }
}

NB: peers don't need to (and in the long run shouldnt) keep active connections to the bootstrap nodes always. This makes bootstrap nodes central points of failure. while the network is small, that's fine, but as the network increases in size and is up all the time, we want bootstrap connections only to help introduce nodes to the network at the beginning.


It's becoming clearer and clearer that the DHT and the RoutingTable are actually two separate (layered) entities. In the future we should separate these. And that there's two parts to routing: ContentRouting and PeerRouting. Kademlia happens to implement both.

Perhaps a good way to proceed is:

  • separate out the routing table + PeerRouting part of the dht into something like KademliaRouting
  • make IpfsDHT uses a KademliaRouting, and layers the DHT part, storing data.

But, this is an "after alpha" thing. We should push out what's here now, with all of this together. We can patch in the bootstrapping as discussed above.

@whyrusleeping
Copy link
Member

As we move towards larger networks, we can avoid the central point of failure problem by having each node keep track of its own set of 'trusted peers' that it has had long, successful relationships with, and can use those nodes to bootstrap.

As to moving the RoutingTable out of the DHT (or at least separating them) Ive been hoping we would do that for some time now, I have other services ive wanted to write (mesh messaging among them) that would benefit greatly from having access to the RoutingTable's NearestPeer methods

@jbenet
Copy link
Member

jbenet commented Dec 8, 2014

As to moving the RoutingTable out of the DHT (or at least separating them) Ive been hoping we would do that for some time now, I have other services ive wanted to write (mesh messaging among them) that would benefit greatly from having access to the RoutingTable's NearestPeer methods

yeah ran into the same thing

@btc
Copy link
Contributor Author

btc commented Dec 8, 2014

But, this is an "after alpha" thing. We should push out what's here now, with all of this together. We can patch in the bootstrapping as discussed above.

To avoid adding complexity that will make it harder to change the DHT when we decide to get around to this, I'm going to implement a simple stop-gap solution in userspace (core) using public DHT methods.

It's basically initConnections with some supervision.

Impl'd here: 00a1a12

Let me know if any objections to this short-term approach.

@btc btc self-assigned this Dec 8, 2014
@btc btc added the status/in-progress In progress label Dec 8, 2014
@jbenet
Copy link
Member

jbenet commented Dec 8, 2014

@maybebtc okay, the userland-supervision-for-now approach sgtm. beautifully small, too.

@btc
Copy link
Contributor Author

btc commented Dec 8, 2014

note that if the list of bootstrap peers changes at any point, the list used here becomes stale.

Adding the repo to the core may address this staleness.

Perhaps the repo can expose a Config() method that fetches the config from disk?

@jbenet
Copy link
Member

jbenet commented Dec 8, 2014

@maybebtc maybe we should move Config into the Repo entirely. and construct the Repo on init.

@btc
Copy link
Contributor Author

btc commented Dec 8, 2014

@maybebtc maybe we should move Config into the Repo entirely. and construct the Repo on init.

Indeed.

How about I open a PR for supervision with this known limitation (stale entries).

And right after, begin the repo transition right. I should be able to finish the repo changes in a few hours, but it would be nice to land supervision and move it out of the way.

  • I'm fixing this concern ATM:
if len(activeConnections) < threshold {
  // connect to a randomly chosen _subset_ of peers
}

@jbenet
Copy link
Member

jbenet commented Dec 8, 2014

How about I open a PR for supervision with this known limitation (stale entries).
And right after, begin the repo transition right.

SGTM

@btc
Copy link
Contributor Author

btc commented Dec 9, 2014

#422

@btc btc closed this as completed in #422 Dec 9, 2014
@btc btc removed the status/in-progress In progress label Dec 9, 2014
@btc btc removed their assignment Mar 30, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug A bug in existing code (including security flaws)
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants