
gossipsub #67

Merged (50 commits into master, Jul 11, 2018)
Conversation

@vyzo (Collaborator) commented Feb 20, 2018

Implements the gossipsub protocol; see https://github.com/vyzo/gerbil-simsub for a high-level literate specification.

TODO:

  • tests tests tests!
  • subscription announce messages, sent by base PubSub, need to be reliable so that we accurately track peers in a topic

@vyzo (Collaborator, Author) commented Feb 21, 2018

  • Fixed a minor issue: moved the history shift to the end of the heartbeat where it belongs (otherwise gossip would only cover 2 windows instead of the intended 3); see the sketch after this list.
  • Added tests for mcache.
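
To illustrate why the shift order matters, here is a minimal, hypothetical sketch of a message cache with history windows; the names (MessageCache, Put, GossipIDs, Shift) are illustrative and not necessarily the mcache API in this PR. Gossip advertises the IDs in the most recent windows, so shifting before the gossip is emitted would cover one window less than intended.

    package main

    import "fmt"

    // Hypothetical sketch of a message cache with a fixed number of history windows.
    type MessageCache struct {
        history [][]string        // message IDs per window, newest window first
        msgs    map[string]string // id -> message (payload simplified to a string)
    }

    func NewMessageCache(windows int) *MessageCache {
        return &MessageCache{
            history: make([][]string, windows),
            msgs:    make(map[string]string),
        }
    }

    // Put records a freshly seen message in the newest window.
    func (mc *MessageCache) Put(id, msg string) {
        mc.msgs[id] = msg
        mc.history[0] = append(mc.history[0], id)
    }

    // GossipIDs returns the IDs in the most recent gossipLen windows; these are
    // what would be advertised as gossip.
    func (mc *MessageCache) GossipIDs(gossipLen int) []string {
        var ids []string
        for _, window := range mc.history[:gossipLen] {
            ids = append(ids, window...)
        }
        return ids
    }

    // Shift ages the cache by one window. Calling it at the end of the heartbeat
    // means the gossip emitted during that heartbeat still covers the full
    // gossipLen windows; shifting first would effectively advertise one less.
    func (mc *MessageCache) Shift() {
        oldest := mc.history[len(mc.history)-1]
        for _, id := range oldest {
            delete(mc.msgs, id)
        }
        mc.history = append([][]string{nil}, mc.history[:len(mc.history)-1]...)
    }

    func main() {
        mc := NewMessageCache(5)
        mc.Put("m1", "hello")
        mc.Shift()
        mc.Put("m2", "world")
        fmt.Println(mc.GossipIDs(3)) // [m2 m1]
    }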

@vyzo (Collaborator, Author) commented Feb 21, 2018

Added a small zoo of basic gossipsub tests, including a mixed mode test with floodsub peers.

@vyzo (Collaborator, Author) commented Feb 21, 2018

Made a small tweak: sources that have not joined the mesh also emit gossip.

@vyzo (Collaborator, Author) commented Feb 21, 2018

Removed a potentially harmful topic-membership check for mesh peers; it could be inconsistent if the ANNOUNCE was lost (or reordered after a GRAFT on retry).

gossipsub.go Outdated

    if len(peers) < GossipSubDlo {
        ineed := GossipSubD - len(peers)
        plst := gs.getPeers(topic, ineed, func(p peer.ID) bool {
            _, ok := peers[p]
Contributor:

this looks like we're filtering something, mind adding a comment on what we're filtering on?

Collaborator Author:

We are filtering out the peers that are not already in our peer list; I can add a comment to that effect.

gossipsub.go Outdated

    }

    func (gs *GossipSubRouter) heartbeatTimer() {
        ticker := time.NewTicker(1 * time.Second)
Contributor:

A heartbeat every second? Does every heartbeat send messages?

Collaborator Author:

Not necessarily.
It will send GRAFT/PRUNE if it needs to adjust the overlay, and schedule gossip for piggybacking.
If the gossip is not sent by the next heartbeat, then it will be flushed in its own messages.
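
A rough sketch of that piggyback-or-flush idea, using hypothetical type and field names (ControlIHave, Router.gossip, RPC) rather than the exact ones in this PR: pending gossip is attached to any message we were going to send to the peer anyway, and whatever is still pending when the next heartbeat runs is sent in a control-only message of its own.

    package main

    // ControlIHave and RPC are simplified, hypothetical stand-ins for the
    // protobuf control messages used by the router.
    type ControlIHave struct {
        Topic      string
        MessageIDs []string
    }

    type RPC struct {
        Control []ControlIHave
    }

    type Router struct {
        gossip map[string][]ControlIHave // peer ID -> pending IHAVE gossip
    }

    // piggybackGossip attaches (and clears) pending gossip when a message is
    // about to be sent to this peer anyway.
    func (r *Router) piggybackGossip(peer string, out *RPC) {
        if ihave, ok := r.gossip[peer]; ok {
            out.Control = append(out.Control, ihave...)
            delete(r.gossip, peer)
        }
    }

    // flushGossip runs at the next heartbeat: anything that did not get
    // piggybacked is flushed in its own control-only message.
    func (r *Router) flushGossip(send func(peer string, out *RPC)) {
        for peer, ihave := range r.gossip {
            send(peer, &RPC{Control: ihave})
            delete(r.gossip, peer)
        }
    }

    func main() {}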

        }
    }

    func (gs *GossipSubRouter) heartbeat() {
Contributor:

This function feels too long. Mind trying to break it up a little bit?

Collaborator Author:

Sure, I'll refactor a bit, although I would like to keep the main logic together.

Collaborator Author:

I can easily refactor out the part that sends the GRAFT/PRUNE with coalescing, which is also incidental to the heartbeat logic.
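
A minimal sketch of what that refactor might look like, with hypothetical helper types (not the exact code in this PR): collect the per-topic GRAFTs and PRUNEs computed during the heartbeat and send at most one coalesced control message per peer.

    package sketch

    // Hypothetical, simplified control-message types; the real router uses
    // protobuf messages.
    type ControlGraft struct{ Topic string }
    type ControlPrune struct{ Topic string }

    type ControlMessage struct {
        Graft []ControlGraft
        Prune []ControlPrune
    }

    // sendGraftPrune takes the topics to graft and prune per peer (as computed
    // by the heartbeat) and emits one coalesced control message per peer.
    func sendGraftPrune(
        tograft map[string][]string, // peer -> topics to GRAFT
        toprune map[string][]string, // peer -> topics to PRUNE
        send func(peer string, ctl *ControlMessage),
    ) {
        for peer, topics := range tograft {
            ctl := &ControlMessage{}
            for _, topic := range topics {
                ctl.Graft = append(ctl.Graft, ControlGraft{Topic: topic})
            }
            // coalesce any prunes destined for the same peer
            for _, topic := range toprune[peer] {
                ctl.Prune = append(ctl.Prune, ControlPrune{Topic: topic})
            }
            delete(toprune, peer)
            send(peer, ctl)
        }
        // peers that only have prunes
        for peer, topics := range toprune {
            ctl := &ControlMessage{}
            for _, topic := range topics {
                ctl.Prune = append(ctl.Prune, ControlPrune{Topic: topic})
            }
            send(peer, ctl)
        }
    }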

Collaborator Author:

done

@vyzo (Collaborator, Author) commented Feb 22, 2018

Implemented retry of ANNOUNCE messages in pubsub, now that we have a test that exercises the relevant code paths.

    }

    // wait for heartbeats to build mesh
    time.Sleep(time.Second * 2)
Contributor:

Why do we always have to wait for heartbeats? Is it because we don't send out subscription notices immediately now?

Collaborator Author:

Well, subscription notices are still sent immediately, just retried if they fail.

Now that I think of it, we can probably greatly reduce this delay in almost all the tests -- maybe down to say 100ms, just enough for announcements to go out.

My rationale for this delay was to wait a couple of heartbeats to avoid interference from nodes that have joined but haven't seen any peer announcements yet. I also wanted to avoid interference from the overlay construction, but the overlay should still be connected after the announcements go out and nodes pick their peers.

@vyzo (Collaborator, Author) Feb 27, 2018:

Actually, there is a genuine concern that is an artifact of the concurrent Join from subscriptions.
If we subscribe all the nodes together, then they won't have any peer announcements when they do the Join, and they'll have to wait a heartbeat before they start adding peers to the mesh.
We can avoid this if we add a small (say 10ms) delay after each subscription.

Contributor:

Hrm... I'm very skeptical of 'fixing' things by adding delays.

@vyzo (Collaborator, Author) Feb 28, 2018:

Actually, in most of the tests the subscriptions are created before connecting the network, which means that all nodes start empty and build the mesh purely in the heartbeat.


@vyzo It's really really bad practice to use delays like this, especially without a select statement to escape out of it if the context is canceled.

@vyzo (Collaborator, Author) Mar 6, 2018:

@paralin this is just a test that needs a delay -- and there is nothing to cancel the context so a select would be totally useless.


@vyzo Gotcha, I notice now it's a test.

@whyrusleeping (Contributor):

Could we add some tests that check the number of messages sent, and maybe a way of tracking the overall efficiency of the implementation (like how many nodes received the same message from multiple peers), perhaps in terms of bandwidth overhead? For example: received 500 bytes for every 200 bytes of useful data at a message size of 100 bytes.
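
As an illustration of such a metric, here is a minimal, hypothetical sketch (none of these counters exist in the PR; the names are made up) of how an integration test could track received-versus-useful bytes and duplicate deliveries:

    package sketch

    // DeliveryStats is a hypothetical per-node counter for an integration test:
    // overhead = bytes received (duplicates included) / bytes of first-time
    // ("useful") deliveries.
    type DeliveryStats struct {
        BytesReceived int64 // all pubsub message bytes received
        UsefulBytes   int64 // bytes of messages delivered for the first time
        Duplicates    int64 // messages received again from another peer
    }

    func (s *DeliveryStats) Record(msgSize int, firstDelivery bool) {
        s.BytesReceived += int64(msgSize)
        if firstDelivery {
            s.UsefulBytes += int64(msgSize)
        } else {
            s.Duplicates++
        }
    }

    // Overhead returns the received-to-useful ratio; e.g. 2.5 means 500 bytes
    // received for every 200 bytes of useful data.
    func (s *DeliveryStats) Overhead() float64 {
        if s.UsefulBytes == 0 {
            return 0
        }
        return float64(s.BytesReceived) / float64(s.UsefulBytes)
    }

A test could then fail if the gossipsub ratio exceeds the floodsub ratio measured on the same topology, along the lines of the threshold discussed below.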

@vyzo (Collaborator, Author) commented Feb 27, 2018

Hrm, these are tests I would like to have too, but I'm not sure they are really unit tests.
What would be the conditions for test failure?

@whyrusleeping (Contributor):

Yeah, they are definitely integration tests; no need to write them as unit tests. We should run these tests for floodsub and for gossipsub, compare the results, and choose some failure threshold, i.e. gossipsub should not use more bandwidth than floodsub.

@vyzo (Collaborator, Author) commented Mar 6, 2018

We have developed a conflict, so I will rebase.

@vyzo (Collaborator, Author) commented Mar 6, 2018

Rebased; also added a context done check that was missing in the announce retry goroutine.
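
A minimal sketch of the pattern being described, with hypothetical names and retry policy (the actual pubsub retry code may differ): retry the announcement a bounded number of times, but bail out as soon as the context is done.

    package sketch

    import (
        "context"
        "time"
    )

    // announceRetry is a hypothetical announce retry loop that respects
    // context cancellation.
    func announceRetry(ctx context.Context, announce func() error) {
        for i := 0; i < 5; i++ {
            select {
            case <-time.After(time.Second):
                // wait a bit before retrying
            case <-ctx.Done():
                // the check in question: without it, the goroutine would keep
                // retrying (and leak) after the pubsub instance shuts down
                return
            }
            if err := announce(); err == nil {
                return
            }
        }
    }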

@vyzo (Collaborator, Author) commented Mar 6, 2018

The TestGossipsubControlPiggyback test would occasionally hang at line 704 because of #69, so I added a fix for that issue.

@ORBAT commented Mar 6, 2018

Hi folks! Any particular reason you went with a custom protocol instead of something built on Chord/Pastry/PolderCast etc?

@vyzo (Collaborator, Author) commented Mar 6, 2018

Hrm, it seems the fix lost the coverage for control piggybacking, probably because of the slowdown from the Errorf logging.
I will downgrade that to Infof.

@vyzo (Collaborator, Author) commented Mar 6, 2018

> Hi folks! Any particular reason you went with a custom protocol instead of something built on Chord/Pastry/PolderCast etc?

@ORBAT Several reasons: simplicity of implementation, robustness, and, perhaps most important of all, backwards compatibility with floodsub so that we can deploy easily.

@vyzo (Collaborator, Author) commented Mar 6, 2018

And control piggyback coverage is back, at least for GRAFT.

@whyrusleeping (Contributor):

@ORBAT It looks like PolderCast didn't make it into our pubsub research reading list: https://ipfs.io/ipfs/QmNinWNHd287finciBwbgovkAqEBQKvnys1W26sY8uupc5/pubsub%20reading%20list.pdf
Likely because the paper costs $30 to read.

In any case, we've done a pretty thorough review of the problem space before we arrived at our version zero protocol, floodsub, with the idea that it is the base-layer protocol and provides very few guarantees. This code, which we're calling gossipsub, is an iterative improvement over floodsub that essentially only adds fairly simple tree pruning via gossip. Simplicity and ease of implementation are very important for us; gossipsub can be implemented in 150 lines of Scheme and not too many lines of Go.

That said, this is still under review. Review of the protocol and/or implementation is very much welcome.

@Stebalien (Member):

I've rebased but I'm having some trouble reproducing the issue. I'm currently running the test in a loop to see if that gets me anywhere.

@Stebalien (Member):

No dice.

@mhchia commented Jun 13, 2018

@whyrusleeping @Stebalien
Sorry for pointing out the wrong issue.
It looks like the problem is in my local environment.
I will figure out what's wrong here.
Thanks a lot for the help!

@jamesray1 (Contributor):

@whyrusleeping I just read libp2p/interop#1. Having a daemon will of course be useful, although not having to depend on code in Go is preferable, and JSON tests are needed.

@whyrusleeping merged commit b53a056 into master on Jul 11, 2018.
@ghost removed the "in progress" label on Jul 11, 2018.
@whyrusleeping deleted the feat/gossipsub branch on Jul 11, 2018 at 09:16.
@whyrusleeping (Contributor):

I had no reason not to merge this, so I did. Next step: putting it behind a flag in ipfs.

@daviddias (Member):

@whyrusleeping shouldn't this be a separate pubsub implementation so that folks can pick the pubsub implementation to use?

Will this PR make PubSub in go-ipfs not interop with js-ipfs?

@daviddias (Member) commented Jul 12, 2018

@whyrusleeping just confirmed that this package would be better named go-pubsub, and then we would have two others, go-floodsub and go-gossipsub, to plug in here; but refactoring things in Go is hard, so that will happen later.

Interop remains.

@jamesray1 (Contributor) commented Jul 16, 2018

I can't find any mention of the DHT (looking in relation to the statement "The initial contact nodes can be obtained via rendezvous with DHT provider records" here). Will this be done in a separate interface (one that uses the DHT in libp2p, as well as gossipsub)?

Also, you really should use named constants instead of literals.

Should we also specify a common source of randomness for interoperability?

@mhchia commented Jul 16, 2018

@jamesray1
Maybe it is because we can use a different routing mechanism in the underlying overlay, not necessarily the DHT? In that case, "The initial contact nodes can be obtained via rendezvous with DHT provider records" might only be an example.

@jamesray1 (Contributor):

@mhchia, sure, that's fine.

@whyrusleeping (Contributor):

@diasdavid

> but refactoring things in go is hard so that will happen later.

It's more "extracting things into multiple packages in Go is annoying to do when you might be changing things in both really soon".

@jamesray1 @mhchia Yeah, the DHT is only an example. You can use any means to rendezvous. Take a look at our rendezvous spec proposal for ideas towards a more specialized way of doing rendezvous.

@jamesray1 (Contributor) commented Jul 24, 2018

What duration should we use for request timeouts?

Context: implementing a system config for Kademlia to use to get nodes.

https://github.com/libp2p/rust-libp2p/blob/7507e0bfd9f11520f2d6291120f1b68d0afce80a/kad/src/high_level.rs#L36

As for the timeout duration, according to RabbitMQ it should be twice the heartbeat interval, which is 1 s in this Go implementation, so based on that it would be 2 s. However, later on the same page it says a timeout of 5 to 20 s is optimal.

My guess is to use 40000 s for kbuckets_timeout (the duration after which a node in the k-buckets needs to be pinged again), but I'm not really sure; perhaps the spec and this implementation should also define initialization?

I'll look further into this.

https://www.kth.se/social/upload/516479a5f276545d6a965080/3-kademlia.pdf says tRefresh is 3600 s, after which an otherwise unaccessed bucket must be refreshed, which is supported by http://www.scs.stanford.edu/%7Edm/home/papers/kpos.pdf, but this isn't explicitly the same as the duration after which a node needs to be pinged again, although node IDs are stored in each k-bucket.

OK, at the moment I'm selecting:

    /// tRefresh in Kademlia implementations, sources:
    /// http://xlattice.sourceforge.net/components/protocol/kademlia/specs.html#refresh
    /// https://www.kth.se/social/upload/516479a5f276545d6a965080/3-kademlia.pdf
    /// 1 hour
    kbuckets_timeout: Duration::from_secs(60 * 60),
    /// go gossipsub uses 1 s:
    /// https://github.com/libp2p/go-floodsub/pull/67/files#diff-013da88fee30f5c765f693797e8b358dR30
    /// However, https://www.rabbitmq.com/heartbeats.html#heartbeats-timeout uses 60 s, and
    /// https://gist.github.com/gubatron/cd9cfa66839e18e49846#routing-table uses 15 minutes.
    /// Let's make a conservative selection and choose 15 minutes for an alpha release.
    request_timeout: Duration::from_secs(15 * 60),

    }

    func (gs *GossipSubRouter) handleIWant(ctl *pb.ControlMessage) []*pb.Message {
        ihave := make(map[string]*pb.Message)
Member:

Does this need to be a map, or could it be a slice? Is it a map to deduplicate message IDs?

Collaborator Author:

Yes, it needs to deduplicate.
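
A minimal sketch of the deduplication pattern under discussion, with simplified types (the real handleIWant works on protobuf control messages): keying by message ID means a message requested more than once in the same IWANT is only returned once, which a plain slice would not guarantee.

    package sketch

    // respondToIWant is a hypothetical, simplified version of IWANT handling.
    func respondToIWant(wantedIDs []string, lookup func(id string) (msg string, ok bool)) []string {
        ihave := make(map[string]string)
        for _, id := range wantedIDs {
            if msg, ok := lookup(id); ok {
                ihave[id] = msg // a duplicate ID just overwrites the same entry
            }
        }
        out := make([]string, 0, len(ihave))
        for _, msg := range ihave {
            out = append(out, msg)
        }
        return out
    }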
