
gossipsub #67

Merged (50 commits into master, Jul 11, 2018)
Conversation

@vyzo (Collaborator) commented Feb 20, 2018

Implements the gossipsub protocol; see https://github.com/vyzo/gerbil-simsub for a high-level literate specification.

TODO:

  • tests tests tests!
  • subscription announce messages, sent by base PubSub, need to be reliable so that we accurately track peers in a topic

@vyzo (Collaborator, Author) commented Feb 21, 2018

  • Fixed a minor issue: moved the history shift to the end of the heartbeat where it belongs (otherwise gossip would only cover 2 windows instead of the intended 3); see the sketch after this list.
  • Added tests for mcache.
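
To illustrate why the shift order matters, here is a minimal, hypothetical sketch of a message cache with history windows; the names (MessageCache, Put, GossipIDs, Shift) are illustrative and not necessarily the mcache API in this PR. Gossip advertises the IDs in the most recent windows, so shifting before the gossip is emitted would cover one window less than intended.

    package main

    import "fmt"

    // Hypothetical sketch of a message cache with a fixed number of history windows.
    type MessageCache struct {
        history [][]string        // message IDs per window, newest window first
        msgs    map[string]string // id -> message (payload simplified to a string)
    }

    func NewMessageCache(windows int) *MessageCache {
        return &MessageCache{
            history: make([][]string, windows),
            msgs:    make(map[string]string),
        }
    }

    // Put records a freshly seen message in the newest window.
    func (mc *MessageCache) Put(id, msg string) {
        mc.msgs[id] = msg
        mc.history[0] = append(mc.history[0], id)
    }

    // GossipIDs returns the IDs in the most recent gossipLen windows; these are
    // what would be advertised as gossip.
    func (mc *MessageCache) GossipIDs(gossipLen int) []string {
        var ids []string
        for _, window := range mc.history[:gossipLen] {
            ids = append(ids, window...)
        }
        return ids
    }

    // Shift ages the cache by one window. Calling it at the end of the heartbeat
    // means the gossip emitted during that heartbeat still covers the full
    // gossipLen windows; shifting first would effectively advertise one less.
    func (mc *MessageCache) Shift() {
        oldest := mc.history[len(mc.history)-1]
        for _, id := range oldest {
            delete(mc.msgs, id)
        }
        mc.history = append([][]string{nil}, mc.history[:len(mc.history)-1]...)
    }

    func main() {
        mc := NewMessageCache(5)
        mc.Put("m1", "hello")
        mc.Shift()
        mc.Put("m2", "world")
        fmt.Println(mc.GossipIDs(3)) // [m2 m1]
    }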

@vyzo (Collaborator, Author) commented Feb 21, 2018

Added a small zoo of basic gossipsub tests, including a mixed mode test with floodsub peers.

@vyzo (Collaborator, Author) commented Feb 21, 2018

Made a small tweak: sources that have not joined the mesh also emit gossip.

@vyzo (Collaborator, Author) commented Feb 21, 2018

Removed a potentially harmful topic-membership check for mesh peers; it could be inconsistent if the ANNOUNCE was lost (or reordered after a GRAFT on retry).

gossipsub.go Outdated

    if len(peers) < GossipSubDlo {
        ineed := GossipSubD - len(peers)
        plst := gs.getPeers(topic, ineed, func(p peer.ID) bool {
            _, ok := peers[p]
Contributor:

this looks like we're filtering something, mind adding a comment on what we're filtering on?

Collaborator Author:

We are filtering out the peers that are not already in our peer list; I can add a comment to that effect.

gossipsub.go Outdated

    }

    func (gs *GossipSubRouter) heartbeatTimer() {
        ticker := time.NewTicker(1 * time.Second)
Contributor:

A heartbeat every second? Does every heartbeat send messages?

Collaborator Author:

Not necessarily.
It will send GRAFT/PRUNE if it needs to adjust the overlay, and schedule gossip for piggybacking.
If the gossip is not sent by the next heartbeat, then it will be flushed in its own messages.
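
A rough sketch of that piggyback-or-flush idea, using hypothetical type and field names (ControlIHave, Router.gossip, RPC) rather than the exact ones in this PR: pending gossip is attached to any message we were going to send to the peer anyway, and whatever is still pending when the next heartbeat runs is sent in a control-only message of its own.

    package main

    // ControlIHave and RPC are simplified, hypothetical stand-ins for the
    // protobuf control messages used by the router.
    type ControlIHave struct {
        Topic      string
        MessageIDs []string
    }

    type RPC struct {
        Control []ControlIHave
    }

    type Router struct {
        gossip map[string][]ControlIHave // peer ID -> pending IHAVE gossip
    }

    // piggybackGossip attaches (and clears) pending gossip when a message is
    // about to be sent to this peer anyway.
    func (r *Router) piggybackGossip(peer string, out *RPC) {
        if ihave, ok := r.gossip[peer]; ok {
            out.Control = append(out.Control, ihave...)
            delete(r.gossip, peer)
        }
    }

    // flushGossip runs at the next heartbeat: anything that did not get
    // piggybacked is flushed in its own control-only message.
    func (r *Router) flushGossip(send func(peer string, out *RPC)) {
        for peer, ihave := range r.gossip {
            send(peer, &RPC{Control: ihave})
            delete(r.gossip, peer)
        }
    }

    func main() {}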

        }
    }

    func (gs *GossipSubRouter) heartbeat() {
Contributor:

This function feels too long. Mind trying to break it up a little bit?

Collaborator Author:

Sure, I'll refactor a bit, although I would like to keep the main logic together.

Collaborator Author:

I can easily refactor out the part that sends the GRAFT/PRUNE with coalescing, which is also incidental to the heartbeat logic.
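
A minimal sketch of what that refactor might look like, with hypothetical helper types (not the exact code in this PR): collect the per-topic GRAFTs and PRUNEs computed during the heartbeat and send at most one coalesced control message per peer.

    package sketch

    // Hypothetical, simplified control-message types; the real router uses
    // protobuf messages.
    type ControlGraft struct{ Topic string }
    type ControlPrune struct{ Topic string }

    type ControlMessage struct {
        Graft []ControlGraft
        Prune []ControlPrune
    }

    // sendGraftPrune takes the topics to graft and prune per peer (as computed
    // by the heartbeat) and emits one coalesced control message per peer.
    func sendGraftPrune(
        tograft map[string][]string, // peer -> topics to GRAFT
        toprune map[string][]string, // peer -> topics to PRUNE
        send func(peer string, ctl *ControlMessage),
    ) {
        for peer, topics := range tograft {
            ctl := &ControlMessage{}
            for _, topic := range topics {
                ctl.Graft = append(ctl.Graft, ControlGraft{Topic: topic})
            }
            // coalesce any prunes destined for the same peer
            for _, topic := range toprune[peer] {
                ctl.Prune = append(ctl.Prune, ControlPrune{Topic: topic})
            }
            delete(toprune, peer)
            send(peer, ctl)
        }
        // peers that only have prunes
        for peer, topics := range toprune {
            ctl := &ControlMessage{}
            for _, topic := range topics {
                ctl.Prune = append(ctl.Prune, ControlPrune{Topic: topic})
            }
            send(peer, ctl)
        }
    }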

Collaborator Author:

done

@vyzo (Collaborator, Author) commented Feb 22, 2018

Implemented retry of ANNOUNCE messages in pubsub, now that we have a test that exercises the relevant code paths.

    }

    // wait for heartbeats to build mesh
    time.Sleep(time.Second * 2)
Contributor:

Why do we always have to wait for heartbeats? Is it because we don't send out subscription notices immediately now?

Collaborator Author:

Well, subscription notices are still sent immediately, just retried if they fail.

Now that I think of it, we can probably greatly reduce this delay in almost all the tests -- maybe down to say 100ms, just enough for announcements to go out.

My rationale for this delay was to wait a couple of heartbeats to avoid interference from nodes that have joined but haven't seen any peer announcements yet. I also wanted to avoid interference from the overlay construction, but the overlay should still be connected after the announcements go out and nodes pick their peers.

@vyzo (Collaborator, Author) Feb 27, 2018:

Actually, there is a genuine concern that is an artifact of the concurrent Join from subscriptions.
If we subscribe all the nodes together, then they won't have any peer announcements when they do the Join, and they'll have to wait a heartbeat before they start adding peers to the mesh.
We can avoid this if we add a small (say 10ms) delay after each subscription.

Contributor:

Hrm... I'm very skeptical of 'fixing' things by adding delays.

@vyzo (Collaborator, Author) Feb 28, 2018:

Actually, in most of the tests the subscriptions are created before connecting the network, which means that all nodes start empty and build the mesh purely in the heartbeat.


@vyzo It's really really bad practice to use delays like this, especially without a select statement to escape out of it if the context is canceled.

@vyzo (Collaborator, Author) Mar 6, 2018:

@paralin this is just a test that needs a delay -- and there is nothing to cancel the context so a select would be totally useless.


@vyzo Gotcha, I notice now it's a test.

@whyrusleeping (Contributor):

Could we add some tests that check the number of messages sent, and maybe a way of tracking the overall efficiency of the implementation (like how many nodes received the same message from multiple peers), perhaps in terms of bandwidth overhead? For example: received 500 bytes for every 200 bytes of useful data at a message size of 100 bytes.
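
As an illustration of such a metric, here is a minimal, hypothetical sketch (none of these counters exist in the PR; the names are made up) of how an integration test could track received-versus-useful bytes and duplicate deliveries:

    package sketch

    // DeliveryStats is a hypothetical per-node counter for an integration test:
    // overhead = bytes received (duplicates included) / bytes of first-time
    // ("useful") deliveries.
    type DeliveryStats struct {
        BytesReceived int64 // all pubsub message bytes received
        UsefulBytes   int64 // bytes of messages delivered for the first time
        Duplicates    int64 // messages received again from another peer
    }

    func (s *DeliveryStats) Record(msgSize int, firstDelivery bool) {
        s.BytesReceived += int64(msgSize)
        if firstDelivery {
            s.UsefulBytes += int64(msgSize)
        } else {
            s.Duplicates++
        }
    }

    // Overhead returns the received-to-useful ratio; e.g. 2.5 means 500 bytes
    // received for every 200 bytes of useful data.
    func (s *DeliveryStats) Overhead() float64 {
        if s.UsefulBytes == 0 {
            return 0
        }
        return float64(s.BytesReceived) / float64(s.UsefulBytes)
    }

A test could then fail if the gossipsub ratio exceeds the floodsub ratio measured on the same topology, along the lines of the threshold discussed below.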

@vyzo (Collaborator, Author) commented Feb 27, 2018

Hrm, these are tests I would like to have too, but I'm not sure they are really unit tests.
What would be the conditions for test failure?

@whyrusleeping (Contributor):

Yeah, they are definitely integration tests; no need to write them as unit tests. We should run these tests for floodsub and for gossipsub, compare the results, and choose some failure threshold, i.e. gossipsub should not use more bandwidth than floodsub.

@vyzo (Collaborator, Author) commented Mar 6, 2018

We have developed a conflict, so I will rebase.

@vyzo (Collaborator, Author) commented Mar 6, 2018

Rebased; also added a context done check that was missing in the announce retry goroutine.
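
A minimal sketch of the pattern being described, with hypothetical names and retry policy (the actual pubsub retry code may differ): retry the announcement a bounded number of times, but bail out as soon as the context is done.

    package sketch

    import (
        "context"
        "time"
    )

    // announceRetry is a hypothetical announce retry loop that respects
    // context cancellation.
    func announceRetry(ctx context.Context, announce func() error) {
        for i := 0; i < 5; i++ {
            select {
            case <-time.After(time.Second):
                // wait a bit before retrying
            case <-ctx.Done():
                // the check in question: without it, the goroutine would keep
                // retrying (and leak) after the pubsub instance shuts down
                return
            }
            if err := announce(); err == nil {
                return
            }
        }
    }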

@vyzo (Collaborator, Author) commented Mar 6, 2018

The TestGossipsubControlPiggyback test would occasionally hang at line 704 because of #69, so I added a fix for that issue.

@ORBAT commented Mar 6, 2018

Hi folks! Any particular reason you went with a custom protocol instead of something built on Chord/Pastry/PolderCast etc?

@vyzo (Collaborator, Author) commented Mar 6, 2018

Hrm, it seems the fix lost the coverage for control piggybacking, probably because of the slowdown from the Errorf logging.
I will downgrade that to Infof.

@vyzo (Collaborator, Author) commented Mar 6, 2018

> Hi folks! Any particular reason you went with a custom protocol instead of something built on Chord/Pastry/PolderCast etc?

@ORBAT Several reasons: simplicity of implementation, robustness, and, perhaps most important of all, backwards compatibility with floodsub so that we can deploy easily.

@vyzo (Collaborator, Author) commented Mar 6, 2018

And control piggyback coverage is back, at least for GRAFT.

@whyrusleeping (Contributor):

@ORBAT It looks like PolderCast didn't make it into our pubsub research reading list: https://ipfs.io/ipfs/QmNinWNHd287finciBwbgovkAqEBQKvnys1W26sY8uupc5/pubsub%20reading%20list.pdf
Likely because the paper costs $30 to read.

In any case, we've done a pretty thorough review of the problem space before we arrived at our version zero protocol, floodsub, with the idea that it is the base-layer protocol and provides very few guarantees. This code, which we're calling gossipsub, is an iterative improvement over floodsub that essentially only adds fairly simple tree pruning via gossip. Simplicity and ease of implementation are very important for us; gossipsub can be implemented in 150 lines of Scheme and not too many lines of Go.

That said, this is still under review. Review of the protocol and/or implementation is very much welcome.

@Stebalien (Member):

I've rebased but I'm having some trouble reproducing the issue. I'm currently running the test in a loop to see if that gets me anywhere.

@Stebalien (Member):

No dice.

@mhchia commented Jun 13, 2018

@whyrusleeping @Stebalien
Sorry for pointing out the wrong issue.
It looks like the problem is in my local environment.
I will figure out what's wrong here.
Thanks a lot for the help!

@jamesray1 (Contributor):

@whyrusleeping I just read libp2p/interop#1. Having a daemon will of course be useful, although not having to depend on code in Go is preferable, and JSON tests are needed.

@whyrusleeping merged commit b53a056 into master on Jul 11, 2018.
@ghost removed the "in progress" label on Jul 11, 2018.
@whyrusleeping deleted the feat/gossipsub branch on Jul 11, 2018 at 09:16.
@whyrusleeping (Contributor):

I had no reason not to merge this, so I did. Next step: putting it behind a flag in ipfs.

@daviddias (Member):

@whyrusleeping shouldn't this be a separate pubsub implementation so that folks can pick the pubsub implementation to use?

Will this PR make PubSub in go-ipfs not interop with js-ipfs?

@daviddias (Member) commented Jul 12, 2018

@whyrusleeping just confirmed that this package would be better named go-pubsub, and then we would have two others, go-floodsub and go-gossipsub, to plug in here; but refactoring things in Go is hard, so that will happen later.

Interop remains.

@jamesray1 (Contributor) commented Jul 16, 2018

I can't find any mention of the DHT (looking in relation to the statement "The initial contact nodes can be obtained via rendezvous with DHT provider records" here). Will this be done in a separate interface (one that uses the DHT in libp2p, as well as gossipsub)?

Also, you really should use named constants instead of literals.

Should we also specify a common source of randomness for interoperability?

@mhchia commented Jul 16, 2018

@jamesray1
Maybe it is because we can use a different routing mechanism in the underlying overlay, not necessarily the DHT? In that case, "The initial contact nodes can be obtained via rendezvous with DHT provider records" might only be an example.

@jamesray1 (Contributor):

@mhchia, sure, that's fine.

@whyrusleeping (Contributor):

@diasdavid

> but refactoring things in go is hard so that will happen later.

It's more "extracting things into multiple packages in Go is annoying to do when you might be changing things in both really soon".

@jamesray1 @mhchia Yeah, the DHT is only an example. You can use any means to rendezvous. Take a look at our rendezvous spec proposal for ideas towards a more specialized way of doing rendezvous.

@jamesray1 (Contributor) commented Jul 24, 2018

What duration should we use for request timeouts?

Context: implementing a system config for Kademlia to use to get nodes.

https://github.com/libp2p/rust-libp2p/blob/7507e0bfd9f11520f2d6291120f1b68d0afce80a/kad/src/high_level.rs#L36

As for the timeout duration, according to RabbitMQ it should be twice the heartbeat interval, which is 1 s in this Go implementation, so based on that it would be 2 s. However, later on the same page it says a timeout of 5 to 20 s is optimal.

My guess is to use 40000 s for kbuckets_timeout (the duration after which a node in the k-buckets needs to be pinged again), but I'm not really sure; perhaps the spec and this implementation should also define initialization?

I'll look further into this.

https://www.kth.se/social/upload/516479a5f276545d6a965080/3-kademlia.pdf says tRefresh is 3600 s, after which an otherwise unaccessed bucket must be refreshed, which is supported by http://www.scs.stanford.edu/%7Edm/home/papers/kpos.pdf, but this isn't explicitly the same as the duration after which a node needs to be pinged again, although node IDs are stored in each k-bucket.

OK, at the moment I'm selecting:

    /// tRefresh in Kademlia implementations, sources:
    /// http://xlattice.sourceforge.net/components/protocol/kademlia/specs.html#refresh
    /// https://www.kth.se/social/upload/516479a5f276545d6a965080/3-kademlia.pdf
    /// 1 hour
    kbuckets_timeout: Duration::from_secs(60 * 60),
    /// go gossipsub uses 1 s:
    /// https://github.com/libp2p/go-floodsub/pull/67/files#diff-013da88fee30f5c765f693797e8b358dR30
    /// However, https://www.rabbitmq.com/heartbeats.html#heartbeats-timeout uses 60 s, and
    /// https://gist.github.com/gubatron/cd9cfa66839e18e49846#routing-table uses 15 minutes.
    /// Let's make a conservative selection and choose 15 minutes for an alpha release.
    request_timeout: Duration::from_secs(15 * 60),

    }

    func (gs *GossipSubRouter) handleIWant(ctl *pb.ControlMessage) []*pb.Message {
        ihave := make(map[string]*pb.Message)
Member:

Does this need to be a map, or could it be a slice? Is it a map to deduplicate message IDs?

Collaborator Author:

Yes, it needs to deduplicate.
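
A minimal sketch of the deduplication pattern under discussion, with simplified types (the real handleIWant works on protobuf control messages): keying by message ID means a message requested more than once in the same IWANT is only returned once, which a plain slice would not guarantee.

    package sketch

    // respondToIWant is a hypothetical, simplified version of IWANT handling.
    func respondToIWant(wantedIDs []string, lookup func(id string) (msg string, ok bool)) []string {
        ihave := make(map[string]string)
        for _, id := range wantedIDs {
            if msg, ok := lookup(id); ok {
                ihave[id] = msg // a duplicate ID just overwrites the same entry
            }
        }
        out := make([]string, 0, len(ihave))
        for _, msg := range ihave {
            out = append(out, msg)
        }
        return out
    }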
