This repository has been archived by the owner on Sep 6, 2022. It is now read-only.

BandwidthCounter is erroneously reporting huge spikes #65

Closed
albrow opened this issue Oct 15, 2019 · 23 comments
Labels
kind/bug A bug in existing code (including security flaws)

Comments

@albrow

albrow commented Oct 15, 2019

We recently added low-level rate-limiting in 0x Mesh. It works by using a BandwidthCounter to periodically check the incoming bandwidth for each peer and then banning peers who exceed the bandwidth limit. We are also piping logs to our Elasticsearch stack so we can monitor bandwidth usage.
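
For context, the check is conceptually along the lines of the sketch below. This is a simplified illustration rather than our exact code; the package name, the 30-second interval, the 100 KiB/s limit, and the banPeer helper are all placeholders:

// Simplified sketch of the rate-limiting check described above (not our
// exact code; the interval, limit, and banPeer helper are placeholders).
package bandwidth

import (
	"time"

	"github.com/libp2p/go-libp2p-core/metrics"
	"github.com/libp2p/go-libp2p-core/peer"
)

const maxBytesPerSecond = 100 * 1024 // hypothetical per-peer incoming limit

func monitor(counter *metrics.BandwidthCounter, banPeer func(peer.ID)) {
	for range time.Tick(30 * time.Second) {
		for id, stats := range counter.GetBandwidthByPeer() {
			// stats.RateIn is the counter's estimate of incoming bytes/second for this peer.
			if stats.RateIn > maxBytesPerSecond {
				banPeer(id)
			}
		}
	}
}

(The BandwidthCounter itself is typically wired into libp2p via the libp2p.BandwidthReporter option.)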

The logs are showing occasional massive spikes in bandwidth usage, far higher than what is reasonable/possible. In one case BandwidthCounter reported an incoming bandwidth of 9 exabytes/second from a single peer, which unfortunately caused that peer to be temporarily banned 😦. There is no way we are actually using that much bandwidth on commodity cloud hosting services like AWS or DigitalOcean, even over a very short time interval. I think the only reasonable explanation is a bug in BandwidthCounter.

I'll attach some screenshots from our logging infrastructure. Let me know if any additional information could be helpful.

[Screenshot: Screen Shot 2019-10-15 at 2 32 30 PM]

[Screenshot: Screen Shot 2019-10-15 at 11 22 39 AM]

This is what normal bandwidth usage looks like (incoming bandwidth maxes out at around 1.5 MB/s total across all peers):

[Screenshot: Screen Shot 2019-10-15 at 2 34 50 PM]

vyzo added the kind/bug label on Oct 15, 2019
@Stebalien
Member

Stebalien commented Oct 16, 2019

Hm. So, this could be due to how we record bandwidth internally. We:

  1. Write.
  2. Record that we sent the bytes.
  3. Then, in a different thread, we keep an EWMA by averaging bandwidth second by second. We use a ticker (not a timer) for this.

Unfortunately, this means:

  1. A large write can be treated as happening in a single instant.
  2. If the ticker falls behind for some reason (heavily loaded machine), we'll get a very short tick (two ticks back to back).

Given a 1ns "tick" and a 10MiB message, we'll get a 9 exabyte spike.
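
In pseudocode, the per-tick update is roughly like the sketch below (illustrative only, not the exact go-flow-metrics code; the names are made up):

package sketch

import "time"

// updateRate illustrates the per-tick EWMA update described above. A tiny
// tdiff between two back-to-back ticks inflates "instant" even when only a
// modest number of bytes was recorded during that tick.
func updateRate(rate float64, bytesThisTick uint64, tdiff time.Duration, alpha float64) float64 {
	timeMultiplier := float64(time.Second) / float64(tdiff)
	instant := timeMultiplier * float64(bytesThisTick)
	if rate == 0 {
		return instant
	}
	return rate + alpha*(instant-rate) // exponentially weighted moving average
}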


Alternatively, this could be a clock adjustment, although Go's monotonic clock should prevent that.

@Stebalien
Member

Stebalien commented Oct 16, 2019 via email

@albrow
Author

albrow commented Oct 16, 2019

@Stebalien thanks for your response.

I'm skeptical that this is purely due to a short sample time. Let's suppose that we recorded bandwidth over a period of 1 nanosecond (which is the shortest possible due to the precision of time.Time). In order for us to get a reading of 9 exabytes/second we would need to observe 9 gigabytes of bandwidth in a single nanosecond. In other words 9 exabytes/second == 9 gigabytes/nanosecond. That doesn't really seem possible. Is there maybe something else going on here?
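
Just to spell the arithmetic out (a quick sanity check, nothing more):

package main

import "fmt"

func main() {
	const reportedRate = 9e18 // bytes/second, i.e. ~9 exabytes/second
	const window = 1e-9       // seconds, i.e. one nanosecond
	// Bytes that would have to arrive within a single nanosecond to produce
	// that reading: 9e18 * 1e-9 = 9e9, i.e. roughly 9 gigabytes.
	fmt.Println(reportedRate * window) // 9e+09
}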

@albrow
Author

albrow commented Oct 16, 2019

In terms of reproducibility, since we don't know exactly what is causing the issue, it can take some time to reproduce. The spikes are consistent but not very frequent. The graphs I shared show spikes occurring across the entire network (about 3-5 spikes per day on average). But if you just look at a single node, we're seeing one spike every 2-5 days.

Once you have a new version of libp2p/go-libp2p-core/metrics or libp2p/go-flow-metrics for us to test, it could take up to a few days for us to confirm whether the issue is still occurring.

@Stebalien
Member

You're right. I was calculating in tebibytes, not exbibytes.

  • How are you extracting this information from libp2p? I know we had an issue with spikes but it was in our logging framework (or grafana? Can't remember).
  • We could have a bad reader/writer that's returning -1 (along with an error).
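
To illustrate the second possibility: if a negative byte count ever reached an unsigned accumulator, the conversion alone would explain a huge total. This is a generic sketch, not a claim about where in the stack that would happen:

package main

import "fmt"

func main() {
	// If a reader/writer ever reported -1 bytes and that count were recorded
	// into a uint64 accumulator, the conversion alone would produce a huge total.
	n := -1
	fmt.Println(uint64(n)) // 18446744073709551615, i.e. ~1.8e19 "bytes"
}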

@albrow
Author

albrow commented Oct 18, 2019

How are you extracting this information from libp2p?

You can see our code that logs bandwidth usage every 5 minutes, and the separate code that bans peers which have high bandwidth usage. In both cases the code is really straightforward and we aren't doing any additional math or processing on top of what we get from BandwidthCounter.

These logs get converted to JSON and sent to Elasticsearch via Fluent Bit. The screenshots I shared are from Kibana, which we use to visualize the data in Elasticsearch.

One thing you just made me realize is that JSON cannot represent extraordinarily large numbers accurately. While that is true, I don't think it completely explains the behavior we're seeing. Number.MAX_SAFE_INTEGER is 2^53-1, and the large majority of the spikes we're seeing are below that number. Only the largest spike we saw (9 exabytes/sec) exceeds Number.MAX_SAFE_INTEGER. We likely lost some accuracy there, but that figure should still be roughly correct in terms of magnitude.
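
To illustrate the precision point (a standalone snippet, not part of our pipeline):

package main

import "fmt"

func main() {
	// JSON numbers are effectively float64, which can only represent integers
	// up to 2^53 exactly. A hypothetical value just above that loses its
	// low-order digits but keeps its order of magnitude.
	const spike = uint64(9007199254740993) // 2^53 + 1
	fmt.Println(spike)          // 9007199254740993
	fmt.Println(float64(spike)) // 9.007199254740992e+15
}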

Moreover, we can see that some peers are actually getting banned as a result of these spikes. That indicates that the numbers inside of Go are extraordinarily high independent of the limits of the JSON format or hypothetical issues with our ELK stack.

@albrow
Author

albrow commented Oct 22, 2019

@Stebalien do you still want me to try out libp2p/go-flow-metrics#8? I'm not sure we reached consensus on that fix, but it might still be worth a try.

@Stebalien
Member

Actually... are you doing this from WebAssembly? If the clock isn't monotonic, we could end up:

  1. Skipping a bunch of updates because the last update time is in the future.
  2. Applying all the updates at once, once we reach the last update time.

However, we'd still need to push 9 GiB of data for that to happen. On the other hand, that could happen every few days. Still, this is really fishy (unless WebAssembly is giving us a fake time?).

I've pushed a fix for time travel to that PR in case you'd like to test it. But I'm far from confident that it's the issue.

@albrow
Author

albrow commented Oct 22, 2019

@Stebalien Good question. None of the results I've shared so far are from WebAssembly. (We actually don't have a way to collect logs from browser nodes at the moment).

I've pushed a fix for time travel to that PR in case you'd like to test it. But I'm far from confident that it's the issue.

Got it. We might try that PR if there is nothing else for us to do. In the meantime, we probably will need to implement a workaround on our end to account for these spikes. It's hard to do because most workarounds we can think of also open doors to potential attackers/spammers.

@Stebalien
Member

None of the results I've shared so far are from WebAssembly.

Damn. Ok. So, Go is usually pretty good about monotonic clocks. I only worried about WebAssembly because it's new and relatively unsupported.

@Stebalien
Member

I really can't find the bug. I've combed over github.com/libp2p/go-flow-metrics and can't find anything that could cause this.

If you're willing to annotate that code with some debug logging, that would help us figure out what's going on here. Specifically, we'd want to:

  1. Detect if Rate is very large here: https://github.com/libp2p/go-flow-metrics/blob/655b4706c9ab7e08ce0c43775ccb8f9c4fdd4f81/sweeper.go#L111
  2. If so, log an error with everything (total, rate, instant, previous total, previous rate, new time, old time).

@albrow
Author

albrow commented Oct 22, 2019

@Stebalien Good idea. Yes, we should be able to add logging for debugging purposes. But keep in mind that the feedback loop here will be fairly slow.

@Stebalien
Member

Understood. At this point, I really have no idea what's going on.

@albrow
Author

albrow commented Nov 8, 2019

@Stebalien we have a PR here: 0xProject/0x-mesh#513 which will log everything you asked for (and a bit more) whenever the reported rate is extraordinarily high. As I mentioned earlier, it might take a few days to get results. I'll report anything we find on this thread.

@albrow
Author

albrow commented Nov 8, 2019

@Stebalien Actually, it looks like I already have some useful logs. If I'm interpreting this correctly, the bug is occurring a lot more frequently than we thought. (Perhaps it only seemed rare because, relatively speaking, we don't check bandwidth inside of Mesh very frequently.) I added the following logging code to go-flow-metrics:

oldRate := m.snapshot.Rate
if m.snapshot.Rate == 0 {
	m.snapshot.Rate = instant
} else {
	m.snapshot.Rate += alpha * (instant - m.snapshot.Rate)
}
if m.snapshot.Rate >= 100000000 {
	logrus.WithFields(logrus.Fields{
		"oldRate":             oldRate,
		"rate":                m.snapshot.Rate,
		"oldTotal":            m.snapshot.Total,
		"newTotal":            total,
		"instant":             instant,
		"diff":                diff,
		"alpha":               alpha,
		"snapshot.LastUpdate": m.snapshot.LastUpdate,
		"sw.LastUpdate":       sw.lastUpdateTime,
		"tdiff":               tdiff,
		"timeMultiplier":      timeMultiplier,
	}).Debug("abnormal rate inside go-flow-metrics")
}

Here's a gist containing some recent results. Let me know if I can provide any additional information.

@Stebalien
Member

Ok, so we're resetting the total to 0 for some reason.

@Stebalien
Member

The bug happens here:

{
  "alpha_number": 0.6321205588285577,
  "diff_number": 18446744073709552000,
  "instant_number": 18467629448269930000,
  "level": "debug",
  "msg": "abnormal rate inside go-flow-metrics",
  "myPeerID": "16Uiu2HAm2qhk786g2KC5KvhHqtwot2hDbCLtwN82MJUrXDAAWzYU",
  "newTotal_number": 0,
  "oldRate_number": 0,
  "oldTotal_number": 818,
  "rate_number": 18467629448269930000,
  "snapshot.LastUpdate_string": "2019-11-07T12:53:55.723148-08:00",
  "sw.LastUpdate_string": "2019-11-07T12:53:55.723148-08:00",
  "tdiff_number": 998869082,
  "time": "2019-11-07T12:53:55-08:00",
  "timeMultiplier_number": 1.0011321984235768
}

We do reset the accumulator to 0 when we mark the meter as "idle" but then we go through a very careful dance to unregister old meters.
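
For what it's worth, the diff in that log line is exactly what unsigned wraparound produces when the accumulator has been zeroed while the snapshot still holds the old total (a minimal sketch, not the sweeper code itself):

package main

import "fmt"

func main() {
	oldTotal := uint64(818) // snapshot.Total from the log above
	newTotal := uint64(0)   // the accumulator after being reset to zero
	diff := newTotal - oldTotal
	fmt.Println(diff) // 18446744073709550798, i.e. just under 2^64
}

(The logged diff_number of 18446744073709552000 looks like that same value after float64/JSON rounding.)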

@vyzo, could you go over the code in sweeper.go in https://github.com/libp2p/go-flow-metrics and see if you can spot a bug in the logic? I can't find anything.

@vyzo
Contributor

vyzo commented Nov 8, 2019

The logic is not so simple, and there is suspicious floating-point arithmetic involved.

@vyzo
Contributor

vyzo commented Nov 8, 2019

So possibly an issue is the subtraction of snapshot.Total from the accumulator when we are re-adding.
The comment says "Remove the snapshot total, it'll get added back on registration.", but I don't see that happening on register.

@vyzo
Contributor

vyzo commented Nov 8, 2019

Ah, it happens at the bottom of the loop; nvm.

@vyzo
Contributor

vyzo commented Nov 8, 2019

So the issue seems to be here: https://github.com/libp2p/go-flow-metrics/blob/master/sweeper.go#L151

We copy a meter from the back that hasn't been visited yet and could potentially be an idle meter.
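
Generic illustration of the pitfall (not the sweeper code itself; meters and isIdle are made-up names):

for i := 0; i < len(meters); i++ {
	if isIdle(meters[i]) {
		// Copy the last element into slot i and shrink the slice. The copied
		// element has not been visited yet and may itself be an idle meter;
		// without revisiting slot i, it is never processed on this pass.
		meters[i] = meters[len(meters)-1]
		meters = meters[:len(meters)-1]
	}
}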

@vyzo
Contributor

vyzo commented Nov 8, 2019

libp2p/go-flow-metrics#11 fixes the issue discovered, and hopefully squashes the bug.
@albrow can you test it to verify that it fixes the problem?

@vyzo
Contributor

vyzo commented Nov 8, 2019

The bug might be worse than we thought. In my attempt to write a regression test in libp2p/go-flow-metrics#12, I didn't trigger the spike, but I did get mangled totals for the affected meters.
