This repository has been archived by the owner on Sep 6, 2022. It is now read-only.

BandwidthCounter is erroneously reporting huge spikes #65

Closed
albrow opened this issue Oct 15, 2019 · 23 comments
Labels
kind/bug A bug in existing code (including security flaws)

Comments

@albrow

albrow commented Oct 15, 2019

We recently added low-level rate-limiting in 0x Mesh. It works by using a BandwidthCounter to periodically check the incoming bandwidth for each peer and then banning peers who exceed the bandwidth limit. We are also piping logs to our Elasticsearch stack so we can monitor bandwidth usage.
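
For context, the check is conceptually along the lines of the sketch below. This is a simplified illustration rather than our exact code; the package name, the 30-second interval, the 100 KiB/s limit, and the banPeer helper are all placeholders:

// Simplified sketch of the rate-limiting check described above (not our
// exact code; the interval, limit, and banPeer helper are placeholders).
package bandwidth

import (
	"time"

	"github.com/libp2p/go-libp2p-core/metrics"
	"github.com/libp2p/go-libp2p-core/peer"
)

const maxBytesPerSecond = 100 * 1024 // hypothetical per-peer incoming limit

func monitor(counter *metrics.BandwidthCounter, banPeer func(peer.ID)) {
	for range time.Tick(30 * time.Second) {
		for id, stats := range counter.GetBandwidthByPeer() {
			// stats.RateIn is the counter's estimate of incoming bytes/second for this peer.
			if stats.RateIn > maxBytesPerSecond {
				banPeer(id)
			}
		}
	}
}

(The BandwidthCounter itself is typically wired into libp2p via the libp2p.BandwidthReporter option.)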

The logs are showing occasional massive spikes in bandwidth usage, far higher than what is reasonable/possible. In one case BandwidthCounter reported an incoming bandwidth of 9 exabytes/second from a single peer, which unfortunately caused that peer to be temporarily banned 😦. There is no way we are actually using that much bandwidth on commodity cloud hosting services like AWS or DigitalOcean, even over a very short time interval. I think the only reasonable explanation is a bug in BandwidthCounter.

I'll attach some screenshots from our logging infrastructure. Let me know if any additional information could be helpful.

[Screenshot: Screen Shot 2019-10-15 at 2 32 30 PM]

[Screenshot: Screen Shot 2019-10-15 at 11 22 39 AM]

This is what normal bandwidth usage looks like (incoming bandwidth maxes out at around 1.5 MB/s total across all peers):

[Screenshot: Screen Shot 2019-10-15 at 2 34 50 PM]

vyzo added the kind/bug label on Oct 15, 2019
@Stebalien
Member

Stebalien commented Oct 16, 2019

Hm. So, this could be due to how we record bandwidth internally. We:

  1. Write.
  2. Record that we sent the bytes.
  3. Then, in a different thread, we keep an EWMA by averaging bandwidth second by second. We use a ticker (not a timer) for this.

Unfortunately, this means:

  1. A large write can be treated as happening in a single instant.
  2. If the ticker falls behind for some reason (heavily loaded machine), we'll get a very short tick (two ticks back to back).

Given a 1ns "tick" and a 10MiB message, we'll get a 9 exabyte spike.
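
In pseudocode, the per-tick update is roughly like the sketch below (illustrative only, not the exact go-flow-metrics code; the names are made up):

package sketch

import "time"

// updateRate illustrates the per-tick EWMA update described above. A tiny
// tdiff between two back-to-back ticks inflates "instant" even when only a
// modest number of bytes was recorded during that tick.
func updateRate(rate float64, bytesThisTick uint64, tdiff time.Duration, alpha float64) float64 {
	timeMultiplier := float64(time.Second) / float64(tdiff)
	instant := timeMultiplier * float64(bytesThisTick)
	if rate == 0 {
		return instant
	}
	return rate + alpha*(instant-rate) // exponentially weighted moving average
}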


Alternatively, this could be a clock adjustment, although Go's monotonic clock should prevent that.

@Stebalien
Member

Stebalien commented Oct 16, 2019 via email

@albrow
Author

albrow commented Oct 16, 2019

@Stebalien thanks for your response.

I'm skeptical that this is purely due to a short sample time. Let's suppose that we recorded bandwidth over a period of 1 nanosecond (which is the shortest possible due to the precision of time.Time). In order for us to get a reading of 9 exabytes/second we would need to observe 9 gigabytes of bandwidth in a single nanosecond. In other words 9 exabytes/second == 9 gigabytes/nanosecond. That doesn't really seem possible. Is there maybe something else going on here?
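
Just to spell the arithmetic out (a quick sanity check, nothing more):

package main

import "fmt"

func main() {
	const reportedRate = 9e18 // bytes/second, i.e. ~9 exabytes/second
	const window = 1e-9       // seconds, i.e. one nanosecond
	// Bytes that would have to arrive within a single nanosecond to produce
	// that reading: 9e18 * 1e-9 = 9e9, i.e. roughly 9 gigabytes.
	fmt.Println(reportedRate * window) // 9e+09
}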

@albrow
Author

albrow commented Oct 16, 2019

In terms of reproducibility, since we don't know exactly what is causing the issue, it can take some time to reproduce. The spikes are consistent but not very frequent. The graphs I shared show spikes occurring across the entire network (about 3-5 spikes per day on average). But if you just look at a single node, we're seeing one spike every 2-5 days.

Once you have a new version of libp2p/go-libp2p-core/metrics or libp2p/go-flow-metrics for us to test, it could take up to a few days for us to confirm whether the issue is still occurring.

@Stebalien
Member

You're right. I was calculating in tebibytes, not exbibytes.

  • How are you extracting this information from libp2p? I know we had an issue with spikes but it was in our logging framework (or grafana? Can't remember).
  • We could have a bad reader/writer that's returning -1 (along with an error).
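
To illustrate the second possibility: if a negative byte count ever reached an unsigned accumulator, the conversion alone would explain a huge total. This is a generic sketch, not a claim about where in the stack that would happen:

package main

import "fmt"

func main() {
	// If a reader/writer ever reported -1 bytes and that count were recorded
	// into a uint64 accumulator, the conversion alone would produce a huge total.
	n := -1
	fmt.Println(uint64(n)) // 18446744073709551615, i.e. ~1.8e19 "bytes"
}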

@albrow
Author

albrow commented Oct 18, 2019

How are you extracting this information from libp2p?

You can see our code that logs bandwidth usage every 5 minutes, and the separate code that bans peers which have high bandwidth usage. In both cases the code is really straightforward and we aren't doing any additional math or processing on top of what we get from BandwidthCounter.

These logs get converted to JSON and sent to Elasticsearch via Fluent Bit. The screenshots I shared are from Kibana, which we use to visualize the data in Elasticsearch.

One thing you just made me realize is that JSON cannot represent extraordinarily large numbers accurately. While that is true, I don't think it completely explains the behavior we're seeing. Number.MAX_SAFE_INTEGER is 2^53-1, and the large majority of the spikes we're seeing are below that number. Only the largest spike we saw (9 exabytes/sec) exceeds Number.MAX_SAFE_INTEGER. We likely lost some accuracy there, but that figure should still be roughly correct in terms of magnitude.
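
To illustrate the precision point (a standalone snippet, not part of our pipeline):

package main

import "fmt"

func main() {
	// JSON numbers are effectively float64, which can only represent integers
	// up to 2^53 exactly. A hypothetical value just above that loses its
	// low-order digits but keeps its order of magnitude.
	const spike = uint64(9007199254740993) // 2^53 + 1
	fmt.Println(spike)          // 9007199254740993
	fmt.Println(float64(spike)) // 9.007199254740992e+15
}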

Moreover, we can see that some peers are actually getting banned as a result of these spikes. That indicates that the numbers inside of Go are extraordinarily high independent of the limits of the JSON format or hypothetical issues with our ELK stack.

@albrow
Author

albrow commented Oct 22, 2019

@Stebalien do you still want me to try out libp2p/go-flow-metrics#8? I'm not sure we reached consensus on that fix, but it might still be worth a try.

@Stebalien
Member

Actually... are you doing this from WebAssembly? If the clock isn't monotonic, we could end up:

  1. Skipping a bunch of updates because the last update time is in the future.
  2. Applying all the updates at once, once we reach the last update time.

However, we'd still need to push 9 GiB of data for that to happen. On the other hand, that could happen every few days. Still, this is really fishy (unless WebAssembly is giving us a fake time?).

I've pushed a fix for time travel to that PR in case you'd like to test it. But I'm far from confident that it's the issue.

@albrow
Author

albrow commented Oct 22, 2019

@Stebalien Good question. None of the results I've shared so far are from WebAssembly. (We actually don't have a way to collect logs from browser nodes at the moment).

I've pushed a fix for time travel to that PR in case you'd like to test it. But I'm far from confident that it's the issue.

Got it. We might try that PR if there is nothing else for us to do. In the meantime, we probably will need to implement a workaround on our end to account for these spikes. It's hard to do because most workarounds we can think of also open doors to potential attackers/spammers.

@Stebalien
Member

None of the results I've shared so far are from WebAssembly.

Damn. Ok. So, Go is usually pretty good about monotonic clocks. I only worried about WebAssembly because it's new and relatively unsupported.

@Stebalien
Member

I really can't find the bug. I've combed over github.com/libp2p/go-flow-metrics and can't find anything that could cause this.

If you're willing to annotate that code with some debug logging, that would help us figure out what's going on here. Specifically, we'd want to:

  1. Detect if Rate is very large here: https://github.com/libp2p/go-flow-metrics/blob/655b4706c9ab7e08ce0c43775ccb8f9c4fdd4f81/sweeper.go#L111
  2. If so, log an error with everything (total, rate, instant, previous total, previous rate, new time, old time).

@albrow
Author

albrow commented Oct 22, 2019

@Stebalien Good idea. Yes, we should be able to add logging for debugging purposes. But keep in mind that the feedback loop here will be fairly slow.

@Stebalien
Member

Understood. At this point, I really have no idea what's going on.

@albrow
Author

albrow commented Nov 8, 2019

@Stebalien we have a PR here: 0xProject/0x-mesh#513 which will log everything you asked for (and a bit more) whenever the reported rate is extraordinarily high. As I mentioned earlier, it might take a few days to get results. I'll report anything we find on this thread.

@albrow
Author

albrow commented Nov 8, 2019

@Stebalien Actually, it looks like I already have some useful logs. If I'm interpreting this correctly, the bug is occurring a lot more frequently than we thought. (Perhaps it only seemed rare because, relatively speaking, we don't check bandwidth inside of Mesh very frequently.) I added the following logging code to go-flow-metrics:

oldRate := m.snapshot.Rate
if m.snapshot.Rate == 0 {
	m.snapshot.Rate = instant
} else {
	m.snapshot.Rate += alpha * (instant - m.snapshot.Rate)
}
if m.snapshot.Rate >= 100000000 {
	logrus.WithFields(logrus.Fields{
		"oldRate":             oldRate,
		"rate":                m.snapshot.Rate,
		"oldTotal":            m.snapshot.Total,
		"newTotal":            total,
		"instant":             instant,
		"diff":                diff,
		"alpha":               alpha,
		"snapshot.LastUpdate": m.snapshot.LastUpdate,
		"sw.LastUpdate":       sw.lastUpdateTime,
		"tdiff":               tdiff,
		"timeMultiplier":      timeMultiplier,
	}).Debug("abnormal rate inside go-flow-metrics")
}

Here's a gist containing some recent results. Let me know if I can provide any additional information.

@Stebalien
Member

Ok, so we're resetting the total to 0 for some reason.

@Stebalien
Member

The bug happens here:

{
  "alpha_number": 0.6321205588285577,
  "diff_number": 18446744073709552000,
  "instant_number": 18467629448269930000,
  "level": "debug",
  "msg": "abnormal rate inside go-flow-metrics",
  "myPeerID": "16Uiu2HAm2qhk786g2KC5KvhHqtwot2hDbCLtwN82MJUrXDAAWzYU",
  "newTotal_number": 0,
  "oldRate_number": 0,
  "oldTotal_number": 818,
  "rate_number": 18467629448269930000,
  "snapshot.LastUpdate_string": "2019-11-07T12:53:55.723148-08:00",
  "sw.LastUpdate_string": "2019-11-07T12:53:55.723148-08:00",
  "tdiff_number": 998869082,
  "time": "2019-11-07T12:53:55-08:00",
  "timeMultiplier_number": 1.0011321984235768
}

We do reset the accumulator to 0 when we mark the meter as "idle" but then we go through a very careful dance to unregister old meters.
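
For what it's worth, the diff in that log line is exactly what unsigned wraparound produces when the accumulator has been zeroed while the snapshot still holds the old total (a minimal sketch, not the sweeper code itself):

package main

import "fmt"

func main() {
	oldTotal := uint64(818) // snapshot.Total from the log above
	newTotal := uint64(0)   // the accumulator after being reset to zero
	diff := newTotal - oldTotal
	fmt.Println(diff) // 18446744073709550798, i.e. just under 2^64
}

(The logged diff_number of 18446744073709552000 looks like that same value after float64/JSON rounding.)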

@vyzo, could you go over the code in sweeper.go in https://github.com/libp2p/go-flow-metrics and see if you can spot a bug in the logic? I can't find anything.

@vyzo
Contributor

vyzo commented Nov 8, 2019

The logic is not so simple, and there is suspicious floating-point arithmetic involved.

@vyzo
Contributor

vyzo commented Nov 8, 2019

So possibly an issue is the subtraction of snapshot.Total from the accumulator when we are re-adding.
The comment says "Remove the snapshot total, it'll get added back on registration.", but I don't see that happening on register.

@vyzo
Contributor

vyzo commented Nov 8, 2019

Ah, it happens at the bottom of the loop; nvm.

@vyzo
Contributor

vyzo commented Nov 8, 2019

So the issue seems to be here: https://github.com/libp2p/go-flow-metrics/blob/master/sweeper.go#L151

We copy a meter from the back that hasn't been visited yet and could potentially be an idle meter.
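
Generic illustration of the pitfall (not the sweeper code itself; meters and isIdle are made-up names):

for i := 0; i < len(meters); i++ {
	if isIdle(meters[i]) {
		// Copy the last element into slot i and shrink the slice. The copied
		// element has not been visited yet and may itself be an idle meter;
		// without revisiting slot i, it is never processed on this pass.
		meters[i] = meters[len(meters)-1]
		meters = meters[:len(meters)-1]
	}
}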

@vyzo
Contributor

vyzo commented Nov 8, 2019

libp2p/go-flow-metrics#11 fixes the issue discovered, and hopefully squashes the bug.
@albrow can you test it to verify that it fixes the problem?

@vyzo
Contributor

vyzo commented Nov 8, 2019

The bug might be worse than we thought. In my attempt to write a regression test in libp2p/go-flow-metrics#12, I didn't trigger the spike, but I did get mangled totals for the affected meters.
