
fix: TestShareAvailable_DisconnectedFullNodes #2560

Merged - 1 commit merged into main on Aug 17, 2023

Conversation

distractedm1nd (Collaborator)

TestShareAvailable_DisconnectedFullNodes had multiple bugs that caused it to pass as a false positive. The following issues were fixed:

  1. The parameters used no longer allow reconstruction from a single subnetwork, as they did before. This has been verified with a simulation: for s = 20 and k = 16, setting c to 32 gives a 99.5% chance of reconstruction, while 16 (a single subnetwork) gives a 0.5% chance. Before, both values of c gave a near-100% chance of reconstruction.
  2. Increased the timeout for the first reconstruction. There was a check verifying that reconstruction was not possible from a single subnetwork, but it only failed to reconstruct because the timeout was too short. Increasing the timeout while keeping the original parameters confirmed this.
  3. The network topology is now shaped before the first reconstruction (for example, the LNs are separated into subnets).
  4. The final reconstruction is now called in an error group (see the sketch below). The full nodes are codependent: for either one to reconstruct, the other must have sampled. Calling them sequentially fails, because the first full node needs the second full node to have sampled in order to get shares from it. This was another signal that the full nodes could previously reconstruct from a single subnetwork; it is now fixed by letting both FNs sample concurrently.
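
A minimal sketch of the error-group pattern from item 4, assuming placeholder names (fullNode, SharesAvailable, reconstructBoth) rather than the actual test types:

package sketch

import (
	"context"

	"golang.org/x/sync/errgroup"
)

// fullNode stands in for the real full-node handle used in the test.
type fullNode interface {
	SharesAvailable(ctx context.Context) error
}

// reconstructBoth runs both reconstructions concurrently; running them
// sequentially deadlocks, since each FN needs the other to have sampled
// before it can fetch shares from it.
func reconstructBoth(ctx context.Context, fn1, fn2 fullNode) error {
	errg, ctx := errgroup.WithContext(ctx)
	errg.Go(func() error { return fn1.SharesAvailable(ctx) })
	errg.Go(func() error { return fn2.SharesAvailable(ctx) })
	return errg.Wait()
}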

distractedm1nd added the kind:testing (Related to unit tests) and kind:fix (Attached to bug-fixing PRs) labels on Aug 11, 2023
distractedm1nd self-assigned this on Aug 11, 2023
Wondertan (Member)

Thanks for this, @distractedm1nd! Before we merge it, I would like to do a deeper review of the changes. Can this wait until Monday?
Cc @renaynay

Wondertan (Member) left a comment

A few questions, and some recollection of why certain things were done the way they were:

  1. I am trying to understand why 32. I presume you ran some simulations using the Python script from Mus, but this differs from the values in the paper: Table 1 states that for k=16 and s=20, c is 69 with 99% success. Why are these different?

  2. The quadrant timeout multiplied by the number of quadrants (4 quadrants * 2 axes = 8) should work, as that's the total time the Retriever spends anyway; I didn't want tests to run longer than necessary. The problem is in the init func at the top of the file, which sets the Retriever timeout to what turned out to be a low value. I think I didn't catch that due to the false-positive success.

  3. I remember choosing that order purposefully. In other tests, we start reconstruction first and form the topology afterwards, in order to test that FNs are capable of reconstructing with LNs that join after reconstruction starts. In practice, a FN may not have enough LNs to reconstruct, but more LNs would discover it, connect, and serve the missing shares. However, while this ordering is required for other tests, it is not required for this test scenario.

  4. Makes total sense and is a great find!

distractedm1nd (Collaborator, Author)

  1. I know - the problem here is that gamma, IIUC, refers to a withholding attack. In our case, we only need (k + 1)^2 + 1 shares to ensure full recovery (for k = 16, that is 290 of the (2k)^2 = 1024 shares). We should probably write tests that include malicious withholding to back this up.
  2. Okay, if the timeouts are minimal enough and aren't flaky, then I will revert.
  3. Ok
  4. <3

cc @musalbas

distractedm1nd (Collaborator, Author)

For reference, here is my sim:

import random

k = 16                         # EDS is 2k x 2k
c = 32                         # number of light nodes (16 = a single subnetwork)
s = 20                         # samples per light node
total_elements = (2 * k) ** 2
iterations = 10000
count = 0

for _ in range(iterations):
    unique_elements_picked = set()
    for _ in range(c):
        for _ in range(s):
            element = random.randint(0, total_elements - 1)
            unique_elements_picked.add(element)
    # reconstruction succeeds once more than (k + 1)^2 distinct shares are seen
    if len(unique_elements_picked) > (k + 1) ** 2:
        count += 1

probability = count / iterations

print(f"Estimated Probability: {probability}")
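
As a back-of-the-envelope check (not from the PR itself): the expected number of distinct shares hit by c nodes each sampling s times uniformly from N = (2k)^2 cells is N * (1 - (1 - 1/N)^(c*s)), which already explains the split between c = 16 and c = 32:

package main

import (
	"fmt"
	"math"
)

// expectedUnique returns the expected number of distinct shares covered by
// c light nodes, each drawing s uniform samples from an N = (2k)^2 square.
func expectedUnique(k, c, s int) float64 {
	n := float64((2 * k) * (2 * k))
	return n * (1 - math.Pow(1-1/n, float64(c*s)))
}

func main() {
	k := 16
	fmt.Println("needed:", (k+1)*(k+1))             // 289 shares in the random case
	fmt.Println("c=16:", expectedUnique(k, 16, 20)) // ~275 -> below the threshold
	fmt.Println("c=32:", expectedUnique(k, 32, 20)) // ~476 -> comfortably above
}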

musalbas (Member) commented Aug 15, 2023

So you're saying you only need 69 light nodes if the block producer does an explicit block withholding attack (see below), but in normal/random cases where they aren't explicitly withholding data in the shape of a square, you only need 32?

[image: block withholding attack diagram from the paper - the withheld shares form a (k+1) x (k+1) square]

If we wanted this to be super accurate, maybe we could control what shares the light nodes are sampling, so that the tests don't have any randomness to them.
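
One hypothetical way to do that (the names here are illustrative, not the real API): deal fixed, disjoint coordinate sets out to the light nodes, so the test's coverage is exact by construction:

package sketch

// fixedSamples deterministically partitions the (2k)^2 share coordinates
// among the light nodes, s coordinates each, so a test knows exactly which
// shares are covered (assumes nodes*s <= (2k)^2).
func fixedSamples(k, nodes, s int) [][]int {
	out := make([][]int, nodes)
	idx := 0
	for n := range out {
		for j := 0; j < s; j++ {
			out[n] = append(out[n], idx)
			idx++
		}
	}
	return out
}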

distractedm1nd (Collaborator, Author)

@musalbas That is my current understanding - running the simulation to collect at least (k + 1)^2 shares gives me the "correct" parameters: c = 16 always fails and c = 18 always passes, which is consistent with the sim.

I am assuming that gamma represents a case that either happens very rarely or arises when data is withheld in a clever way? Experimentally, using completely random shares, we only need (k + 1)^2.

musalbas (Member)

Yes, it assumes that the adversary is withholding the shares exactly in the shape of a square (see the diagram above). In that case you actually need (2k)^2 - (k+1)^2 + 1 shares - for k = 16, that is 1024 - 289 + 1 = 736.

The number of shares that light nodes need ultimately depends on exactly which shares are missing. Technically, the minimum they need is not even (k+1)^2, but k^2.

walldiss (Member) commented Aug 16, 2023

Sorry for jumping in, but I got super curious about the simulation too. Actually, the previous script could result in fewer than 20 distinct shares per node, since samples can overlap within a node when the loop is capped at 20 iterations. I've built a small util to play with the params. It looks like 69 is the number of light nodes needed to withstand a withholding attack. Here are the results:

number of light nodes, probability
58 0
59 0.001
60 0.006
61 0.039
62 0.093
63 0.253
64 0.473
65 0.676
66 0.828
67 0.926
68 0.974
69 0.993
70 0.998

package main

import (
	"fmt"
	"math/rand"
)

func main() {
	k := 16
	shares := 20
	iter := 1000
	// shares needed to reconstruct under a square-shaped withholding attack
	enough := (2*k)*(2*k) - (k+1)*(k+1) + 1
	lights := 71

	for l := 50; l < lights; l++ {
		count := 0
		for i := 0; i < iter; i++ {
			uniq := make(map[int]int)
			// loop over nodes
			for n := 0; n < l; n++ {
				// elems stores the shares sampled by a single node
				elems := make(map[int]int)
				// loop until 20 unique shares are stored
				for {
					r := rand.Intn(k * k * 4)
					elems[r]++
					uniq[r]++
					if len(elems) >= shares {
						break
					}
				}
			}
			if len(uniq) >= enough {
				count++
			}
		}

		probability := float64(count) / float64(iter)
		fmt.Println(l, probability)
	}
}

Wondertan (Member) commented Aug 17, 2023

I see a few options to proceed here:

  • Keep the values from @distractedm1nd's simulations, properly documenting why they differ from the paper
  • Implement a more granular sampling test suite to reduce randomness while keeping the figures as close as possible to those described in the paper
  • Extend the test suite with a withholding attack via the source node - that is, make the source node serve only shares outside of the square shown in the diagram above (see the sketch below)

Personally, I prefer the 3rd option, as it tests the worst-case scenario, should be deterministic, and maps 1-to-1 to the paper. It should also be easier to implement than the second option, and the withholding blockstore could potentially be reused for swamp tests.
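
A rough sketch of what the 3rd option could look like, assuming hypothetical names (shareGetter, Share, GetShare) rather than the real celestia-node interfaces:

package sketch

import "errors"

type Share []byte

// shareGetter stands in for whatever interface the source node serves
// shares through.
type shareGetter interface {
	GetShare(row, col int) (Share, error)
}

var errWithheld = errors.New("share withheld by source")

// withholdingGetter wraps a getter and refuses to serve a (k+1) x (k+1)
// square (here the top-left corner), forcing the paper's worst case:
// reconstruction then needs (2k)^2 - (k+1)^2 + 1 shares.
type withholdingGetter struct {
	inner shareGetter
	k     int // the EDS is 2k x 2k
}

func (w withholdingGetter) GetShare(row, col int) (Share, error) {
	if row <= w.k && col <= w.k {
		return nil, errWithheld
	}
	return w.inner.GetShare(row, col)
}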

distractedm1nd (Collaborator, Author)

They are not necessarily different from the paper; they're just from a different part of the paper. I'll add another test that withholds so we can cover the third case, but let's do that in another PR?

musalbas (Member)

Agree that the 3rd option makes the most sense.

Wondertan (Member) commented Aug 17, 2023

> I'll just add another test that withholds so we can test the third case, but let's do that in another PR?

Thinking about this more: all the other reconstruction tests use similar ~69 figures, meaning they are not fully correct either. Also, I don't think "with" and "without" withholding attacks are separate test cases. It's more that the current reconstruction tests are not fully there yet, and I am delighted that you identified issues with them. I think we should just extend the current test suite so that the source node always withholds data, which fixes all of the tests. This can definitely be done in a separate PR, and this one is good as it is (besides the timeout nit), so approving.

renaynay merged commit 736763e into main on Aug 17, 2023
16 of 19 checks passed
renaynay deleted the disconnectedfullnodes branch on August 17, 2023 at 13:43
renaynay pushed a commit to renaynay/celestia-node that referenced this pull request on Aug 23, 2023
walldiss pushed a commit to walldiss/celestia-node that referenced this pull request on Sep 22, 2023 (cherry picked from commit 736763e)
walldiss pushed a commit that referenced this pull request on Sep 22, 2023 (cherry picked from commit 736763e)
walldiss pushed a commit to walldiss/celestia-node that referenced this pull request on Sep 22, 2023 (cherry picked from commit 736763e)
walldiss pushed a commit to walldiss/celestia-node that referenced this pull request on Sep 25, 2023 (cherry picked from commit 736763e)