
fix: TestShareAvailable_DisconnectedFullNodes #2560

Merged - 1 commit merged into main on Aug 17, 2023

Conversation

distractedm1nd (Collaborator)

TestShareAvailable_DisconnectedFullNodes had multiple bugs that caused it to pass as a false positive. The following issues were fixed:

  1. The parameters used no longer allow reconstruction from a single subnetwork, as they did before. This has been verified with a simulation: for s = 20 and k = 16, setting c to 32 gives a 99.5% chance of reconstruction, while 16 (a single subnetwork) gives a 0.5% chance. Before, both values of c gave a near-100% chance of reconstruction.
  2. Increased the timeout for the first reconstruction. There was a check verifying that reconstruction was not possible from a single subnetwork, but it only failed to reconstruct because the timeout was too short. Increasing the timeout while keeping the original parameters confirmed this.
  3. The network topology is now shaped before the first reconstruction (for example, the LNs are separated into subnets).
  4. The final reconstruction is now called in an error group (see the sketch below). The full nodes are codependent: for either one to reconstruct, the other must have sampled. Calling them sequentially fails, because the first full node needs the second full node to have sampled in order to get shares from it. This was another signal that the full nodes could previously reconstruct from a single subnetwork; it is now fixed by letting both FNs sample concurrently.
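
A minimal sketch of the error-group pattern from item 4, assuming placeholder names (fullNode, SharesAvailable, reconstructBoth) rather than the actual test types:

package sketch

import (
	"context"

	"golang.org/x/sync/errgroup"
)

// fullNode stands in for the real full-node handle used in the test.
type fullNode interface {
	SharesAvailable(ctx context.Context) error
}

// reconstructBoth runs both reconstructions concurrently; running them
// sequentially deadlocks, since each FN needs the other to have sampled
// before it can fetch shares from it.
func reconstructBoth(ctx context.Context, fn1, fn2 fullNode) error {
	errg, ctx := errgroup.WithContext(ctx)
	errg.Go(func() error { return fn1.SharesAvailable(ctx) })
	errg.Go(func() error { return fn2.SharesAvailable(ctx) })
	return errg.Wait()
}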

distractedm1nd added the kind:testing (Related to unit tests) and kind:fix (Attached to bug-fixing PRs) labels on Aug 11, 2023
distractedm1nd self-assigned this on Aug 11, 2023
Wondertan (Member)

Thanks for this, @distractedm1nd! Before we merge it, I would like to do a deeper review of the changes. Can this wait until Monday?
Cc @renaynay

Wondertan (Member) left a comment

A few questions, and some recollection of why certain things were done the way they were:

  1. I am trying to understand why 32. I presume you ran some simulations using the Python script from Mus, but this differs from the values in the paper: Table 1 states that for k=16 and s=20, c is 69 with 99% success. Why are these different?

  2. The quadrant timeout multiplied by the number of quadrants (4 quadrants * 2 axes = 8) should work, as that's the total time the Retriever spends anyway; I didn't want tests to run longer than necessary. The problem is in the init func at the top of the file, which sets the Retriever timeout to what turned out to be a low value. I think I didn't catch that due to the false-positive success.

  3. I remember choosing that order purposefully. In other tests, we start reconstruction first and form the topology afterwards, in order to test that FNs are capable of reconstructing with LNs that join after reconstruction starts. In practice, a FN may not have enough LNs to reconstruct, but more LNs would discover it, connect, and serve the missing shares. However, while this ordering is required for other tests, it is not required for this test scenario.

  4. Makes total sense and is a great find!

distractedm1nd (Collaborator, Author)

  1. I know - the problem here is that gamma, IIUC, refers to a withholding attack. In our case, we only need (k + 1)^2 + 1 shares to ensure full recovery (for k = 16, that is 290 of the (2k)^2 = 1024 shares). We should probably write tests that include malicious withholding to back this up.
  2. Okay, if the timeouts are minimal enough and aren't flaky, then I will revert.
  3. Ok
  4. <3

cc @musalbas

distractedm1nd (Collaborator, Author)

For reference, here is my sim:

import random

k = 16                         # EDS is 2k x 2k
c = 32                         # number of light nodes (16 = a single subnetwork)
s = 20                         # samples per light node
total_elements = (2 * k) ** 2
iterations = 10000
count = 0

for _ in range(iterations):
    unique_elements_picked = set()
    for _ in range(c):
        for _ in range(s):
            element = random.randint(0, total_elements - 1)
            unique_elements_picked.add(element)
    # reconstruction succeeds once more than (k + 1)^2 distinct shares are seen
    if len(unique_elements_picked) > (k + 1) ** 2:
        count += 1

probability = count / iterations

print(f"Estimated Probability: {probability}")
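
As a back-of-the-envelope check (not from the PR itself): the expected number of distinct shares hit by c nodes each sampling s times uniformly from N = (2k)^2 cells is N * (1 - (1 - 1/N)^(c*s)), which already explains the split between c = 16 and c = 32:

package main

import (
	"fmt"
	"math"
)

// expectedUnique returns the expected number of distinct shares covered by
// c light nodes, each drawing s uniform samples from an N = (2k)^2 square.
func expectedUnique(k, c, s int) float64 {
	n := float64((2 * k) * (2 * k))
	return n * (1 - math.Pow(1-1/n, float64(c*s)))
}

func main() {
	k := 16
	fmt.Println("needed:", (k+1)*(k+1))             // 289 shares in the random case
	fmt.Println("c=16:", expectedUnique(k, 16, 20)) // ~275 -> below the threshold
	fmt.Println("c=32:", expectedUnique(k, 32, 20)) // ~476 -> comfortably above
}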

musalbas (Member) commented Aug 15, 2023

So you're saying you only need 69 light nodes if the block producer does an explicit block withholding attack (see below), but in normal/random cases where they aren't explicitly withholding data in the shape of a square, you only need 32?

[image: block withholding attack diagram from the paper - the withheld shares form a (k+1) x (k+1) square]

If we wanted this to be super accurate, maybe we could control what shares the light nodes are sampling, so that the tests don't have any randomness to them.
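
One hypothetical way to do that (the names here are illustrative, not the real API): deal fixed, disjoint coordinate sets out to the light nodes, so the test's coverage is exact by construction:

package sketch

// fixedSamples deterministically partitions the (2k)^2 share coordinates
// among the light nodes, s coordinates each, so a test knows exactly which
// shares are covered (assumes nodes*s <= (2k)^2).
func fixedSamples(k, nodes, s int) [][]int {
	out := make([][]int, nodes)
	idx := 0
	for n := range out {
		for j := 0; j < s; j++ {
			out[n] = append(out[n], idx)
			idx++
		}
	}
	return out
}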

distractedm1nd (Collaborator, Author)

@musalbas That is my current understanding - running the simulation to collect at least (k + 1)^2 shares gives me the "correct" parameters: c = 16 always fails and c = 18 always passes, which is consistent with the sim.

I am assuming that gamma represents a case that either happens very rarely or arises when data is withheld in a clever way? Experimentally, using completely random shares, we only need (k + 1)^2.

musalbas (Member)

Yes, it assumes that the adversary is withholding the shares exactly in the shape of a square (see the diagram above). In that case you actually need (2k)^2 - (k+1)^2 + 1 shares - for k = 16, that is 1024 - 289 + 1 = 736.

The number of shares that light nodes need ultimately depends on exactly which shares are missing. Technically, the minimum they need is not even (k+1)^2, but k^2.

walldiss (Member) commented Aug 16, 2023

Sorry for jumping in, but I got super curious about the simulation too. Actually, the previous script could result in fewer than 20 distinct shares per node, since samples can overlap within a node when the loop is capped at 20 iterations. I've built a small util to play with the params. It looks like 69 is the number of light nodes needed to withstand a withholding attack. Here are the results:

number of light nodes, probability
58 0
59 0.001
60 0.006
61 0.039
62 0.093
63 0.253
64 0.473
65 0.676
66 0.828
67 0.926
68 0.974
69 0.993
70 0.998

package main

import (
	"fmt"
	"math/rand"
)

func main() {
	k := 16
	shares := 20
	iter := 1000
	// shares needed to reconstruct under a square-shaped withholding attack
	enough := (2*k)*(2*k) - (k+1)*(k+1) + 1
	lights := 71

	for l := 50; l < lights; l++ {
		count := 0
		for i := 0; i < iter; i++ {
			uniq := make(map[int]int)
			// loop over nodes
			for n := 0; n < l; n++ {
				// elems stores the shares sampled by a single node
				elems := make(map[int]int)
				// loop until 20 unique shares are stored
				for {
					r := rand.Intn(k * k * 4)
					elems[r]++
					uniq[r]++
					if len(elems) >= shares {
						break
					}
				}
			}
			if len(uniq) >= enough {
				count++
			}
		}

		probability := float64(count) / float64(iter)
		fmt.Println(l, probability)
	}
}

Wondertan (Member) commented Aug 17, 2023

I see a few options to proceed here:

  • Keep the values from @distractedm1nd's simulations, properly documenting why they differ from the paper
  • Implement a more granular sampling test suite to reduce randomness while keeping the figures as close as possible to those described in the paper
  • Extend the test suite with a withholding attack via the source node - that is, make the source node serve only shares outside of the square shown in the diagram above (see the sketch below)

Personally, I prefer the 3rd option, as it tests the worst-case scenario, should be deterministic, and maps 1-to-1 to the paper. It should also be easier to implement than the second option, and the withholding blockstore could potentially be reused for swamp tests.
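
A rough sketch of what the 3rd option could look like, assuming hypothetical names (shareGetter, Share, GetShare) rather than the real celestia-node interfaces:

package sketch

import "errors"

type Share []byte

// shareGetter stands in for whatever interface the source node serves
// shares through.
type shareGetter interface {
	GetShare(row, col int) (Share, error)
}

var errWithheld = errors.New("share withheld by source")

// withholdingGetter wraps a getter and refuses to serve a (k+1) x (k+1)
// square (here the top-left corner), forcing the paper's worst case:
// reconstruction then needs (2k)^2 - (k+1)^2 + 1 shares.
type withholdingGetter struct {
	inner shareGetter
	k     int // the EDS is 2k x 2k
}

func (w withholdingGetter) GetShare(row, col int) (Share, error) {
	if row <= w.k && col <= w.k {
		return nil, errWithheld
	}
	return w.inner.GetShare(row, col)
}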

distractedm1nd (Collaborator, Author)

They are not necessarily different from the paper; they're just from a different part of the paper. I'll add another test that withholds so we can cover the third case, but let's do that in another PR?

musalbas (Member)

Agree that the 3rd option makes the most sense.

Wondertan (Member) commented Aug 17, 2023

> I'll just add another test that withholds so we can test the third case, but let's do that in another PR?

Thinking about this more: all the other reconstruction tests use similar ~69 figures, meaning they are not fully correct either. Also, I don't think "with" and "without" withholding attacks are separate test cases. It's more that the current reconstruction tests are not fully there yet, and I am delighted that you identified issues with them. I think we should just extend the current test suite so that the source node always withholds data, which fixes all of the tests. This can definitely be done in a separate PR, and this one is good as it is (besides the timeout nit), so approving.

renaynay merged commit 736763e into main on Aug 17, 2023
16 of 19 checks passed
renaynay deleted the disconnectedfullnodes branch on August 17, 2023 at 13:43
renaynay pushed a commit to renaynay/celestia-node that referenced this pull request on Aug 23, 2023
walldiss pushed a commit to walldiss/celestia-node that referenced this pull request on Sep 22, 2023 (cherry picked from commit 736763e)
walldiss pushed a commit that referenced this pull request on Sep 22, 2023 (cherry picked from commit 736763e)
walldiss pushed a commit to walldiss/celestia-node that referenced this pull request on Sep 22, 2023 (cherry picked from commit 736763e)
walldiss pushed a commit to walldiss/celestia-node that referenced this pull request on Sep 25, 2023 (cherry picked from commit 736763e)