feat(full/availability): adding different withholding options to tests #2697

Open
wants to merge 7 commits into main

Conversation

@distractedm1nd (Collaborator) commented Sep 11, 2023

Closes #2592 by adding three different modes to full availability test nodes: FullyRecoverable, BarelyRecoverable, and Unrecoverable.

The BarelyRecoverable test case requires 230 light nodes for a 99% chance of recoverability:
[graph: probability of recoverability vs. number of light nodes]

BarelyRecoverable is used only in ConnectedFullNodes, not in DisconnectedFullNodes, because, as the graph shows, splitting the 230 nodes into two subnetworks no longer ensures that the data is unrecoverable from a single subnetwork.
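Below is a minimal, self-contained Monte Carlo sketch of the kind of estimate behind the 230-node figure. It is not the author's script (that is linked further down in the thread); it assumes k = 16 so the EDS is 32×32, 20 samples per light node, the (k+1)×(k+1) subsquare withheld except for the single share kept at (k, k), draws that hit withheld coordinates still counted as samples (the retry behaviour discussed below), and recoverability decided by iterative row/column erasure decoding.

package main

import (
	"fmt"
	"math/rand"
)

const (
	k       = 16      // original square size; the EDS is 2k x 2k
	size    = 2 * k
	samples = 20      // samples per light node
	trials  = 1000
)

// withheld reports whether (r, c) lies in the withheld (k+1) x (k+1) subsquare;
// the single share at (k, k) is kept so the square stays barely recoverable.
func withheld(r, c int) bool {
	return r <= k && c <= k && !(r == k && c == k)
}

// decodable runs iterative row/column erasure decoding: any row or column with
// at least k of its 2k shares available can be recovered in full; repeat until
// no further progress, then check whether the whole square is known.
func decodable(have *[size][size]bool) bool {
	for progress := true; progress; {
		progress = false
		for i := 0; i < size; i++ {
			row, col := 0, 0
			for j := 0; j < size; j++ {
				if have[i][j] {
					row++
				}
				if have[j][i] {
					col++
				}
			}
			if row >= k && row < size {
				for j := 0; j < size; j++ {
					have[i][j] = true
				}
				progress = true
			}
			if col >= k && col < size {
				for j := 0; j < size; j++ {
					have[j][i] = true
				}
				progress = true
			}
		}
	}
	for i := 0; i < size; i++ {
		for j := 0; j < size; j++ {
			if !have[i][j] {
				return false
			}
		}
	}
	return true
}

// trial simulates one round of sampling by the given number of light nodes,
// followed by reconstruction. Draws that land on withheld coordinates are
// counted against the node's sample budget but yield no share.
func trial(lightNodes int, rng *rand.Rand) bool {
	var have [size][size]bool
	for n := 0; n < lightNodes; n++ {
		for s := 0; s < samples; s++ {
			r, c := rng.Intn(size), rng.Intn(size)
			if !withheld(r, c) {
				have[r][c] = true
			}
		}
	}
	return decodable(&have)
}

func main() {
	rng := rand.New(rand.NewSource(1))
	// 115 approximates one of the two subnetworks when 230 nodes are split,
	// which is why BarelyRecoverable is only used in ConnectedFullNodes.
	for _, nodes := range []int{24, 115, 230} {
		ok := 0
		for i := 0; i < trials; i++ {
			if trial(nodes, rng) {
				ok++
			}
		}
		fmt.Printf("%d light nodes: recovered in %d/%d trials\n", nodes, ok, trials)
	}
}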

@ramin ramin self-requested a review September 13, 2023 09:14
ramin (Contributor) previously approved these changes Sep 13, 2023

@ramin left a comment:

I had a good in-person review of this that was convincing. I'm curious about the effectiveness of the "2 light-node subnet split brain" test, given how the removed blocks are provisioned in Unrecoverable, but maybe I'm imagining a test that doesn't exist yet (i.e. a full node transitioning from failing to reconstruct from 1 subnet to being able to with 2 subnets). This might just be my limited grasp.

share/availability/full/reconstruction_test.go — 6 resolved review threads (outdated)
}

err := errg.Wait()
require.NoError(t, err)
}

// TestShareAvailable_DisconnectedFullNodes asserts that two disconnected full
Member

Random thought: just curious whether it would still work the same for a similar topology, but with more than 2 full nodes.

Collaborator Author

It should, yes

Member

Good thought, we need to park this in an issue. @musalbas requested the ability to scale the test in such a way.

share/availability/test/withholding_adder.go — resolved review thread
@codecov-commenter

Codecov Report

Merging #2697 (416d090) into main (7c1f8f3) will decrease coverage by 0.05%.
The diff coverage is 83.33%.

@@            Coverage Diff             @@
##             main    #2697      +/-   ##
==========================================
- Coverage   51.37%   51.33%   -0.05%     
==========================================
  Files         159      159              
  Lines       10699    10709      +10     
==========================================
  Hits         5497     5497              
- Misses       4723     4734      +11     
+ Partials      479      478       -1     
Files Changed Coverage Δ
share/availability/full/testing.go 93.33% <83.33%> (-6.67%) ⬇️

... and 7 files with indirect coverage changes

// false, otherwise it acts the same but keeps the node at index (k + 1, k + 1).
func FillWithheldBS(t *testing.T, n int, bServ blockservice.BlockService, recoverable bool) *share.Root {
shares := sharetest.RandShares(t, n*n)
eds, err := AddAndWithholdShares(context.TODO(), shares, bServ, recoverable)
Member

Should we make this a context.WithCancel and then t.Cleanup the cancel?
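A minimal sketch of what that suggestion could look like inside FillWithheldBS (placement illustrative, not committed code):

ctx, cancel := context.WithCancel(context.Background())
// Tie the context's lifetime to the test instead of using context.TODO().
t.Cleanup(cancel)
eds, err := AddAndWithholdShares(ctx, shares, bServ, recoverable)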

{
name: "barely recoverable",
origSquareSize: 16,
lightNodes: 230, // 99% chance of recoverability
Member

[question]
Where is 230 coming from?

  • 24 light nodes (c) are enough when data is not withheld.
  • Table 1 in the paper states 69 light nodes (c) for the same k and s as in our test.
  • The graph below does not have it either.
    [graph from the paper omitted]
  • 2k × 2k = 1024 total shares in the square; 230 × 20 = 4600 max shares the clients would sample. Seems like too much.

Collaborator Author

Again, this case is not covered in the paper. Refer to my description; I can send you the script so you can verify it for yourself.

@Wondertan (Member) Sep 27, 2023

The description didn't describe where 230 is coming from, hence the question.

"Again, this case is not covered in the paper"
Okay, then I missed some discussion around it, because I don't remember such a fact. In my memory, the figures from the paper assume a withholding attack, as it is extensively described in the theorems in the same sections as all the plots I am referring to.

Please share the script; I am curious how it's constructed and would like to play with it and verify it. The number 230 feels like too much to me when 23 is enough to reconstruct the whole square when the data isn't withheld. Did you verify the script with @musalbas?

Collaborator Author

23 is not enough to reconstruct the whole square with high probability in this case, which matches my simulations; you can verify by running them yourself. Will send the script over Slack.

@Wondertan (Member) Sep 27, 2023

I meant 23/24 when data is not withheld, updated the comment.

Collaborator Author

https://gist.github.com/distractedm1nd/d2b306da13581b97bbbcefaa5b050738
I apologize for the shitty code, didn't expect it to be read by others :)

@walldiss (Member) Sep 29, 2023

The main loop flow in the Monte Carlo script looks familiar 🙂
If ds.sample(s) acts as a no-op (for example, for a share that is withheld), the given share is not actually sampled, but you count it as sampled here:

for len(uniqueSamples) < shares {
	s := randomSample(k)
	uniqueSamples[s]++
	ds.sample(s)
}

Could you please clarify why you count it as sampled when it is not? It looks like this could lead to fewer than 20 samples being assigned to each light node, making the required number of light nodes higher than it needs to be.
I tried to modify the code as follows, to ensure 20 samples for every light node:

for len(uniqueSamples) < shares {
	s := randomSample(k)
	if ds.sample(s) {
		uniqueSamples[s]++
	}
}

With the adjusted calculation, the probability graph shows that there is a decent chance of reconstruction in the withholding case with fewer than 230 light nodes:
[graph: reconstruction probability vs. number of light nodes, with adjusted sampling]

@distractedm1nd (Collaborator Author) Sep 29, 2023

Yes - I tried to match the current implementation, which does not change the coordinates to sample after a failure but just retries. Or is this incorrect?

In any case, seeing how much the numbers improve after this change indicates it should be what we are doing both in IPLD retrieval and here in the test. Thank you for looking deeper into this!

Member

I've checked light availability, and it seems like it acts like neither of the simulations. 😆 I'll drop my findings into a new issue to discuss further.

if err != nil {
return nil, err
}
// commit the batch to ipfs
Member

[uber nit]

Suggested change
// commit the batch to ipfs
// commit the batch to ipld

Member

Actually, this looks like a copy of AddShares. I'm curious why we copy it; the delete functionality in AddAndWithholdShares looks like it could be composed over the AddShares func.

Collaborator Author

It was only copied at first because the intention was to use some internal state, but the most recent version just iterates over the adder and ipld-translates the DAH.

So I will see if I can just call AddShares and do the iteration after.
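A rough sketch of the composition being discussed, assuming AddShares has the same shape as AddAndWithholdShares minus the recoverable flag and returns the EDS; deleteWithheldShares is a hypothetical helper, not an existing function:

func AddAndWithholdShares(
	ctx context.Context,
	shares []share.Share,
	bServ blockservice.BlockService,
	recoverable bool,
) (*rsmt2d.ExtendedDataSquare, error) {
	// Reuse the existing adder instead of copying its body.
	eds, err := AddShares(ctx, shares, bServ)
	if err != nil {
		return nil, err
	}
	// Hypothetical helper: walk the (k+1)x(k+1) withheld subsquare, translate
	// each withheld share to its block CID via the DAH, and delete it from
	// bServ, keeping the share at (k+1, k+1) when recoverable is true.
	if err := deleteWithheldShares(ctx, bServ, eds, recoverable); err != nil {
		return nil, err
	}
	return eds, nil
}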

share/availability/test/withholding_adder.go — resolved review thread
// FullyRecoverable makes all EDS shares available when filling the
// blockservice.
FullyRecoverable Recoverability = iota
// BarelyRecoverable withholds the (k + 1)^2 subsquare of the 2k^2 EDS,
Member

[doc uber nit]
re: subsquare. We call them quadrants, and just to decrease confusion we should continue to call them that.

Collaborator Author

I do not mean quadrants; look at the image in the description to see what I am referring to.

share/availability/full/testing.go — resolved review thread
@@ -26,10 +26,39 @@ func GetterWithRandSquare(t *testing.T, n int) (share.Getter, *share.Root) {
return getter, availability_test.RandFillBS(t, n, bServ)
}

type Recoverability int64
Member

[discussion]

Dropping the alternative naming I have in mind; nothing actionable here:
DataWithholding (FullyWithheld, PartiallyWithheld, Unwithheld)

Collaborator Author

Agree with this
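
For illustration, the proposed alternative could look roughly like this next to the existing iota-based type (spelling normalized; nothing like this was committed in this PR):

type DataWithholding int64

const (
	// Unwithheld makes all EDS shares available when filling the blockservice.
	Unwithheld DataWithholding = iota
	// PartiallyWithheld withholds the (k+1)x(k+1) subsquare except for one
	// share, keeping the data barely recoverable.
	PartiallyWithheld
	// FullyWithheld withholds the entire (k+1)x(k+1) subsquare, making the
	// data unrecoverable.
	FullyWithheld
)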

net.Connect(lights2[i].ID(), full2.ID())
net.Disconnect(lights2[i].ID(), full1.ID())
}
// start connection lights with sources
Member

Suggested change
// start connection lights with sources
// connect lights with sources

sampleAmount: 20,
recoverability: BarelyRecoverable,
expectedFailure: true,
},
Member

Why do we omit the unrecoverable case here?

Comment on lines +157 to +162
// NOTE: This test contains cases for barely recoverable but
// DisconnectedFullNodes does not. The reasoning for this is that
// DisconnectedFullNodes has the additional contstraint that the data
// should not be reconstructable from a single subnetwork, while this
// test only tests that the data is reconstructable once the subnetworks
// are connected.
Member

Suggested change
// NOTE: This test contains cases for barely recoverable but
// DisconnectedFullNodes does not. The reasoning for this is that
// DisconnectedFullNodes has the additional contstraint that the data
// should not be reconstructable from a single subnetwork, while this
// test only tests that the data is reconstructable once the subnetworks
// are connected.
// NOTE: This test contains cases for barely recoverable but
// DisconnectedFullNodes does not. The reasoning for this is that
// DisconnectedFullNodes has the additional constraint that the data
// should not be reconstructible from a single subnetwork, while this
// test only tests that the data is reconstructible once the subnetworks
// are connected.

As per Grammarly

Member

I need help understanding the reasoning here. Why would DisconnectedFullNodes not have a BarelyRecoverable case? BarelyRecoverable is the worst-case success scenario that, IMO, we should test: two subnetworks sample the available minimum and the fulls collectively reconstruct.

Collaborator Author

Please refer to my description again - we also discussed this internally in Slack. In order to have 99% confidence with BarelyRecoverable in DisconnectedFullNodes, you cannot use 2 subnetworks, as each subnetwork has too high a probability of the data being reconstructible. This is fine, as it is exactly what ConnectedFullNodes intends to test: that the data is available once the full nodes are connected, WITHOUT the assumption that it is unrecoverable from the independent subnetworks.

@Wondertan (Member) Sep 27, 2023

Sorry if this is being repeated. I still would like to get to the bottom of this.

In order to have 99% confidence with BarelyRecoverable in DisconnectedFullNodes, you cannot use 2 subnetworks, as each subnetwork has a too high probability of the data being reconstructible

Isn't that a sign that 230 is too big a number, if the 115-node subnetwork has a very high chance of reconstructing? In other words, is there a number that would give ~0% for a single subnetwork but 99% for the whole network? Is there a way to modify the script to support that? Note that my intuition for probabilities is not good.

{
name: "fully recoverable",
origSquareSize: 16,
lightNodes: 24, // ~99% chance of recoverability
Member

So these tests are flaky by nature, which is not good for CI. I did some quick math with the current suite, and it seems the cases combined have a ~3% chance to fail: 0.99 * 0.99 * (1 - 0.007) * (1 - 0.003) ≈ 0.9703.

We should do something about it, but I also think it could be resolved in a separate issue. One way to deal with it would be to move all flaky tests into a separate CI runner that does not block merging.
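
For reference, the arithmetic behind that estimate, with the per-case numbers taken from the comment above rather than re-derived:

package main

import "fmt"

func main() {
	// Two cases with ~99% success and two with ~0.7% and ~0.3% failure chances.
	pass := 0.99 * 0.99 * (1 - 0.007) * (1 - 0.003)
	fmt.Printf("combined pass probability: %.4f\n", pass) // prints 0.9703
}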

Collaborator Author

Good point, and we can adjust the light node count to sit further from the threshold to decrease flakiness further.

Labels: area:ipld (IPLD plugin), kind:testing (Related to unit tests)

Successfully merging this pull request may close these issues.

tests(availability): Withholding attacks in availability test suite
6 participants