tests: Migrate member add tests to common framework #14281

clement2026 · 2022-07-27T08:33:21Z

Conctext #13637

serathius · 2022-07-27T09:17:27Z

tests/e2e/ctl_v3_member_test.go

-func TestCtlV3MemberAddPeerTLS(t *testing.T) {
-	testCtl(t, memberAddTest, withCfg(*e2e.NewConfigPeerTLS()))
-}
+func TestCtlV3MemberAdd(t *testing.T)           { testCtl(t, memberAddTest) }
 func TestCtlV3MemberAddForLearner(t *testing.T) { testCtl(t, memberAddForLearnerTest) }


Isn't this also covered?

The underlying implementation of TestCtlV3MemberAdd and TestCtlV3MemberAddForLearner use member add and member add --learner command.

I thought it would be valuable to keep them in the e2e testing, to cover the real command usage scenarios.

Do we need to keep them?

Let me know if I understood you. New framework runs e2e test, but runs etcdctl only with --format=json, so we never real command usage secnario etcdctl member add, but flavor that uses json?.

Oh, I've got the point.

I thought our common framework doesn't run e2e test. Now I know it does.

I'm going to remove these redundant functions.

I thought our common framework doesn't run e2e test. Now I know it does.

Common framework is an interface for testing that covers both e2e and integration. Tests in tests/common that use common framework will be automatically run as both e2e and integration tests. Write a test once and double the benefit.

Instead of having two separate frameworks and redundant tests, we want to just have one framework with shared tests. For now we are migrating tests/e2e directory, however if you want you can look into tests/integration directory and check if some test scenarios are already covered and can be removed.

It's a good point that we don't test exact output of etcdctl member add, however that's much lower on our priority than testing e2e process.

codecov-commenter · 2022-07-27T09:17:58Z

Codecov Report

Merging #14281 (d05c894) into main (012fc51) will decrease coverage by 0.27%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##             main   #14281      +/-   ##
==========================================
- Coverage   75.49%   75.22%   -0.28%     
==========================================
  Files         457      457              
  Lines       37084    37084              
==========================================
- Hits        27996    27895     -101     
- Misses       7341     7427      +86     
- Partials     1747     1762      +15

Flag	Coverage Δ
all	`75.22% <ø> (-0.28%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files	Coverage Δ
server/auth/simple_token.go	`80.00% <0.00%> (-8.47%)`	⬇️
client/pkg/v3/tlsutil/tlsutil.go	`83.33% <0.00%> (-8.34%)`	⬇️
client/pkg/v3/fileutil/purge.go	`66.03% <0.00%> (-7.55%)`	⬇️
client/v3/concurrency/mutex.go	`61.64% <0.00%> (-5.48%)`	⬇️
raft/rafttest/node.go	`95.00% <0.00%> (-5.00%)`	⬇️
server/storage/mvcc/watchable_store.go	`84.42% <0.00%> (-4.72%)`	⬇️
server/etcdserver/api/v3rpc/interceptor.go	`73.43% <0.00%> (-4.17%)`	⬇️
server/proxy/grpcproxy/watch.go	`92.48% <0.00%> (-4.05%)`	⬇️
server/etcdserver/txn/util.go	`75.47% <0.00%> (-3.78%)`	⬇️
client/pkg/v3/testutil/recorder.go	`76.27% <0.00%> (-3.39%)`	⬇️
... and 21 more

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

serathius · 2022-07-27T09:20:55Z

Found errror:

member_test.go:94: could not add member, err: match not found. Set EXPECT_DEBUG for more info Err: read /dev/ptmx: input/output error, last lines:

Let me know if you need help with debugging.

clement2026 · 2022-07-27T11:53:45Z

This clue shows that a member of a cluster starts serving client before it's connected to all peers.

/home/runner/work/etcd/etcd/bin/etcd-22812: {"level":"warn","ts":"2022-07-27T08:46:33.653Z","caller":"etcdserver/server.go:1351","msg":"rejecting member add request; local member has not been connected to all peers, reconfigure breaks active quorum","local-member-id":"a8f3458387986d","requested-member-add":"{ID:557cfa84711678b8 RaftAttributes:{PeerURLs:[http://240.0.0.0:65535/] IsLearner:false} Attributes:{Name: ClientURLs:[]}}","error":"etcdserver: unhealthy cluster"}
[14847](https://github.com/etcd-io/etcd/runs/7536399620?check_suite_focus=true#step:5:14848)

To add or to remove a member from a cluster, a client need to wait a little bit. I've reproduced it on my computer.

To correct the test cases, I'm gonna keep trying MemberAdd() for a few seconds.

serathius · 2022-07-27T11:58:17Z

Hmm, maybe we should use clus.WaitLeader()?

serathius · 2022-07-27T12:01:23Z

One potential problem is adding members to 1 node cluster. In that situation we grow from 1 -> 2 member cluster. This means that quorum is now 2 member, however in tests we really don't add working member (just a backhole url). This means that 1 node cluster will become unavailable.

On the other hand 3 node cluster grows to 4 members, so the quorum stays the same (2 members). Please try removing test cases that use cluster of size 1.

clement2026 · 2022-07-27T14:27:53Z

Hmm, maybe we should use clus.WaitLeader()?

Wow, this is an amazing function, it looks like what I need. I'll give it try.

clement2026 · 2022-07-27T14:33:20Z

One potential problem is adding members to 1 node cluster. In that situation we grow from 1 -> 2 member cluster. This means that quorum is now 2 member, however in tests we really don't add working member (just a backhole url). This means that 1 node cluster will become unavailable.

On the other hand 3 node cluster grows to 4 members, so the quorum stays the same (2 members). Please try removing test cases that use cluster of size 1.

sg.

serathius · 2022-07-27T14:50:58Z

Hmm, maybe we should use clus.WaitLeader()?

Wow, this is an amazing function, it looks like what I need. I'll give it try.

I think cluster create should already call it so I would look into second suggestion with skipping test cases where cluster size is equal 1.

clement2026 · 2022-07-28T07:35:35Z

As I went through the code base, I found that clus.WaitLeader() is called in many integration test cases, not in any e2e test case or the process of cluster creation.
Neither in old e2e test code nor in common framework e2e code, an equivalent function to clus.WaitLeader() was found. 😟

clement2026 · 2022-07-28T07:50:45Z

Hmm, maybe we should use clus.WaitLeader()?

Wow, this is an amazing function, it looks like what I need. I'll give it try.

I think cluster create should already call it so I would look into second suggestion with skipping test cases where cluster size is equal 1.

Skipping test cases where cluster size is equal 1 is not likely going to make the tests work. The failed test cases at the moment are the ones with cluster size > 1.

When cluster size=1, the single node accepts MemberAdd request immediately and responses OK, ignoring the fact that the newly added backhole peer url would make the cluster unavailable.

serathius · 2022-07-28T15:08:25Z

I would recommend to rebase as there was a change that greatly reduces test flakiness #14283

clement2026 · 2022-07-28T16:17:02Z

Maybe we need to implement a clus.WaitLeader() function in e2e cluster, and expose it in Cluster interface. What do you think?

type Cluster interface {
	Members() []Member
        // HERE
	WaitLeader() int
	Client() Client
	Close() error
}

type Member interface {
	Client() Client
	Start() error
	Stop()
}

type Client interface {
	Put(key, value string, opts config.PutOptions) error
	Get(key string, opts config.GetOptions) (*clientv3.GetResponse, error)
	Delete(key string, opts config.DeleteOptions) (*clientv3.DeleteResponse, error)
	Compact(rev int64, opts config.CompactOption) (*clientv3.CompactResponse, error)
	Status() ([]*clientv3.StatusResponse, error)
	HashKV(rev int64) ([]*clientv3.HashKVResponse, error)
	Health() error
	Defragment(opts config.DefragOption) error
	AlarmList() (*clientv3.AlarmResponse, error)
	AlarmDisarm(alarmMember *clientv3.AlarmMember) (*clientv3.AlarmResponse, error)

ahrtr · 2022-08-08T21:05:10Z

Do you need to wait for #14304 to be merged and then update this PR?

clement2026 · 2022-08-09T03:33:31Z

Do you need to wait for #14304 to be merged and then update this PR?

Yes. This PR is blocked by #14304. Should we label this PR or close it at the moment?

ahrtr · 2022-08-09T03:59:51Z

Do you need to wait for #14304 to be merged and then update this PR?

Yes. This PR is blocked by #14304. Should we label this PR or close it at the moment?

Once #14304 is merged, you can rebase this PR.

ahrtr · 2022-08-15T00:16:38Z

tests/common/member_test.go

+					var addResp *clientv3.MemberAddResponse
+					var err error
+					if tc.learner {
+						addResp, err = cc.MemberAddAsLearner("newmember", []string{blackHolePeerUrl})


Why not to use a real peerURL something like http://localhost:xxx instead of http://240.0.0.0:65535?

http://240.0.0.0:65535 is used to avoid "Peer URLs already exists" error. If http://localhost:xxx happens to be the peer-url of a started etcd instance, member add operation would fail due to that error. I considered to use http://localhost:xxx with a unused random port such as 2777 and 15981, but I'm slightly worried about port conflict in the future.

clement2026 · 2022-08-18T09:36:46Z

The E2E / test (linux-386-e2e) (pull_request) check failed due to an unhealthy cluster error

/home/runner/work/etcd/etcd/bin/etcd-22730: {"level":"warn","ts":"2022-08-14T11:45:32.476Z","caller":"etcdserver/server.go:1351","msg":"rejecting member add request; local member has not been connected to all peers, reconfigure breaks active quorum","local-member-id":"d70245c72180528a","requested-member-add":"{ID:ccf9af529e52be7e RaftAttributes:{PeerURLs:[http://240.0.0.0:65535/] IsLearner:false} Attributes:{Name: ClientURLs:[]}}","error":"etcdserver: unhealthy cluster"}
[14842](https://github.com/etcd-io/etcd/runs/7826140582?check_suite_focus=true#step:5:14843)

MemberAdd function worked fine in integration test, but failed in e2e test. That's because integration cluster has --strict-reconfig-check disabled by default and e2e cluster has `--strict-reconfig-check' enabled by default.

Therefore, in e2e cluster, MemberAdd function has to wait 5 seconds before e2e cluster isConnectedFullySince, or an unhealthy cluster error would occur.

etcd/server/etcdserver/server.go

Lines 1350 to 1358 in d05c894

    
           if !isConnectedFullySince(s.r.transport, time.Now().Add(-HealthInterval), s.MemberId(), s.cluster.VotingMembers()) { 
        
           	lg.Warn( 
        
           		"rejecting member add request; local member has not been connected to all peers, reconfigure breaks active quorum", 
        
           		zap.String("local-member-id", s.MemberId().String()), 
        
           		zap.String("requested-member-add", fmt.Sprintf("%+v", memb)), 
        
           		zap.Error(errors.ErrUnhealthy), 
        
           	) 
        
           	return errors.ErrUnhealthy 
        
           }

A PR #14360 was opened to discuss this.

ahrtr · 2022-09-05T07:03:13Z

Please resolve the conflict.

serathius · 2022-09-05T12:36:17Z

tests/common/member_test.go

+	}
+}
+
+func checkAddResp(t *testing.T, addResp *clientv3.MemberAddResponse, err error) {


This checks error and validates response. Could we maybe split this:

rename checkAddResp to validateMemberAdd and remove err argument.

for error checking we can use require.NoError(err).

Thanks to the suggestions! I'd love to do so , as they make the code clearer.

serathius · 2022-09-05T12:38:22Z

tests/common/member_test.go

+				cc := clus.Client()
+
+				testutils.ExecuteUntil(ctx, t, func() {
+					clus.WaitLeader(t)


Do we need WaitLeader here? I haven't seen it used in other tests as I would expect that NewCluster returns me cluster after leader is established.

No. It turns out we don't need WaitLeader here. I'll remove them.

The first time I saw etcdserver: unhealthy cluster error after calling MemberAdd, I thought it was because all etcd members did not agree on the same leader. Therefore, I added WaitLeader function to the Cluster interface. But the etcdserver: unhealthy cluster error did not disappear. Finally I found it was caused by --strict-reconfig-check.

serathius · 2022-09-05T12:41:53Z

tests/common/member_test.go

+					if nc.config.ClusterSize == 1 {
+						checkAddResp(t, addResp, err)


I would expect reverse result, adding invalid (localhost:123 will never respond) member should fail in single node cluster as you cannot establish quorum. From 2 members only 1 is alive, but quorum is 2.

Actually, adding an invalid member succeeds in the original e2e test. I also tried this in my terminal, it succeeded too. Is this a long-standing bug?🤣

serathius · 2022-09-05T12:43:29Z

tests/common/member_test.go

+	}
+}
+
+func TestMemberAdd_SleetHealthInterval(t *testing.T) {


Suggested change

func TestMemberAdd_SleetHealthInterval(t *testing.T) {

func TestMemberAdd_SleepHealthInterval(t *testing.T) {

serathius · 2022-09-05T12:43:50Z

tests/common/member_test.go

+	}
+}
+
+func TestMemberAdd_SleetHealthInterval(t *testing.T) {


I don't see similar test removed in this PR. Where this test has come from?

This is a scenario not covered before:

In the first 5 seconds(HealthInterval) after the creation of a cluster(with --strict-reconfig-check enabled), any call to MemberAdd and MemberRemove should return an etcdserver: unhealthy cluster error.

As the first 5 seconds(HealthInterval) has passed, call to these functions should be successful.

This behaviour is controlled by the isConnectedFullySince function:

etcd/server/etcdserver/server.go

Lines 1350 to 1358 in d05c894

if !isConnectedFullySince(s.r.transport, time.Now().Add(-HealthInterval), s.MemberId(), s.cluster.VotingMembers()) {

lg.Warn(

"rejecting member add request; local member has not been connected to all peers, reconfigure breaks active quorum",

zap.String("local-member-id", s.MemberId().String()),

zap.String("requested-member-add", fmt.Sprintf("%+v", memb)),

zap.Error(errors.ErrUnhealthy),

)

return errors.ErrUnhealthy

}

ok, please rename/rewrite test so the reason for the behavior is known. For example TestMemberAdd_BeforeConnectedToAllPeers or TestMemberAdd_WaitForQuorum.

serathius · 2022-09-05T12:50:10Z

tests/common/member_test.go

+					clus.WaitLeader(t)
+					var addResp *clientv3.MemberAddResponse
+					var err error
+					time.Sleep(etcdserver.HealthInterval)


Why adding sleep here makes a difference?

serathius · 2022-09-05T13:52:57Z

Overall note, instead of having 3 different test could merge them into one table test. For me the scenario in all those tests is almost the same:

Create cluster
Call MemberAdd
Check whether it was successful or not

We could just add expectError, strictReconfigCheck and waitForQuorum as fields on test case struct.

clement2026 · 2022-09-05T14:08:31Z

Overall note, instead of having 3 different test could merge them into one table test. For me the scenario in all those tests is almost the same:

Create cluster

Call MemberAdd

Check whether it was successful or not

We could just add expectError, strictReconfigCheck and waitForQuorum as fields on test case struct.

I was worried about making a single test function too large and complex. I'll try and see if I can handle this.

Signed-off-by: Clark <fwyongxing@gmail.com>

clement2026 · 2022-09-06T03:41:42Z

Overall note, instead of having 3 different test could merge them into one table test. For me the scenario in all those tests is almost the same:

Create cluster

Call MemberAdd

Check whether it was successful or not

We could just add expectError, strictReconfigCheck and waitForQuorum as fields on test case struct.

I followed your suggestion and it works great! I've force pushed. Please take a look.

serathius reviewed Jul 27, 2022

View reviewed changes

clement2026 force-pushed the migrate-member-add-tests branch from cf7ca3a to bcb1923 Compare July 28, 2022 14:32

clement2026 force-pushed the migrate-member-add-tests branch from bcb1923 to 76c57f8 Compare July 28, 2022 16:01

clement2026 force-pushed the migrate-member-add-tests branch 2 times, most recently from 47c487a to d2ed176 Compare August 2, 2022 07:25

clement2026 force-pushed the migrate-member-add-tests branch from d2ed176 to d05c894 Compare August 14, 2022 11:34

clement2026 mentioned this pull request Aug 14, 2022

Unify testing framework #13637

Open

ahrtr reviewed Aug 15, 2022

View reviewed changes

clement2026 mentioned this pull request Aug 18, 2022

tests: unify --strict-reconfig-check config of common framework #14360

Merged

clement2026 force-pushed the migrate-member-add-tests branch from d05c894 to 8b4b70a Compare September 5, 2022 12:04

serathius reviewed Sep 5, 2022

View reviewed changes

clement2026 force-pushed the migrate-member-add-tests branch from 8b4b70a to 34d9b03 Compare September 5, 2022 12:36

serathius reviewed Sep 5, 2022

View reviewed changes

clement2026 force-pushed the migrate-member-add-tests branch from ae04eab to 236f836 Compare September 5, 2022 18:57

tests: Migrate member add tests to common framework

fcc076f

Signed-off-by: Clark <fwyongxing@gmail.com>

clement2026 force-pushed the migrate-member-add-tests branch from 236f836 to fcc076f Compare September 6, 2022 03:07

serathius approved these changes Sep 6, 2022

View reviewed changes

serathius merged commit f0c95ca into etcd-io:main Sep 6, 2022

clement2026 deleted the migrate-member-add-tests branch September 6, 2022 10:14

		if nc.config.ClusterSize == 1 {
		checkAddResp(t, addResp, err)

	func TestMemberAdd_SleetHealthInterval(t *testing.T) {
	func TestMemberAdd_SleepHealthInterval(t *testing.T) {

	if !isConnectedFullySince(s.r.transport, time.Now().Add(-HealthInterval), s.MemberId(), s.cluster.VotingMembers()) {
	lg.Warn(
	"rejecting member add request; local member has not been connected to all peers, reconfigure breaks active quorum",
	zap.String("local-member-id", s.MemberId().String()),
	zap.String("requested-member-add", fmt.Sprintf("%+v", memb)),
	zap.Error(errors.ErrUnhealthy),
	)
	return errors.ErrUnhealthy
	}

tests: Migrate member add tests to common framework #14281

tests: Migrate member add tests to common framework #14281

Conversation

clement2026 commented Jul 27, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

serathius Jul 27, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

codecov-commenter commented Jul 27, 2022 • edited Loading

Codecov Report

serathius commented Jul 27, 2022 • edited Loading

clement2026 commented Jul 27, 2022

serathius commented Jul 27, 2022 • edited Loading

serathius commented Jul 27, 2022

clement2026 commented Jul 27, 2022

clement2026 commented Jul 27, 2022

serathius commented Jul 27, 2022

clement2026 commented Jul 28, 2022

clement2026 commented Jul 28, 2022

serathius commented Jul 28, 2022

clement2026 commented Jul 28, 2022

ahrtr commented Aug 8, 2022

clement2026 commented Aug 9, 2022

ahrtr commented Aug 9, 2022

Choose a reason for hiding this comment

Choose a reason for hiding this comment

clement2026 commented Aug 18, 2022 • edited Loading

ahrtr commented Sep 5, 2022

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

serathius Sep 5, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

serathius commented Sep 5, 2022

clement2026 commented Sep 5, 2022

clement2026 commented Sep 6, 2022

clement2026 commented Jul 27, 2022 •

edited

Loading

serathius Jul 27, 2022 •

edited

Loading

codecov-commenter commented Jul 27, 2022 •

edited

Loading

serathius commented Jul 27, 2022 •

edited

Loading

serathius commented Jul 27, 2022 •

edited

Loading

clement2026 commented Aug 18, 2022 •

edited

Loading

serathius Sep 5, 2022 •

edited

Loading