Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix race condition when executing parallel tests, passing context down and back #292

Conversation

phisco
Copy link
Contributor

@phisco phisco commented Jul 1, 2023

This is another approach to solve the same problem as #265 and was developed with the help of @maruina.

Supersedes #291 as I think is a better approach w.r.t. the one proposed there.

This patch reduces the number of accesses to the context in the testEnv to only the ones that are strictly necessary and that should be synchronous. So, testEnv still contains a ctx, but is only to be used as a starting point and won't be updated except after the setup and finish phase.

According to the following logic:

  • If Test is used, the same logic of now applies. So the context is passed to all features being run sequentially and the most updated version is then returned to the caller.
  • If TestInParallel is used (and parallel tests are enabled), the context is first populated by the BeforeEachTest functions and then a dedicated child of the parent context is provided to each test running in parallel, which is then discarded and only the original parent context is passed to the AfterEachTest functions and back into the environment. This is actually fixing a race condition in the previous implementation of TestInParallel.
  • If t.Parallel is used to run one or more features either using Test or TestInParallel, the same logic above applies, given that each caller will get its own context back.

The only caveat is that the context is that the Finish phase will only be able to see the context from the Setup phase and not the one from the features themselves.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jul 1, 2023
@k8s-ci-robot k8s-ci-robot added sig/testing Categorizes an issue or PR as relevant to SIG Testing. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jul 1, 2023
@k8s-ci-robot
Copy link
Contributor

Hi @phisco. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jul 1, 2023
Copy link
Contributor

@vladimirvivien vladimirvivien left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@phisco, this looks promising. Thanks for jumping in.
I left couple of comments about returning context from env.Test() method.

Also, I would suggest to focus on the following location where the test is actually ran. I would do the followings

Change the name of the function param from t to new:

t.Run(s.Name, func(newT *testing.T) {

This is to doubly-ensure that the right t variable is captured in the closure. I would also do the same here.

@@ -313,8 +321,8 @@ func (e *testEnv) TestInParallel(t *testing.T, testFeatures ...types.Feature) {
//
// BeforeTest and AfterTest operations are executed before and after
// the feature is tested respectively.
func (e *testEnv) Test(t *testing.T, testFeatures ...types.Feature) {
e.processTests(t, false, testFeatures...)
func (e *testEnv) Test(t *testing.T, testFeatures ...types.Feature) context.Context {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a breaking change isn't it? It looks additive so, it may not be. But we should check.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, the context coming out of Test is never used is it ? Is this change to allow multiple Test calls to chain together ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, we could actually omit it, I added it because I felt it could be useful for users to inspect the content of the context after calling it, but now that I think about it, it's more an implementation detail of testenv rather than of the Environment

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@phisco funny thing is while researching the race condition issue, i ran into the need to inspect the context. However, I would lean toward not leaving the signature as is to avoid breakage.

Maybe a separate PR would be useful to update the signature of the Test and TestInParallel methods. That way, it's effect can be studied (or roll back if needed) with minimal disruption.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vladimirvivien, I did some checks, adding a returned context to the signatures would not be breaking change, maybe some linter could complain, but it will build without any issue. However, not returning it, would mean there is no way to inspect the content of the ctx after a run as we were saying, which would mean that we would have no way to check the following test cases:

So, I tried adding two internal testEnv.test/testInParallel functions that are returning the ctx, which are then called directly by their public versions Test/TestInParalle, explicitly discarding the context returned, so that we can use those in tests if we need to check the ctx afterward, but they won't be accessible to users.

I would probably suggest adding the return ctx to the interface, but I totally understand not wanting to change the public API at least as part of this PR, so this could be a good intermediate step while we take a decision around that.

@harshanarayana
Copy link
Contributor

Thanks @phisco and @maruina for working through this fix. I will take a look at the before EOD tomorrow.

@phisco
Copy link
Contributor Author

phisco commented Jul 4, 2023

Thanks @harshanarayana, I'm on vacation, so I'm going to implement the changes suggested by @vladimirvivien by end of the week: avoid returning the context from Test and TesInParallel, and rename the two testing.T parameters. If you have any other request, let me know 🙏

@phisco phisco force-pushed the fix/tParallel-passing-almost-everywhere branch from e12794e to bbd1faf Compare July 9, 2023 09:26
@phisco
Copy link
Contributor Author

phisco commented Jul 9, 2023

@vladimirvivien @harshanarayana, I implemented the changes suggested, let me know what you think 🙏

@phisco phisco force-pushed the fix/tParallel-passing-almost-everywhere branch from bbd1faf to da401a8 Compare July 9, 2023 09:28
phisco and others added 2 commits July 9, 2023 11:28
This patch reduces the number of accesses to the context in the testEnv
to only the ones that are strictly necessary and that should be
synchronous. So, testEnv still contains a ctx, but is only to be used as
a starting point and won't be updated except after the setup and finish
phase.

According to the following logic:

- If `Test` is used, the same logic of now applies. So the context is
  passed to all features being run sequentially and the most updated
  version is then returned to the caller.
- If `TestInParallel` is used (and parallel tests are enabled), the
  context is first populated by the `BeforeEachTest` functions and then
  a dedicated child of the parent context is provided to each test
  running in parallel, which is then discarded and only the original
  parent context is passed to the `AfterEachTest` functions and back
  into the environment. This is actually fixing a race condition in the
  previous implementation of `TestInParallel`.
- If `t.Parallel` is used to run one or more features either using
  `Test` or `TestInParallel`, the same logic above applies, given that
  each caller will get its own context back.

The only caveat is that the context is that the Finish phase will only
be able to see the context from the Setup phase and not the one from the
features themselves.

Co-authored-by: Matteo Ruina <matteo.ruina@datadoghq.com>
Signed-off-by: Philippe Scorsolini <p.scorsolini@gmail.com>
Signed-off-by: Philippe Scorsolini <p.scorsolini@gmail.com>
@phisco phisco force-pushed the fix/tParallel-passing-almost-everywhere branch from da401a8 to a57988a Compare July 9, 2023 09:28
@vladimirvivien
Copy link
Contributor

@phisco Apologies, I hadn't look at this all week.
I am Ok with returning the context for Test() and TestInParallel, and thank you for the insightful research.
Additive changes to an API is usually Ok and this one won't break previous code. Plus I think it will be useful to be able to get to the Context which means the result of a previous env.Test() can be chained with another one.

@ShwethaKumbla
Copy link
Member

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jul 10, 2023
@phisco phisco force-pushed the fix/tParallel-passing-almost-everywhere branch from a57988a to 25cc833 Compare July 10, 2023 06:23
@phisco
Copy link
Contributor Author

phisco commented Jul 10, 2023

@vladimirvivien, no problem, I agree it's better to return the context, I dropped a57988a 🙏

@maruina
Copy link
Contributor

maruina commented Jul 11, 2023

@vladimirvivien what do you think it's missing for merging this PR?

@harshanarayana
Copy link
Contributor

@phisco @maruina @vladimirvivien This is a wonderful change. I don't really have any change request for the changes pushed into this PR.

Thanks for adding some great tests to avoid future regressions of race conditions @phisco

@vladimirvivien
Copy link
Contributor

Apologies for the delay (busy).
I am good with it.

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 23, 2023
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: phisco, vladimirvivien

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 23, 2023
@k8s-ci-robot k8s-ci-robot merged commit 5fa0a64 into kubernetes-sigs:main Jul 23, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants