Correct CRD operator delete predicate, increase test timeout #8715
Conversation
@gracegrimwood, would you be willing to try building this PR in your environment to confirm it fixes your issue?
Hi Mike, the good news is that it seems to be failing a bit less! (Might be coincidence, might not, but I'm counting it as a win 😁) The remaining failures are:
- TimeoutException on io.strimzi.operator.common.operator.resource.StrimziPodSetCrdOperatorIT
- TimeoutException on io.strimzi.operator.cluster.operator.assembly.KafkaConnectMigrationTest (this one was actually quite long; the full test output is in this gist)
- Failure to create Cluster Roles in cluster-operator tests (full test output is in this gist)
@gracegrimwood, the second commit should resolve the remaining timeout.
Indeed, it seems to have resolved that issue. I'm now seeing a lot of a different failure instead. A gist with the full output of 3 attempts at the failing test is here.
It seems as though the timeout and the crash may share a root cause in the mock server.
I am no longer seeing the crash (or the timeout) with additional locking in the mock server [1]. Were you also still getting errors creating the cluster roles? I didn't spot anything in the gist log you linked, @gracegrimwood.
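(For illustration only: "additional locking in the mock server" could look something like the sketch below. The class and method names here are hypothetical, not the actual change referenced in [1]; the idea is just to serialize concurrent access to the mock server's shared state.)

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch: guard the mock server's shared resource map so that
// concurrent create/update/delete calls from test threads cannot interleave.
class MockResourceStore {
    private final Map<String, String> resources = new HashMap<>();
    private final ReentrantLock lock = new ReentrantLock();

    void put(String name, String resource) {
        lock.lock();
        try {
            resources.put(name, resource);
        } finally {
            lock.unlock();
        }
    }

    String remove(String name) {
        lock.lock();
        try {
            return resources.remove(name);
        } finally {
            lock.unlock();
        }
    }
}
```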
I don't seem to be, no. I've also run it again a couple of times this morning and haven't had it occur. 😃
Force-pushed 5f4fc70 to 2784c8f
Once #8895 is merged I'll rebase this and we can see if anything remains.
Force-pushed 2784c8f to cc24846
Tests are clean locally for me (no flakes). The latest push includes #8895, which will be removed once that is merged.
@MikeEdgar #8895 was merged, so you can now rebase this, I guess.
Force-pushed cc24846 to fe0ae89
I see there are several flakes in the CI test run, unfortunately. @scholzj, what are your thoughts on having a way for the test timeouts to scale dynamically with the available CPUs?
Force-pushed fe0ae89 to add80c9
What would be the timeouts? If it helps, why not. But I'm not sure why it would help. These tests should IMHO run one by one and not in parallel, so are they really unable to execute in time on one CPU?
Although the tests are running serially, there are background processes like minikube and the mock server competing for CPU cycles. The theory is that with fewer available cores, there is a potential for the async code to take longer to be scheduled and complete.
Well, I think you would need to share more details on how you see it and what the dynamic timeouts would be.
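(To make the idea concrete, here is a minimal sketch of what a CPU-scaled timeout could look like. The method name, the core threshold, and the scaling factor are assumptions for illustration, not what the PR actually implemented.)

```java
import java.time.Duration;

final class TimeoutScaling {
    // Hypothetical: double the base timeout when fewer than four cores are
    // available, on the theory that async tasks get scheduled more slowly.
    static Duration cpuScaledTimeout(Duration base) {
        int cpus = Runtime.getRuntime().availableProcessors();
        return cpus < 4 ? base.multipliedBy(2) : base;
    }
}
```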
Force-pushed add80c9 to 36ae567
Sorry to have left this hanging around for so long. @scholzj, if you see some value in the PR (excluding the third commit with the dynamic timeout change), let me know and I'll rebase it.
I left some questions. TBH, I do not remember why we never got back to it. If it helps, we should try to complete it.
Resolved review threads on:
...n/src/test/java/io/strimzi/operator/common/operator/resource/StrimziPodSetCrdOperatorIT.java
...test/java/io/strimzi/operator/common/operator/resource/AbstractCustomResourceOperatorIT.java
@MikeEdgar For what it's worth, it seems to pass on the CI here as well as on my Arm-based Mac. It is always hard to say if it works better or worse, but we can give it a try. So I would proceed with it.
I allowed my attention to be fully focused elsewhere 😄
At one point there were no flakes listed in the test logs in Azure, but I see some out there now that I don't think would be helped by the timeout changes.
Looking at it a bit more in an IDE and thinking a bit more about it ... I would probably remove it. I think the CPU part is a bit questionable. In the IT tests related to Kubernetes, is it really the local CPU that is the issue? Or is it the CPU of the Kube cluster, which might be different (both in the case of a remote Kube cluster, but also when you run it locally and don't give it the full CPU)?

So if we decide to keep it, I think you should use it in a separate method that is called only from selected places, such as the IT tests, and not from places such as the system tests. But maybe just removing it might be a better start?
@ppatierno @im-konge WDYT ^^^?
Yeah, that's a good point. I was limiting my thinking on this to the minikube scenario with full CPU (which I think is the case in Azure, right?). As you said, better to start with removing it here and possibly adding a separate method for use in individual tests where necessary.
Force-pushed 36ae567 to 284547c
Removed the dynamic timeout changes.
It is Minikube on Azure with 2 CPU cores (that said, from my subjective experience, the actual performance of the 2 cores differs quite heavily between runs).
Regarding STs -> we have just a few situations where we are waiting for a timeout to expire. In ITs the timeouts are relatively small, but in STs we have timeouts of around 14 minutes, so if they were multiplied we could get 28 minutes or something (that's for resource readiness), which is something we don't want. So in case you would like to incorporate it, I would maybe create some kind of encapsulation that would not hit the STs.
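(A sketch of the kind of encapsulation suggested above: a separate opt-in method that only the ITs would call, leaving the system tests' fixed timeouts untouched. All names here are hypothetical, not the actual Strimzi test utilities.)

```java
import java.util.function.BooleanSupplier;

final class ItWait {
    // Hypothetical opt-in variant: only IT code calls this, so ST timeouts
    // (e.g. the ~14-minute resource-readiness waits) are never multiplied.
    static void waitForScaled(String description, long pollMs, long timeoutMs, BooleanSupplier ready) {
        long scaled = Runtime.getRuntime().availableProcessors() < 4 ? timeoutMs * 2 : timeoutMs;
        long deadline = System.currentTimeMillis() + scaled;
        while (!ready.getAsBoolean()) {
            if (System.currentTimeMillis() > deadline) {
                throw new RuntimeException("Timed out waiting for " + description);
            }
            try {
                Thread.sleep(pollMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new RuntimeException("Interrupted waiting for " + description, e);
            }
        }
    }
}
```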
The other thing to consider is whether or not this should close #8526. The flaky tests can be a bit of a moving target, and some still remain that this PR doesn't resolve.
I think closing it is fine. We can always open a new one 😉. |
Thanks @MikeEdgar
LGTM, thanks :)
Sorry to be late to the party, I have no further comments. LGTM, thanks!
Type of change
Select the type of your PR
- Enhancement / new feature
- Refactoring
- Documentation

Description
Fixes #8526
I am unable to confirm the fix with MacOS/aarch64 myself, however.
- Correct the waitFor predicate to complete when the resource being deleted is missing (a hedged sketch appears below, after the checklist).
- The time for the orphan finalizer to be handled and the resource fully deleted seems to vary significantly between runs. Increasing the timeout to 1 minute appears to be sufficient to allow it to succeed consistently.

Checklist
Please go through this checklist and make sure all applicable tasks have been done
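(A hedged sketch of the delete-predicate idea from the description above: the wait should complete as soon as the resource can no longer be fetched, i.e. the getter returns null. The client shape and the waitFor signature below are assumptions for illustration, not the Strimzi APIs.)

```java
import java.util.function.BooleanSupplier;
import java.util.function.Supplier;

final class DeleteWait {
    // Hypothetical predicate: deletion is complete once the resource is gone,
    // so a missing resource no longer causes the wait to run until timeout.
    static BooleanSupplier deleted(Supplier<Object> fetchResource) {
        return () -> fetchResource.get() == null;
    }
}

// Usage sketch with the 1-minute timeout mentioned above (client, ns, name
// are assumed to exist in the surrounding test):
//   waitFor("resource deletion", 1_000, 60_000,
//           DeleteWait.deleted(() -> client.get(ns, name)));
```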