Fixing flaky tests #3563
Conversation
Compared 99176a5 to b7e6156
// if both components are healthy - return c.mod's health, so we can have a stable Health.Message.
if leastHealthy.Health == component.HealthTypeHealthy {
	return c.mod.CurrentHealth()
}
return leastHealthy
There is a race without this: both components report healthy, so the timestamp determines which one is least healthy. This makes the health message non-deterministic.
I think it's safe to pick the parent component's health message if both are healthy.
An alternative would be to combine their health messages.
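Both options above can be sketched with a minimal, hypothetical health type (the real agent uses `component.Health`/`component.HealthTypeHealthy`; the struct and function names here are illustrative only):

```go
package main

import "fmt"

// Health is a hypothetical stand-in for the agent's component health type.
type Health struct {
	Healthy bool
	Message string
}

// leastHealthy mirrors the fix in the diff: when both parent and child are
// healthy, prefer the parent's health so the message does not depend on
// which health record has the newer timestamp.
func leastHealthy(parent, child Health) Health {
	if !child.Healthy {
		return child
	}
	if !parent.Healthy {
		return parent
	}
	return parent // both healthy: deterministic, stable message
}

// combined sketches the alternative mentioned in the review: merge both
// messages instead of picking one.
func combined(parent, child Health) Health {
	return Health{
		Healthy: parent.Healthy && child.Healthy,
		Message: fmt.Sprintf("%s; %s", parent.Message, child.Message),
	}
}

func main() {
	p := Health{Healthy: true, Message: "module loaded"}
	c := Health{Healthy: true, Message: "component running"}
	fmt.Println(leastHealthy(p, c).Message) // parent's message wins
	fmt.Println(combined(p, c).Message)     // both messages joined
}
```

Either variant removes the timestamp from the decision, which is what makes the test assertion stable.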
Will leave this to @erikbaranowski
We talked offline and I'm good to merge this as is for now. A higher-level look at component health management might be needed to fix this more elegantly, but the path forward will require some more thinking.
@@ -11,7 +11,7 @@ import (
 var backoffRetry = backoff.Config{
 	MinBackoff: 10 * time.Millisecond,
 	MaxBackoff: 1 * time.Second,
-	MaxRetries: 5,
+	MaxRetries: 10,
I was intermittently getting the following in CI:
--- FAIL: Test_serviceManager (1.28s)
--- FAIL: Test_serviceManager/can_run_service_binary (0.28s)
eventually.go:48:
Error Trace: /Users/runner/work/agent/agent/cmd/grafana-agent-service/service_test.go:40
/Users/runner/work/agent/agent/cmd/grafana-agent-service/eventually.go:68
/Users/runner/work/agent/agent/cmd/grafana-agent-service/eventually.go:33
/Users/runner/work/agent/agent/cmd/grafana-agent-service/eventually.go:21
/Users/runner/work/agent/agent/cmd/grafana-agent-service/service_test.go:38
Error: Received unexpected error:
Post "http://127.0.0.1:49178/echo/response": dial tcp 127.0.0.1:49178: connect: connection refused
FAIL
FAIL github.com/grafana/agent/cmd/grafana-agent-service 1.496s
Couldn't reproduce this locally. The macOS CI agents are potentially slow (they often are), so we need to allow more time.
require.True(
	t,
	strings.HasPrefix(s, prefix),
	"expected '%v' to have '%v' prefix",
Adds a more useful error message.
Looks good to me; will let Erik make the call on the healthiness.
LGTM
* Fixing test race condition
* Fixing test race condition
* Fixing other test race condition
* Better fix for race condition #1
* Attempt to reduce test flakyness
PR Description

Fixes a few flaky tests:

* component/loki/source/file/file_test.go had a race condition between clean up and the component creating a file.
* component/module/file/file_test.go had an issue where the health message was not deterministic, leading to test flakiness.
* Test_serviceManager/can_run_service_binary fails to dial a local service, potentially not enough time to start up? Increased the backoff time.