Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixing flaky tests #3563

Merged
merged 5 commits into from
Apr 18, 2023
Merged

Fixing flaky tests #3563

merged 5 commits into from
Apr 18, 2023

Conversation

thampiotr
Copy link
Contributor

@thampiotr thampiotr commented Apr 18, 2023

PR Description

Fixes a few flaky tests:

  • component/loki/source/file/file_test.go had a race condition on clean up & component creating a file
  • component/module/file/file_test.go had an issue where health message was not deterministic, leading to test flakiness.
  • Test_serviceManager/can_run_service_binary fails to dial a local service, potentially not enough time to start up? increased the backoff time.

@thampiotr thampiotr force-pushed the thampiotr/fixing-test-race-condition branch from 99176a5 to b7e6156 Compare April 18, 2023 12:50
Comment on lines +162 to +166
// if both components are healthy - return c.mod's health, so we can have a stable Health.Message.
if leastHealthy.Health == component.HealthTypeHealthy {
return c.mod.CurrentHealth()
}
return leastHealthy
Copy link
Contributor Author

@thampiotr thampiotr Apr 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a race without this - both components are reporting healthy, so the timestamp determines which one is LeastHealthy. This leads to health message being non-deterministic.

I think it's safe to pick the parent component's health message if both are healthy.

An alternative would be to combine their health messages.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will leave this to @erikbaranowski

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We talked offline and I'm good to merge this as is for now. A higher level look at component health management might be needed to fix this in a more elegant way but the path forward will require some more thinking.

@@ -11,7 +11,7 @@ import (
var backoffRetry = backoff.Config{
MinBackoff: 10 * time.Millisecond,
MaxBackoff: 1 * time.Second,
MaxRetries: 5,
MaxRetries: 10,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was intermittently getting the following in CI:

--- FAIL: Test_serviceManager (1.28s)
    --- FAIL: Test_serviceManager/can_run_service_binary (0.28s)
        eventually.go:48: 
            	Error Trace:	/Users/runner/work/agent/agent/cmd/grafana-agent-service/service_test.go:40
            	            				/Users/runner/work/agent/agent/cmd/grafana-agent-service/eventually.go:68
            	            				/Users/runner/work/agent/agent/cmd/grafana-agent-service/eventually.go:33
            	            				/Users/runner/work/agent/agent/cmd/grafana-agent-service/eventually.go:21
            	            				/Users/runner/work/agent/agent/cmd/grafana-agent-service/service_test.go:38
            	Error:      	Received unexpected error:
            	            	Post "http://127.0.0.1:49178/echo/response": dial tcp 127.0.0.1:49178: connect: connection refused
FAIL
FAIL	github.com/grafana/agent/cmd/grafana-agent-service	1.496s

Couldn't repro this locally. Potentially macOS CI agent is slow (they often are) and we need more time.

require.True(
t,
strings.HasPrefix(s, prefix),
"expected '%v' to have '%v' prefix",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adds more useful error message.

@thampiotr thampiotr changed the title Fixing test race condition Fixing flaky tests Apr 18, 2023
@thampiotr thampiotr marked this pull request as ready for review April 18, 2023 13:02
@mattdurham
Copy link
Collaborator

Looks good from me, will let Erik make the call on the healthiness.

Copy link
Collaborator

@mattdurham mattdurham left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@mattdurham mattdurham merged commit 8921c2b into main Apr 18, 2023
@mattdurham mattdurham deleted the thampiotr/fixing-test-race-condition branch April 18, 2023 14:09
clayton-cornell pushed a commit that referenced this pull request Aug 14, 2023
* Fixing test race condition

* Fixing test race condition

* Fixing other test race condition

* Better fix for race condition #1

* Attempt to reduce test flakyness
clayton-cornell pushed a commit that referenced this pull request Aug 14, 2023
* Fixing test race condition

* Fixing test race condition

* Fixing other test race condition

* Better fix for race condition #1

* Attempt to reduce test flakyness
@github-actions github-actions bot added the frozen-due-to-age Locked due to a period of inactivity. Please open new issues or PRs if more discussion is needed. label Mar 1, 2024
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Mar 1, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
frozen-due-to-age Locked due to a period of inactivity. Please open new issues or PRs if more discussion is needed.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants