CI: collect/aggregate results to highlight flakiest tests #19182

Open
huonw opened this issue May 28, 2023 · 4 comments
Labels
category:internal CI, fixes for not-yet-released features, etc. enhancement

Comments

@huonw
Contributor

huonw commented May 28, 2023

Is your feature request related to a problem? Please describe.

Pants' CI is quite flaky at the moment, with a lot of failures due to exceeding timeouts, and some other types of flakes too. I don't think there's currently a good way to get insight into what's flaky other than experiencing the flakes yourself (and/or clicking through a lot of builds), so people might do spot fixes, but it's not easy to get a systematic view.

Describe the solution you'd like

Some sort of 'dashboard' (even an ad hoc/on-demand one) summarising the most common test timeouts and failures over the past month (or some other time window).

Describe alternatives you've considered

None

Additional context
N/A

@huonw huonw added enhancement category:internal CI, fixes for not-yet-released features, etc. labels May 28, 2023
@huonw
Contributor Author

huonw commented May 28, 2023

One possibility here would be:

  1. uploading JUnit XML artefacts after each build, and some sort of report about test timeouts too
  2. GHA automation that pulls down a bunch of those reports, summarises the tests that fail most often and the test files that time out most often, and just prints these (run regularly, e.g. each night or each Monday); a human can then go check the logs.
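
As a rough illustration of step 2, here is a minimal sketch of counting failures across a batch of JUnit XML reports. It assumes the reports follow the common `<testsuite>`/`<testcase>` layout with `<failure>` or `<error>` children; the function names `failed_tests` and `summarise` and the sample reports are hypothetical, and this is not the script actually used for Pants' CI.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample reports; real ones would be read from the uploaded artefacts.
SAMPLE_A = """<testsuite>
  <testcase classname="pkg.a_test" name="test_one"><failure message="boom"/></testcase>
  <testcase classname="pkg.a_test" name="test_two"/>
</testsuite>"""
SAMPLE_B = """<testsuites><testsuite>
  <testcase classname="pkg.a_test" name="test_one"><failure message="boom"/></testcase>
  <testcase classname="pkg.a_test" name="test_two"><error message="crash"/></testcase>
</testsuite></testsuites>"""


def failed_tests(junit_xml: str) -> list[str]:
    """Return 'classname::name' for every test case that failed or errored."""
    root = ET.fromstring(junit_xml)
    failed = []
    # iter() finds <testcase> at any depth, so <testsuites> and <testsuite> roots both work.
    for case in root.iter("testcase"):
        if case.find("failure") is not None or case.find("error") is not None:
            failed.append(f"{case.get('classname', '')}::{case.get('name', '')}")
    return failed


def summarise(reports: list[str]) -> list[tuple[str, int]]:
    """Count failures per test across many reports, most frequent first."""
    counts = Counter()
    for report in reports:
        counts.update(failed_tests(report))
    return counts.most_common()


if __name__ == "__main__":
    for test_id, count in summarise([SAMPLE_A, SAMPLE_B]):
        print(f"{count:3d}  {test_id}")
```

Timeouts would need a separate source of data, since a test file that is killed for exceeding its timeout may never produce a JUnit XML report at all.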

As next steps beyond 2, it could surface the results more obviously: for instance, as a comment in a GH discussion, or by splatting an HTML file to a GH Pages site (or S3 bucket) for us to browse to.

For the initial version, I'd imagine just summarising all failures would be sufficient, assuming that 'real' failures (e.g. a test being broken in PR CI) will be far less common than flaky ones, for any given test. That is, if a test is flaky, it'll fail regularly across all the builds done by Pants' CI and so be higher up the list of "most failures", while a real failure might only pop up once or twice in a PR. If that is a problem, one way to reduce the 'real' failure rate would be to only look at the main/2.*.x branch CI results, at the cost of reducing the amount of data (and, particularly, the data about the most acute impact on us developers: PR CI).
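
The ranking and the optional branch restriction described above could be sketched as follows. The `Failure` record, its fields, and the `rank_failures` function are hypothetical names for illustration; the sketch assumes each failure has already been tagged with the branch and run it came from, and it counts distinct runs per test so that one run reporting a test twice is not double-counted.

```python
from collections import defaultdict
from fnmatch import fnmatchcase
from typing import Iterable, NamedTuple, Optional, Sequence


class Failure(NamedTuple):
    """One failed test in one CI run (hypothetical record shape)."""
    branch: str
    run_id: int
    test_id: str


def rank_failures(
    failures: Iterable[Failure],
    branch_patterns: Optional[Sequence[str]] = None,
) -> list[tuple[str, int]]:
    """Rank tests by the number of distinct runs in which they failed.

    If branch_patterns is given (e.g. ("main", "2.*.x")), only failures on
    branches matching one of the glob patterns are counted.
    """
    runs_by_test: dict[str, set[int]] = defaultdict(set)
    for failure in failures:
        if branch_patterns is not None and not any(
            fnmatchcase(failure.branch, pattern) for pattern in branch_patterns
        ):
            continue
        runs_by_test[failure.test_id].add(failure.run_id)
    # Most failing runs first; ties broken alphabetically for stable output.
    return sorted(
        ((test_id, len(runs)) for test_id, runs in runs_by_test.items()),
        key=lambda item: (-item[1], item[0]),
    )
```

Calling `rank_failures(failures)` uses every build, PR CI included, while `rank_failures(failures, ("main", "2.*.x"))` trades data volume for fewer 'real' failures in the ranking.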

@benjyw
Sponsor Contributor

benjyw commented May 31, 2023

We should definitely do this, but I'm not sure we'll discover that there is a small fixed set of flaky tests. I have a hunch that it's just arbitrary long-running tests getting resource-starved on the weak-ass GHA machines. Which is why extending timeouts on specific tests tends not to fix this.

But even the info on which tests are the longest-running will be really useful for deciding whether they're worth it, and rewriting them if not.
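
One way to get at the longest-running tests, as a minimal sketch: JUnit XML reports commonly carry a `time` attribute (in seconds) on each `<testcase>`, so the slowest tests can be listed from a single report. The function name `slowest_tests` and the sample report are hypothetical, and whether the uploaded reports record timings in this form is an assumption.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample report with per-test timings in seconds.
SAMPLE = """<testsuite>
  <testcase classname="pkg.fast_test" name="test_quick" time="0.25"/>
  <testcase classname="pkg.slow_test" name="test_integration" time="92.5"/>
  <testcase classname="pkg.mid_test" name="test_medium" time="7.0"/>
  <testcase classname="pkg.mid_test" name="test_untimed"/>
</testsuite>"""


def slowest_tests(junit_xml: str, limit: int = 10) -> list[tuple[str, float]]:
    """Return up to `limit` (test id, seconds) pairs, slowest first."""
    root = ET.fromstring(junit_xml)
    timings = []
    for case in root.iter("testcase"):
        # A missing or empty time attribute is treated as zero seconds.
        seconds = float(case.get("time") or "0")
        timings.append((f"{case.get('classname', '')}::{case.get('name', '')}", seconds))
    timings.sort(key=lambda item: item[1], reverse=True)
    return timings[:limit]


if __name__ == "__main__":
    for test_id, seconds in slowest_tests(SAMPLE, limit=3):
        print(f"{seconds:8.2f}s  {test_id}")
```

Durations measured on a resource-starved machine will be noisy, so comparing the same test across many runs is likely more informative than a single report.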

@huonw
Contributor Author

huonw commented Sep 22, 2023

@benjyw did this in #19262, uploading to s3://logs.pantsbuild.org. I've written a basic script that can download a batch and summarise them, revealing:

Metadata:

- 39 runs
- 17641 files
- from 2023-09-08 to 2023-09-20
- branches: main
- platforms: Linux-x86_64

*** Failures ***
2023-09-15 run id=6204315638 src/python/pants/backend/python/util_rules/pex_from_targets_test.py::test_constraints_validation
2023-09-17 run id=6217069852 src/python/pants/backend/go/util_rules/first_party_pkg_test.py::test_package_analysis
2023-09-18 run id=6217116063 src/python/pants/backend/python/util_rules/pex_from_targets_test.py::test_constraints_validation
2023-09-20 run id=6256380291 src/python/pants/backend/python/goals/package_pex_binary_integration_test.py::test_complete_platforms
2023-09-20 run id=6256380291 tests/python/pants_test/init/test_plugin_resolver.py::test_exact_requirements_interpreter_change_sdist
2023-09-20 run id=6256380291 tests/python/pants_test/init/test_plugin_resolver.py::test_exact_requirements_interpreter_change_bdist
*** Timeouts ***

i.e. some flaky failures on main, but apparently no timeouts (or maybe I have a bug in the logic there).
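
To turn a listing like the one above into a "most failures" ranking, the lines can be grouped by test id. This is a minimal sketch that assumes the `DATE run id=ID TEST` line shape shown in the output; the function name `count_by_test` is hypothetical and this is not the script that produced the listing.

```python
from collections import Counter

# The failure lines copied from the summary output above.
REPORT = """\
2023-09-15 run id=6204315638 src/python/pants/backend/python/util_rules/pex_from_targets_test.py::test_constraints_validation
2023-09-17 run id=6217069852 src/python/pants/backend/go/util_rules/first_party_pkg_test.py::test_package_analysis
2023-09-18 run id=6217116063 src/python/pants/backend/python/util_rules/pex_from_targets_test.py::test_constraints_validation
2023-09-20 run id=6256380291 src/python/pants/backend/python/goals/package_pex_binary_integration_test.py::test_complete_platforms
2023-09-20 run id=6256380291 tests/python/pants_test/init/test_plugin_resolver.py::test_exact_requirements_interpreter_change_sdist
2023-09-20 run id=6256380291 tests/python/pants_test/init/test_plugin_resolver.py::test_exact_requirements_interpreter_change_bdist
"""


def count_by_test(report: str) -> Counter:
    """Count how many listed failures each test id accounts for."""
    counts = Counter()
    for line in report.splitlines():
        # Expected shape: "<date> run id=<id> <test id>"; anything else is skipped.
        parts = line.split(maxsplit=3)
        if len(parts) != 4 or parts[1] != "run" or not parts[2].startswith("id="):
            continue
        counts[parts[3]] += 1
    return counts


if __name__ == "__main__":
    for test_id, count in count_by_test(REPORT).most_common():
        print(f"{count}  {test_id}")
```

On this batch, only `test_constraints_validation` appears more than once (in two separate runs); the other four tests each fail once, three of them in the same run.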

@huonw huonw closed this as completed Sep 22, 2023
@huonw
Contributor Author

huonw commented Sep 22, 2023

Hm, actually, reopening: this task covers summarising too.

@huonw huonw reopened this Sep 22, 2023
2 participants