CI: collect/aggregate results to highlight flakiest tests #19182

huonw · 2023-05-28T03:58:22Z

Is your feature request related to a problem? Please describe.

Pants' CI is quite flaky at the moment, with a lot of failures due to exceeding timeouts, and some other types of flakes too. I don't think there's currently a good way to get insight into what's flaky other than experiencing the flakes yourself (and/or clicking through a lot of builds), so people might do spot fixes, but it's not easy to get a systematic view.

Describe the solution you'd like

Some sort of 'dashboard' (even an adhoc/on-demand one) summarising the most common test timeouts and failures over the past month (or some other threshold).

Describe alternatives you've considered

None

Additional context
N/A

huonw · 2023-05-28T04:08:06Z

One possibility here would be:

1. uploading JUnit XML artefacts after each build, and some sort of report about test timeouts too
2. GHA automation that pulls down a bunch of those reports, summarises the tests that fail most often and the test files that time out most often, and just prints these (run regularly, each night or each Monday); then a human can go check the logs.

As next steps beyond 2, it could surface the results more obviously. For instance, as a comment in a GH discussion, or splat an HTML file to a GH Pages site (or S3 bucket) for us to browse to.

For the initial version, I'd imagine just summarising all failures would be sufficient, assuming that 'real' failures (e.g. being broken in PR CI) will be far less common than flaky ones, for any given test. That is, if a test is flaky, it'll fail regularly across all the builds done by Pants' CI and so be higher up the list of "most failures", while a real failure might only pop up once or twice in a PR. If it is a problem, one way to reduce that 'real' failure rate would be only looking at the main/2.*.x branch CI results, at the cost of reducing the amount of data (and, particularly, the data about the most acute impact to us developers: PR CI).
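
To make the "most failures" idea concrete, here's a rough sketch of the summarising step, assuming the reports are standard JUnit XML (`<testcase>` elements with `<failure>` or `<error>` children) and have already been downloaded as strings. The function names `failed_tests` and `most_failures` are made up for illustration; this isn't an existing script in the repo.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from typing import Iterable


def failed_tests(junit_xml: str) -> list[str]:
    """Return 'classname::name' for each test case that failed or errored.

    Hypothetical helper: assumes the usual JUnit XML layout.
    """
    root = ET.fromstring(junit_xml)
    failed = []
    # iter() finds <testcase> at any depth, so this works whether the
    # root element is <testsuites> or a single <testsuite>.
    for case in root.iter("testcase"):
        if case.find("failure") is not None or case.find("error") is not None:
            failed.append(f"{case.get('classname', '')}::{case.get('name', '')}")
    return failed


def most_failures(reports: Iterable[str], top: int = 10) -> list[tuple[str, int]]:
    """Count, across all reports, how many reports each test failed in."""
    counts: Counter[str] = Counter()
    for report in reports:
        # set() so a test is counted at most once per report.
        counts.update(set(failed_tests(report)))
    return counts.most_common(top)
```

A flaky test should float to the top of `most_failures(...)` because it fails across many unrelated builds, while a 'real' failure only shows up in the handful of reports from one PR.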

benjyw · 2023-05-31T01:10:39Z

We should definitely do this, but I'm not sure we'll discover that there is a small fixed set of flaky tests. I have a hunch that it's just arbitrary long-running tests getting resource-starved on the weak-ass GHA machines. Which is why extending timeouts on specific tests tends not to fix this.

But even the info on which tests are the longest-running will be really useful for deciding if they're worth it, and rewriting them if not.
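
For the longest-running part, something like the following could work as a starting point. It's only a sketch, assuming the JUnit XML reports carry a per-test `time` attribute in seconds (pytest's JUnit output does) and that the reports are already in memory as strings; `slowest_tests` is a hypothetical name.

```python
import xml.etree.ElementTree as ET
from typing import Iterable


def slowest_tests(reports: Iterable[str], top: int = 10) -> list[tuple[str, float]]:
    """Return the `top` tests by their worst observed duration, slowest first."""
    worst: dict[str, float] = {}
    for report in reports:
        for case in ET.fromstring(report).iter("testcase"):
            test_id = f"{case.get('classname', '')}::{case.get('name', '')}"
            # Treat a missing or empty `time` attribute as 0 seconds.
            seconds = float(case.get("time", "0") or 0)
            worst[test_id] = max(worst.get(test_id, 0.0), seconds)
    return sorted(worst.items(), key=lambda item: item[1], reverse=True)[:top]
```

Taking the maximum rather than the mean is a deliberate choice here: a test that is usually fast but occasionally very slow on a starved machine is exactly the kind that would trip a timeout.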

huonw · 2023-09-22T06:01:36Z

@benjyw did this in #19262, uploading to s3://logs.pantsbuild.org. I've written a basic script that can download a batch and summarise them, revealing:

Metadata:

- 39 runs
- 17641 files
- from 2023-09-08 to 2023-09-20
- branches: main
- platforms: Linux-x86_64

*** Failures ***
2023-09-15 run id=6204315638 src/python/pants/backend/python/util_rules/pex_from_targets_test.py::test_constraints_validation
2023-09-17 run id=6217069852 src/python/pants/backend/go/util_rules/first_party_pkg_test.py::test_package_analysis
2023-09-18 run id=6217116063 src/python/pants/backend/python/util_rules/pex_from_targets_test.py::test_constraints_validation
2023-09-20 run id=6256380291 src/python/pants/backend/python/goals/package_pex_binary_integration_test.py::test_complete_platforms
2023-09-20 run id=6256380291 tests/python/pants_test/init/test_plugin_resolver.py::test_exact_requirements_interpreter_change_sdist
2023-09-20 run id=6256380291 tests/python/pants_test/init/test_plugin_resolver.py::test_exact_requirements_interpreter_change_bdist
*** Timeouts ***

i.e. some flaky failures on main, but apparently no timeouts (or maybe I have a bug with the logic there).

huonw · 2023-09-22T06:02:09Z

Hm, actually, reopen, this task covers summarising too.

huonw added enhancement category:internal CI, fixes for not-yet-released features, etc. labels May 28, 2023

huonw mentioned this issue May 28, 2023

HTTP 429 too many requests for Pex PEX #9399

Closed

huonw closed this as completed Sep 22, 2023

huonw reopened this Sep 22, 2023
