Inscrutable "Actions workflow run is stale" error #234

Closed
blast-hardcheese opened this issue Feb 25, 2021 · 19 comments

@blast-hardcheese

I'm getting a lot of sporadic failures in reporting, possibly due to the number of parallel builds that are attempting to submit coverage reports.

My project is configured to build the core tests, which takes about four minutes, and then build over twenty integration tests, each of which takes five or more minutes. It seems as though we may be dancing right on the edge of some sort of limit, possibly due to my naive understanding of the after_n_builds option.

Unfortunately, Googling anything about {'detail': ErrorDetail(string='Actions workflow run is stale', code='not_found')} turns up nothing, so hopefully from now on people will at least find this issue.

Would you kindly explain how to increase the timeout for when Codecov is waiting for coverage segments or, if that is not the cause, how to resolve this error?

Thank you for your assistance, as well as for an excellent product!


==> Uploading reports
    url: https://codecov.io
    query: branch=update%2Fjackson-core-2.12.1&commit=5e16535e81483a6a07612ba10cfe32c328469103&build=598338763&build_url=http%3A%2F%2Fgithub.com%2Ftwilio%2Fguardrail%2Factions%2Fruns%2F598338763&name=&tag=&slug=twilio%2Fguardrail&service=github-actions&flags=&pr=927&job=CI&cmd_args=n,F,Q,Z,f
->  Pinging Codecov
https://codecov.io/upload/v4?package=github-action-20210129-7c25fce&token=secret&branch=update%2Fjackson-core-2.12.1&commit=5e16535e81483a6a07612ba10cfe32c328469103&build=598338763&build_url=http%3A%2F%2Fgithub.com%2Ftwilio%2Fguardrail%2Factions%2Fruns%2F598338763&name=&tag=&slug=twilio%2Fguardrail&service=github-actions&flags=&pr=927&job=CI&cmd_args=n,F,Q,Z,f
{'detail': ErrorDetail(string='Actions workflow run is stale', code='not_found')}
404
==> Uploading to Codecov
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100  182k  100    81  100  182k    400   904k --:--:-- --:--:-- --:--:--  904k
    {'detail': ErrorDetail(string='Actions workflow run is stale', code='not_found')}
Error: Codecov failed with the following error: The process '/usr/bin/bash' failed with exit code 1
@thomasrockhu
Contributor

Hi @blast-hardcheese, we are working to understand the issue here, but I think for now as a workaround, you can supply the Codecov upload token. Do you have a GitHub Actions CI link that we can take a look at btw?
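
For readers landing here, the token workaround suggested above might look roughly like the step below. This is only a sketch: the secret name `CODECOV_TOKEN` and the `@v4` version pin are assumptions, not something stated in this thread, and the secret has to be created in the repository settings first.

```yaml
# Sketch of a workflow step that passes the upload token to the Codecov action.
# CODECOV_TOKEN is an assumed secret name; use whatever name you stored it under.
- name: Upload coverage to Codecov
  uses: codecov/codecov-action@v4   # version pin is an assumption
  with:
    token: ${{ secrets.CODECOV_TOKEN }}
```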

@blast-hardcheese
Author

> Do you have a GitHub Actions CI link that we can take a look at btw?

Sure -- you can take a look at many of the recent failures on https://github.com/guardrail-dev/guardrail/ , one example is https://github.com/guardrail-dev/guardrail/pull/1000/checks?check_run_id=1976440163 .

I've just been re-running all the checks and usually the subsequent run is successful.

@blast-hardcheese
Author

Additionally, I've moved this repo out from where it was previously hosted, https://github.com/twilio/guardrail/ , within the past 24 hours -- that may impact your investigation. If you need more samples from after the repo was moved over, I can submit them as they come in -- library upgrade PRs are the most likely to trigger this, due to the rate of submission.

@blast-hardcheese
Author

More recent example after moving the repo to a new org and re-authorizing: https://github.com/guardrail-dev/guardrail/pull/1004/checks?sha=ff99a5dfa20d69e2f8519ca7d6569f5a6ebb63a8

@thomasrockhu
Contributor

@blast-hardcheese, unless I'm missing something, I couldn't find the above error in that latest link. Apologies if it's really blatant and I missed it, but would you mind sharing the name of the job that failed?

@blast-hardcheese
Author

@thomasrockhu Ack! I didn't realize that re-running the workflow erased the failure, I thought links were stable.

I was able to reproduce the error on an already merged PR, so this should not change:

https://github.com/guardrail-dev/guardrail/pull/1004/checks?check_run_id=2009728188

Sorry about that!

@blast-hardcheese
Author

blast-hardcheese commented Mar 2, 2021

I don't know if this is related, but if this is a race condition, it very well may be -- we're also experiencing the exact opposite problem, where we successfully report all after_n_builds runs (22 runs) asynchronously to codecov.io for a PR, but the callback never fires, so we never get a response to the required codecov build phase.

A normal run looks like this:
[screenshot not preserved]

In this example, it just hung like this (I've since merged the PR, but you can still see that Codecov is not in the reported checks for that PR, meaning the callback didn't fire):
[screenshot not preserved]

@thomasrockhu
Contributor

@blast-hardcheese, I think I resolved most of the "Actions workflow run is stale" errors. Let me know if that's not the case.

As for the most recent example, it didn't fire because we had only received 16 builds (and not 22). It's a little challenging to see which build didn't upload properly, do you happen to know the names of the jobs?

@thomasrockhu
Contributor

In that particular example, it looks like some/all of the Scala 15 builds didn't run tests or try to upload to Codecov

@Yoshanuikabundi

Hi! Let me know if I should open a new issue for this, but we're having an identical problem. We're planning on reducing the size of our testing matrix in the near future, will this alleviate the problem? Otherwise if you could take a look that'd be great! Thanks :)

@blast-hardcheese
Author

> In that particular example, it looks like some/all of the Scala 15 builds didn't run tests or try to upload to Codecov

You're completely correct. I didn't realize that I had excluded some coverage uploads while also using after_n_builds -- sorry for confusing the issue here.

I haven't seen the "Actions workflow run is stale" error for more than a week at this point, so may I ask what you did on your end? Is this something I could have done via the Codecov UI somehow, and is there a possibility of this resurfacing? I've noticed some other 👍s on the initial issue, so presumably others are running into this as well.
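
To make the after_n_builds pitfall above concrete: the setting lives in the repository's `codecov.yml`, and Codecov waits for that many uploads before posting its status. The fragment below is a sketch using the 22 uploads mentioned in this thread. If any job skips its upload (here only 16 arrived), the count is never reached and the status check never appears.

```yaml
# codecov.yml (sketch; 22 is the upload count discussed in this thread)
codecov:
  notify:
    # Must equal the number of jobs that actually upload coverage.
    # If fewer uploads arrive, the Codecov status is never posted.
    after_n_builds: 22
```

If some matrix entries are excluded from coverage, the number has to be lowered to match.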

@blast-hardcheese
Author

(Also, thank you again for all your help here!)

@laurynas-biveinis

FWIW I have been running into this as recently as yesterday in my project too - https://github.com/laurynas-biveinis/unodb/runs/2109776375?check_suite_focus=true

In my case there are two flag-separated configurations, which get uploaded in parallel. Perhaps they should be serialized?

@thomasrockhu
Contributor

thomasrockhu commented Mar 21, 2021

@laurynas-biveinis I'm looking into making a patch for this. We should hopefully have that particular edge case fixed this week.

@briansmith

I was having this problem and I found adding the Codecov token as a GitHub Actions secret helped. However, I'm now getting this error on every merge to my main branch, after the jobs for the same commit on its feature branch (pre-merge) succeed.

@ChristophWurst

> I was having this problem and I found adding the Codecov token as a GitHub Actions secret helped.

Unfortunately, for any GitHub organization with a wider community, this risks leaking an access token, so we at Nextcloud dropped our Codecov tokens from the action, because the README says they are not required for public repositories.

Our current mitigation is to report coverage only for a few CI runs, though that can potentially lower the reported coverage as some paths are only triggered by certain tests in our matrix.

@briansmith

> I was having this problem and I found adding the Codecov token as a GitHub Actions secret helped. However, I'm now getting this error on every merge to my main branch, after the jobs for the same commit on its feature branch (pre-merge) succeed.

I was mistaken. Although I did start the process of adding a Codecov token as a secret within my GitHub Actions workflow, I never got around to hooking it up to my use of this action, so it was never used. Thus it had no effect. It seems like Codecov must have addressed the issue here on its end.

In issue #300 I suggest a different solution that doesn't require using a Codecov access token: move the uploading of coverage out of the jobs that collect it. If only one job submits coverage data to Codecov, then you can avoid the timeout issue described above, AFAICT, and you can also properly minimize permissions on the GitHub token. You'd need to upload the coverage data as an artifact in each job that collects coverage information, download those artifacts in the job that submits the coverage information, and use "needs:" to tell GitHub Actions about the dependency between the jobs.
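
The artifact-based layout described above could be sketched as the workflow below. It is an illustration under assumptions, not a tested configuration from this thread: the suite names, the `./run-tests.sh` script, the `coverage/` output directory, and all action version pins are hypothetical placeholders.

```yaml
name: CI
on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        suite: [core, integration-a, integration-b]  # hypothetical suite names
    steps:
      - uses: actions/checkout@v4
      # Hypothetical script; replace with whatever produces your coverage report.
      - run: ./run-tests.sh "${{ matrix.suite }}"
      # Each job stores its report as a uniquely named artifact instead of
      # contacting Codecov directly.
      - uses: actions/upload-artifact@v4
        with:
          name: coverage-${{ matrix.suite }}
          path: coverage/  # assumed report location

  upload-coverage:
    # "needs" makes this job wait until every matrix job has finished.
    needs: test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Fetch every coverage-* artifact; each lands in its own subdirectory.
      - uses: actions/download-artifact@v4
        with:
          pattern: coverage-*
          path: coverage-reports
      # The only step in the workflow that talks to Codecov.
      - uses: codecov/codecov-action@v4
        with:
          directory: coverage-reports
```

Because only one job uploads, an after_n_builds count is likely no longer needed with this layout. Depending on the action version and repository visibility, the final step may still need a `token` input.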

@thomasrockhu-codecov
Contributor

Closing as this no longer seems to be an issue.
