Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pathogen-repo-build: Wait for AWS Batch jobs to finish #54

Merged
merged 4 commits into from
Feb 13, 2024

Conversation

tsibley
Copy link
Member

@tsibley tsibley commented Sep 14, 2023

This avoids the disconnect between the success/failure status of a GitHub Actions workflow run and the actual AWS Batch job, which makes for easier reporting and debugging and generally less cognitive dissonance.

GitHub Actions workflow runs have a very high max timeout of 35 days, but each job in a workflow has a much lower max timeout of 6 hours. Many of our builds should be less than 6 hours¹, but here we support builds up to 24 hours by chaining together 4 GitHub Actions jobs. We can add more jobs to the chain if we need to, but I don't foresee that.

Nearly all of our builds are in public repos, which means they won't consume usage minutes from our GitHub Actions quota. However, they will consume concurrency limits from the quota. We were already frequently bumping into the default free-tier quota of 20 concurrent jobs, so adding more long-running jobs (to wait around for the AWS Batch jobs) was a nonstarter until we upgraded to a Team plan with its corresponding quota of 60 concurrent jobs.²

¹ As of 24 Jan, all Batch jobs except one in the prior week were sub 12 hours. The exception was the GISAID ncov-ingest job launched on 16 Jan. Next longest jobs were 10.5 hours, all GenBank ncov-ingest jobs. https://github.com/nextstrain/private/issues/95#issuecomment-1909155013

² https://bedfordlab.slack.com/archives/C7SDVPBLZ/p1705024616778389 https://github.com/tsibley/blab-standup/blob/HEAD/2023-08-22.md


See also the commit messages of supporting commits leading up to the final commit (above).

Checklist

  • Checks pass

@tsibley tsibley force-pushed the trs/pathogen-repo-build/wait-for-aws-batch-job branch from 275daac to 7864f50 Compare September 21, 2023 21:17
@tsibley tsibley force-pushed the trs/pathogen-repo-build/aws-assume-role branch from 4e26422 to fbff47b Compare January 22, 2024 22:25
Base automatically changed from trs/pathogen-repo-build/aws-assume-role to master February 5, 2024 22:25
@tsibley tsibley force-pushed the trs/pathogen-repo-build/wait-for-aws-batch-job branch 6 times, most recently from 7803f9c to 045c171 Compare February 6, 2024 00:21
@tsibley tsibley changed the base branch from master to trs/actions/upgrade-called-actions February 6, 2024 00:23
Base automatically changed from trs/actions/upgrade-called-actions to master February 6, 2024 00:23
No harm in describing it in the README, esp. as other development
programs are to be added imminently.
@tsibley tsibley force-pushed the trs/pathogen-repo-build/wait-for-aws-batch-job branch 4 times, most recently from f93f183 to 92458ee Compare February 6, 2024 23:34
@tsibley tsibley requested review from a team and joverlee521 February 6, 2024 23:37
@tsibley tsibley marked this pull request as ready for review February 6, 2024 23:37
@tsibley tsibley force-pushed the trs/pathogen-repo-build/wait-for-aws-batch-job branch from 92458ee to 323e376 Compare February 6, 2024 23:41
@jameshadfield
Copy link
Member

jameshadfield commented Feb 7, 2024

I'll review this when I'm back from vacation, but no problem if it is merged before then.

Makefile Show resolved Hide resolved
@tsibley tsibley force-pushed the trs/pathogen-repo-build/wait-for-aws-batch-job branch 3 times, most recently from 49247cb to b8383d3 Compare February 9, 2024 22:46
…hored input YAML

This allows us to author the workflow using YAML anchors/references
(especially with merge keys) since GitHub Actions doesn't otherwise
support those YAML features.

The lack of support is a shame because GitHub Actions workflows can be
very repetitive and anchors/references/merges are a decent solution to
that.  I'm about to add substantial conceptual replication that we won't
want to maintain concretely replicated in the file.

Since the generated file must be checked in, a new CI step ensures the
generated file matches the authored file and the generated file is
excluded from git diffs by default.  An optional local pre-commit hook
is also available for making sure you craft commits that won't run afoul
of the CI check later.  Or you can, as necessary, run `make` manually
before committing.
This lets us more easily test it in development.
This avoids the disconnect between the success/failure status of a
GitHub Actions workflow run and the actual AWS Batch job, which makes
for easier reporting and debugging and generally less cognitive
dissonance.

GitHub Actions workflow _runs_ have a very high max timeout of 35 days,
but each _job_ in a workflow has a much lower max timeout of 6 hours.
Many of our builds should be less than 6 hours¹, but here we support
builds up to 24 hours by chaining together 4 GitHub Actions jobs.  We
can add more jobs to the chain if we need to, but I don't foresee that.

Nearly all of our builds are in public repos, which means they won't
consume usage minutes from our GitHub Actions quota.  However, they
_will_ consume concurrency limits from the quota.  We were already
frequently bumping into the default free-tier quota of 20 concurrent
jobs, so adding more long-running jobs (to wait around for the AWS Batch
jobs) was a nonstarter until we upgraded to a Team plan with its
corresponding quota of 60 concurrent jobs.²

¹ As of 24 Jan, all Batch jobs except one in the prior week were sub 12
  hours.  The exception was the GISAID ncov-ingest job launched on 16 Jan.
  Next longest jobs were 10.5 hours, all GenBank ncov-ingest jobs.
  <nextstrain/private#95 (comment)>

² <https://bedfordlab.slack.com/archives/C7SDVPBLZ/p1705024616778389>
  <https://github.com/tsibley/blab-standup/blob/HEAD/2023-08-22.md>
@tsibley tsibley force-pushed the trs/pathogen-repo-build/wait-for-aws-batch-job branch from b8383d3 to 57eb5e5 Compare February 10, 2024 00:27
@tsibley
Copy link
Member Author

tsibley commented Feb 10, 2024

Repushed to fix a linting issue (but not bug per se) that shellcheck in CI found.

Copy link
Contributor

@joverlee521 joverlee521 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very cool use of YAML anchors and references, I will need to spend more time to become familiar with it. I was able to successfully run make and use the pre-commit hook.

The actual changes to the pathogen-repo-build workflow looks good by inspection. I think reasonable to test it out with the ncov-ingest workflows next?

In our discussion yesterday about standardizing GH Actions for pathogen repos, @jameshadfield raised a good question about whether this will work with your proposed pathogen-repo-automation workflow...My gut feeling is it will not work, but will have to try it out.

@tsibley
Copy link
Member Author

tsibley commented Feb 10, 2024

In our discussion yesterday about standardizing GH Actions for pathogen repos, @jameshadfield raised a good question about whether this will work with your proposed pathogen-repo-automation workflow...My gut feeling is it will not work, but will have to try it out.

I assume you're referring to this bit in yesterday's meeting notes?

The exact behaviour of AWS job-monitoring beyond the 6h job limit is unclear - if this is built into [pathogen-repo-build.yaml] then the calling job in [pathogen-repo-automation.yaml] will exceed its 6 hour job limit too?

I left a comment in the meeting notes, but since those can get lost, I'll copy it here:

As far as I understand it, there is no "calling job", per se. For a reusable workflow, the calling job exists only in notation. At execution time, the called workflow's jobs become part of the caller's jobs. Or perhaps you might say the called workflow's jobs replace the caller job. If you look at the GitHub Actions web UI or the data model exposed by the GitHub API, this understanding lines up.

@tsibley
Copy link
Member Author

tsibley commented Feb 10, 2024

I agree testing with ncov-ingest or something like that seems like a good next step. Next week, I'll merge this and arrange for that testing.

@tsibley
Copy link
Member Author

tsibley commented Feb 12, 2024

Hmm, actually, ncov-ingest and ncov aren't already using this pathogen-repo-build.yaml workflow. I don't really want to adapt them to use it right now as part of testing this.

Merging, as I suggested quickly last Friday, will cause all callers of this workflow to immediately start using the new version in this PR.

So either we can test in production by merging, fixing anything that crops up… which is probably fine but maybe also breaks the world.

Or I can temporarily modify some subset of callers to point to the workflow on this branch instead and start with more limited testing of real work loads. I'll do that.

@tsibley
Copy link
Member Author

tsibley commented Feb 12, 2024

Extracted some data from calling workflows.

tsibley added a commit to nextstrain/mpox that referenced this pull request Feb 12, 2024
…workflow

Using mpox here to test changes to how we have GitHub Actions interact
with AWS Batch, in advance of merging and releasing those changes to
every pathogen repo.  "Testing in production, just not all at once."

Related-to: <nextstrain/.github#54>
@tsibley
Copy link
Member Author

tsibley commented Feb 12, 2024

I modified mpox to use the workflow from this branch. It works for CI.

Then I triggered a manual (trial run) test: https://github.com/nextstrain/mpox/actions/runs/7879309419/job/21499345399

We'll also see what happens tomorrow after daily ingest is run.

@tsibley
Copy link
Member Author

tsibley commented Feb 13, 2024

Ingest worked this morning.

@tsibley
Copy link
Member Author

tsibley commented Feb 13, 2024

Merging this today so it's used for all workflows tomorrow morning. Things look good with mpox, so let's go.

@tsibley tsibley merged commit 0fcb948 into master Feb 13, 2024
37 of 39 checks passed
@tsibley tsibley deleted the trs/pathogen-repo-build/wait-for-aws-batch-job branch February 13, 2024 21:50
tsibley added a commit to nextstrain/mpox that referenced this pull request Feb 13, 2024
…ld.yaml workflow"

This reverts commit 34feb4a.

After successful runs, the preview version is now merged to master, so
switch back.

Related-to: <nextstrain/.github#54>
@tsibley
Copy link
Member Author

tsibley commented Feb 13, 2024

Changed the mpox workflows back: nextstrain/mpox@f608152

tsibley added a commit to nextstrain/status that referenced this pull request Feb 14, 2024
Before this point, the status of a GitHub Actions workflow run isn't
reflective of the the actual AWS Batch job outcome.

Related-to: <nextstrain/.github#54>
@tsibley
Copy link
Member Author

tsibley commented Feb 14, 2024

Everything overnight looks good:

image

tsibley added a commit to nextstrain/status that referenced this pull request Feb 14, 2024
Before this point, the status of a GitHub Actions workflow run isn't
reflective of the the actual AWS Batch job outcome.

Related-to: <nextstrain/.github#54>
tsibley added a commit to nextstrain/status that referenced this pull request Feb 14, 2024
Before this point, the status of a GitHub Actions workflow run isn't
reflective of the the actual AWS Batch job outcome.

Related-to: <nextstrain/.github#54>
tsibley added a commit to nextstrain/status that referenced this pull request Feb 14, 2024
Before this point, the status of a GitHub Actions workflow run isn't
reflective of the the actual AWS Batch job outcome.

Related-to: <nextstrain/.github#54>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

3 participants