Automate ingest and phylogenetic workflows #52

joverlee521 · 2024-03-29T23:44:31Z

Description of proposed changes

Adds a single GH Action workflow to automate the ingest and phylogenetic workflows, set to run daily at the same time as the automated mpox ingest.

Uses GH Action caches to store hash of ingest results' Metadata.sha256sum values added to the S3 metadata within upload-to-s3. If the cache contains a match from previous runs of the GH Action workflow, then the workflow will skip the phylogenetic job.

See commits for details.

Related issue(s)

Based on discussion in nextstrain/pathogen-repo-guide#25

Checklist

Checks pass
March 26, 2024: Manually triggered first full run
Manually triggered subsequent runs that did not see new data:

Currently just runs the ingest workflow and uploads the results to AWS S3. Subsequent commits will add automation for the phylogenetic workflow.

The phylogenetic workflow will run after the ingest workflow has completed successfully to use the latest available data. Subsequent commits will check if the ingest results included new data to only run the phylogenetic workflow when there's new data.

Uses GitHub Actions cache to store a file that contains the `Metadata.sh256sum` of the ingest files on S3 and use the `hashFiles` function to create a unique cache key. Then the existence of the cache key is an indicator that the ingest file contents have not been updated since a previous run on GH Actions. This does come with a big caveat that GH will remove any cache entries that have not been accessed in over 7 days.¹ If the workflow is not being automatically run within 7 days, then it will always run the phylogenetic job. If this works well, then we may want to consider moving this within the `pathogen-repo-build` reusable workflow to have the same functionality across pathogen automation workflows. ¹ https://docs.github.com/en/actions/using-workflows/caching-dependencies-to-speed-up-workflows#usage-limits-and-eviction-policy

Add individuals inputs per workflow to override the default Docker image used by `nextstrain build`. Having this input has been extremely helpful to continue running pathogen workflows when we run into new bugs that are not present in older nextstrain-base images. I've made separate image inputs for the two workflows because they use different tools and may require different versions of images.

Copied daily schedule of mpox ingest https://github.com/nextstrain/mpox/blob/e439235ff1c1d66e7285b774e9536e2896d9cd2f/.github/workflows/fetch-and-ingest.yaml#L4-L21 Daily runs seem fine since the ingest workflow currently takes less than 2 minutes to complete and it will not trigger the phylogenetic workflow if there's no new data. We can bring this down to once a week if it seems like overkill.

joverlee521 · 2024-03-29T23:48:06Z

This currently does not support

running individual ingest/phylogenetic workflows
trial runs for staging

I figured these can be added in the future. Will need to think through whether it makes sense to support these within this workflow (which will complicate conditionals) or should we just have separate GH Action workflows for them.

jameshadfield · 2024-04-04T07:35:28Z

.github/workflows/ingest-to-phylogenetic.yaml

@@ -39,11 +39,49 @@ jobs:
        ingest/logs/
        ingest/.snakemake/log/

-  # TKTK check if ingest results include new data
-  # potentially use actions/cache to store Metadata.sha256sum of S3 files
+  # Check if ingest results include new data by checking for the cache


This relies on the ordering of metadata & sequences to be identical across runs, right? I don't think any of our ingest steps sort the data, and any (default, multi-threaded) call to nextclade run will change the ordering. I also wouldn't think the NCBI data is guaranteed to be in the same order across different dataset / entrez runs, but maybe?

This relies on the ordering of metadata & sequences to be identical across runs, right?

Yes! We do not do any explicit sorting, so change in order from upstream data can result in triggering the phylogenetic workflow. However, we've been using this method for ncov-ingest and mpox with dataset and it's worked pretty well.

any (default, multi-threaded) call to nextclade run will change the ordering.

Nextclade can potentially change ordering of the aligned.fasta, but not the sequences.fasta. I don't think it will have an influence on metadata, since usually the joining of the metadata and nextclade data outputs based on metadata order. If we are worried about Nextclade order we can use the --in-order flag to guarantee output order is the same as input order. Although, this repo currently does not include nextclade run in the ingest workflow.

jameshadfield · 2024-04-04T07:42:56Z

.github/workflows/ingest-to-phylogenetic.yaml

-  # potentially use actions/cache to store Metadata.sha256sum of S3 files
+  # Check if ingest results include new data by checking for the cache
+  # of the file with the results' Metadata.sh256sum (which should have been added within upload-to-s3)
+  # GitHub will remove any cache entries that have not been accessed in over 7 days,


Why upload the data in the previous step and then check against a cache of the just-overwritten S3 files' SHA256? It seems simpler to compute the SHA256 of the local ingest files before they're uploaded, diff against the current latest S3 files' SHA256s, and if they match then the ingest step succeeds with "no new data" status, and later rules don't run. This'd avoid the 7-day-cache limitation, as well as not add upload (add) unnecessary data to a versioned S3 bucket.

Good question! This is actually already happening as you've described within upload-to-s3. It will only upload the file if the SHA256 does not match.

Using the cache in GitHub Action workflow is my work-around to do this check for "no new data" status without designing some sort of "status" JSON/YAML file.

We discussed this topic a little bit in today's meeting. While upload-to-s3 knows if the file is different ("new data"), this information is encapsulated within the injest job and we are using the check-new-data job as a work-around to avoid extracting this information from the previous job. We'll probably revisit this in the future?

joverlee521 · 2024-04-04T22:00:43Z

I merged after approval from team in today's walk-through of the workflow.

I manually ran the workflow on the main branch.
The first workflow ran both ingest and phylogenetic despite the metadata/sequences files not changing. This is because the GitHub Action cache is segregated by branch.

The second workflow is able to check the cache from the first run on the main branch and only ran the ingest job.

I will check in tomorrow's scheduled run as well.

joverlee521 · 2024-04-05T17:37:30Z

Today's scheduled workflow ran as expected. There was no new data in ingest so the phylogenetic job was skipped.

The zika repo is also now on the pathogen workflow status page. As expected, "Ingest to phylogenetic" shows up as a single workflow with a single completion status.

If we find that we want to split the status of ingest and phylogenetic, we have a couple options:

Update the status page to reflect job status within the workflow. I haven't found a table in steampipe-plugin-github that includes individual job status, but there is a GitHub API endpoint for individual jobs of a workflow run.
We revert back to the previous set up of individual workflows for ingest and phylogenetic, where ingest triggers the phylogenetic workflow after completion.

joverlee521 added 5 commits March 29, 2024 15:18

Start GH Action workflow for automation

d44f2ae

Currently just runs the ingest workflow and uploads the results to AWS S3. Subsequent commits will add automation for the phylogenetic workflow.

joverlee521 requested a review from a team March 29, 2024 23:44

jameshadfield reviewed Apr 4, 2024

View reviewed changes

joverlee521 merged commit 4bebe32 into main Apr 4, 2024
32 checks passed

joverlee521 deleted the automate-workflows branch April 4, 2024 20:23

j23414 mentioned this pull request Apr 5, 2024

Automate ingest and phylogenetic workflows nextstrain/dengue#38

Merged

2 tasks

kimandrews mentioned this pull request Apr 5, 2024

Automate ingest and phylogenetic workflows nextstrain/measles#22

Merged

2 tasks

jameshadfield mentioned this pull request Apr 7, 2024

Output AWS Batch job ID from pathogen-repo-build worklfow nextstrain/.github#79

Closed

This was referenced Apr 8, 2024

Fix automation hash #54

Merged

Automate/simplify process for new sequences #18

Closed

Add independent ingest and phylogenetic GH Action workflows #55

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Automate ingest and phylogenetic workflows #52

Automate ingest and phylogenetic workflows #52

joverlee521 commented Mar 29, 2024 •

edited

Loading

joverlee521 commented Mar 29, 2024

jameshadfield Apr 4, 2024

joverlee521 Apr 4, 2024

jameshadfield Apr 4, 2024

joverlee521 Apr 4, 2024

jameshadfield Apr 5, 2024

joverlee521 commented Apr 4, 2024

joverlee521 commented Apr 5, 2024

Automate ingest and phylogenetic workflows #52

Automate ingest and phylogenetic workflows #52

Conversation

joverlee521 commented Mar 29, 2024 • edited Loading

Description of proposed changes

Related issue(s)

Checklist

joverlee521 commented Mar 29, 2024

jameshadfield Apr 4, 2024

Choose a reason for hiding this comment

joverlee521 Apr 4, 2024

Choose a reason for hiding this comment

jameshadfield Apr 4, 2024

Choose a reason for hiding this comment

joverlee521 Apr 4, 2024

Choose a reason for hiding this comment

jameshadfield Apr 5, 2024

Choose a reason for hiding this comment

joverlee521 commented Apr 4, 2024

joverlee521 commented Apr 5, 2024

joverlee521 commented Mar 29, 2024 •

edited

Loading