Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update usage of "accession" as the ID column #161

Merged
merged 4 commits into from
Aug 1, 2023

Conversation

victorlin
Copy link
Member

@victorlin victorlin commented Jul 17, 2023

Description of proposed changes

Improvements and clarifications around the choice of "accession" as the ID column in this repo. See commit messages.

Related issue(s)

Checklist

  • Checks pass
  • Ingest workflow runs locally
  • Manually compared Auspice JSON visualized in Auspice. Tip attributes are identical. (ref)

@victorlin victorlin self-assigned this Jul 17, 2023
Copy link
Contributor

@joverlee521 joverlee521 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for following up on this! Really nice to see the simplification!

The only request I have is to update the min_augur_version in the main Snakefile.

@victorlin victorlin force-pushed the victorlin/use-custom-metadata-id-column branch from 7f38daa to 4d8c515 Compare July 17, 2023 23:06
@victorlin victorlin changed the title Use "accession" column as ID column directly Update usage of "accession" as the ID column Jul 19, 2023
@victorlin victorlin force-pushed the victorlin/use-custom-metadata-id-column branch from 2b57ce5 to 67c3312 Compare July 20, 2023 20:25
joverlee521
joverlee521 previously approved these changes Jul 21, 2023
Copy link
Contributor

@joverlee521 joverlee521 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The output of the pathogen-repo-CI looks as expected.
I've only left some minor non-blocking comments.

workflow/snakemake_rules/core.smk Outdated Show resolved Hide resolved
workflow/snakemake_rules/core.smk Outdated Show resolved Hide resolved
@victorlin
Copy link
Member Author

Thanks for the review @joverlee521! I'll take some of your suggestions. I also just checked, and this PR suffers from the same issue of missing the accession node attribute as the similar PR in RSV. This should be fixed before merging, so I'll get to it another day.

@joverlee521
Copy link
Contributor

joverlee521 commented Jul 25, 2023

Ah, thanks @victorlin for pointing out the accession issue identified in the RSV repo! This made me realize that actually there's a lot of attrs missing in the final Auspice tree when the accession and the strain field have different values.

I think there's additional bugs in augur export that I'll comment on in nextstrain/augur#1261. in a separate issue.

@joverlee521
Copy link
Contributor

Ah, also just realized that the construct-recency-from-submission-date script also uses read_metadata and should be updated to use the correct ID column.

@victorlin
Copy link
Member Author

To test that the new Augur changes address the issue here, I created a Docker image with the unreleased Augur changes, updated CI here to use that, and uploaded the CI result's Auspice JSONs to a personal repo to be viewed through nextstrain.org/community. Here are the results:

  1. master/docker: From the latest CI run on the default branch.
  2. pr161/conda: From a test run on a4bf2e8. This is without the new Augur changes.
  3. pr161/docker. From the same test run as above. Since the custom Docker image was used, this is with the new Augur changes.

(2) is missing tip attributes as expected. (3) is the same as (1) but missing Submission Recency, due to @joverlee521's note in the previous comment.

@victorlin victorlin force-pushed the victorlin/use-custom-metadata-id-column branch from c10fde8 to 5b1b4ae Compare July 27, 2023 20:52
@victorlin
Copy link
Member Author

Ok, the force-push above fixed the Submission Recency problem. These are the new results:

  1. pr161/5b1b4ae/conda: This is without the new Augur changes, but now includes Submission Recency.
  2. pr161/5b1b4ae/docker: This is with the new Augur changes, and includes Submission Recency.

(2) is what we should expect once the Augur changes are released and 0b9059e is fixed up. I think it looks comparable to master/85d33b0/docker.

@victorlin victorlin force-pushed the victorlin/use-custom-metadata-id-column branch from 5b1b4ae to da64839 Compare July 31, 2023 20:47
Snakefile Outdated
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Preview links for da64839's PR-triggered CI run results:

  1. pr161/da64839/conda
  2. pr161/da64839/docker

Both of these should be comparable to master/85d33b0/docker.

@victorlin victorlin marked this pull request as ready for review July 31, 2023 22:07
Copy link
Contributor

@joverlee521 joverlee521 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working through the hidden bugs in Augur to get this working!
The phylogenetic build changes and the CI output LGTM.

The only requested change is the one error that I flagged.

ingest/workflow/snakemake_rules/nextclade.smk Outdated Show resolved Hide resolved
Add a note where it should be kept.
Retrieving config that is required should be done using the bracket
syntax.

Retrieving config with .get should always be done with a default value.
There was useful information on why "accession" is used over "strain" in
the docstring of update_example_data. Update that rule to use the
strain_id_field config, and move the context to every config file that
sets strain_id_field as "accession".

Don't set defaults when retrieving strain_id_field so "accession" is
only set on the config level.
New options in Augur 22.2.0 allow usage of this column as the ID column
across all subcommands that read metadata.

For this workflow in particular, the metadata file can now be used
as-is. This removes the need for a modified copy of the metadata. It
also allows specifying an original metadata column "strain" as the
display strain field, rather than a column "strain_original" generated
during the Snakemake workflow.
@victorlin victorlin merged commit d5dc3b2 into master Aug 1, 2023
8 checks passed
@victorlin victorlin deleted the victorlin/use-custom-metadata-id-column branch August 1, 2023 22:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

2 participants