WIP: Use metadata-only filtering in subsample jobs #571

Closed
wants to merge 3 commits into from

@huddlej huddlej commented Mar 3, 2021

Description of proposed changes

Minimize inspection of large FASTA files by subsampling only with metadata and a sequence index and outputting only strain names for each subsample rule. Uses the new metadata-only filter interface proposed for Augur to collect the resulting subsampled strains and output a single FASTA file of all distinct strains.
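The idea of selecting strains from metadata alone can be illustrated with a minimal sketch. This is not the Augur implementation: the function name `select_strains`, the `strain` and `region` column names, and the toy records are all assumptions made for illustration. The point is only that a subsample step can emit strain names without opening the FASTA file.

```python
import csv
import io


def select_strains(metadata_handle, region, strain_column="strain", region_column="region"):
    """Return strain names whose metadata row matches `region`.

    Only the tab-delimited metadata is read; no sequence data is inspected.
    The column names are assumptions for this sketch.
    """
    reader = csv.DictReader(metadata_handle, delimiter="\t")
    return [row[strain_column] for row in reader if row[region_column] == region]


# Hypothetical toy metadata standing in for a real metadata TSV.
metadata = io.StringIO(
    "strain\tregion\tdate\n"
    "A/1\tEurope\t2021-01-01\n"
    "B/2\tAsia\t2021-01-02\n"
    "C/3\tEurope\t2021-01-03\n"
)

europe = select_strains(metadata, "Europe")

# A subsample step would write one strain name per line instead of a FASTA file.
print("\n".join(europe))
```

Each subsample rule then produces a small text file of names, and sequences are only read once all strain lists are known.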

As noted in 3216c5f, some subsample jobs require priority scores for another subsampled set, and these score calculations require subsampled sequences. That same commit adds a rule to extract only the subsampled sequences for a specific set as needed.

Although the goal of metadata-only filtering is to dramatically speed up this workflow, this PR does not address a major bottleneck in the current workflow: the time augur filter needs to write sequences to disk.

Related issue(s)

Related to nextstrain/augur#679

Testing

I've tested this workflow with the Nextstrain Europe build and also the multiple inputs example data.

CI tests will fail because this PR requires an Augur development branch that is not available in the Docker image. To test this PR locally, run Snakemake in conda mode like so:

snakemake --use-conda --cores 4 --profile nextstrain_profiles/nextstrain --config active_builds=europe

Augur filter's new metadata-only interface will allow users to pass
multiple inputs to the `--exclude` argument, internally deduplicating
these strain lists. This new interface eliminates the need for a
separate Snakemake rule to cat the exclusion files.
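As a rough illustration of what deduplicating several exclusion lists involves, here is a minimal sketch. It is not Augur's code: `merge_strain_lists` is a hypothetical helper, and the handling of blank lines and `#` comment lines is an assumption about the file format.

```python
import io


def merge_strain_lists(handles):
    """Combine several strain lists into one, keeping first-seen order.

    Blank lines and lines starting with '#' are skipped (an assumption
    about the list format, not a documented guarantee).
    """
    seen = {}
    for handle in handles:
        for line in handle:
            name = line.strip()
            if name and not name.startswith("#"):
                # dict keys preserve insertion order and ignore repeats
                seen.setdefault(name, None)
    return list(seen)


# Hypothetical exclusion lists with one overlapping strain.
first = io.StringIO("A/1\nB/2\n")
second = io.StringIO("B/2\n# a comment\n\nC/3\n")

merged = merge_strain_lists([first, second])
print(merged)
```

Doing the merge inside the filter step is what removes the need for a separate rule that concatenates the files first.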

Replaces FASTA outputs with strain list outputs for the subsample rule
such that sequence data are not inspected during most subsampling steps.
The exception is any subsampling job that requires a priority
score calculation that depends on the FASTA sequence of another
subsampled group. To handle this exception, we add a new rule to extract
just those subsampled sequences.

Finally, we collect subsampled sequences into a single deduplicated
FASTA output using augur filter's new interface with the `--exclude-all`
flag and multiple input support for `--include`.
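The collection step can be pictured as a single pass over the FASTA that keeps only records named in any strain list and writes each strain once. The sketch below is an assumption-laden illustration, not how augur filter is implemented: `read_fasta` and `collect_subsampled` are hypothetical helpers, and the record name is assumed to be the first word of the header line.

```python
import io


def read_fasta(handle):
    """Yield (name, sequence) pairs; name is the first word of each header."""
    name, chunks = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            name, chunks = (header[0] if header else ""), []
        elif name is not None:
            chunks.append(line.strip())
    if name is not None:
        yield name, "".join(chunks)


def collect_subsampled(fasta_handle, strain_lists):
    """Return records named in any strain list, each strain at most once."""
    wanted = set().union(*strain_lists)
    written = set()
    records = []
    for name, sequence in read_fasta(fasta_handle):
        if name in wanted and name not in written:
            written.add(name)
            records.append((name, sequence))
    return records


# Hypothetical input: strain A appears twice in the FASTA and C in two lists.
fasta = io.StringIO(">A\nAC\nGT\n>B\nTT\n>C\nGG\n>A\nAC\n")
records = collect_subsampled(fasta, [["A", "C"], ["C"]])

for name, sequence in records:
    print(f">{name}\n{sequence}")
```

Reading the large FASTA once at the end, rather than once per subsample rule, is the design choice this PR describes.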

Note that this commit also updates the conda environment to use a GitHub
branch instead of an official augur release.
@huddlej huddlej marked this pull request as ready for review April 14, 2021 22:30
@huddlej huddlej changed the title Use metadata-only filtering in subsample jobs WIP: Use metadata-only filtering in subsample jobs Apr 14, 2021
huddlej commented May 24, 2021

This PR will be superseded by work on a related PR that adds a new form of priority calculation.

@huddlej huddlej closed this May 24, 2021
@huddlej huddlej deleted the metadata-only branch May 24, 2021 15:21