WIP: Use metadata-only filtering in subsample jobs #571
Description of proposed changes
Minimize inspection of large FASTA files by subsampling with only the metadata and a sequence index, and by outputting only strain names for each subsample rule. Uses the new metadata-only filter interface proposed for Augur to collect the resulting subsampled strains and output a single FASTA file of all distinct strains.
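The collection step described above can be sketched roughly as follows. This is an illustration only, not the code in this PR or Augur's implementation: the function names (`read_strain_names`, `merge_strain_lists`, `extract_sequences`) and the sample strain names are hypothetical, and it assumes each subsample rule writes one strain name per line and that the FASTA record name is the first whitespace-delimited token of the header.

```python
import io


def read_strain_names(handle):
    """Read one strain name per line, skipping blank lines (assumed file layout)."""
    return [line.strip() for line in handle if line.strip()]


def merge_strain_lists(strain_lists):
    """Return the distinct strain names across all lists, in first-seen order."""
    seen = set()
    merged = []
    for strains in strain_lists:
        for strain in strains:
            if strain not in seen:
                seen.add(strain)
                merged.append(strain)
    return merged


def extract_sequences(fasta_handle, strains, output_handle):
    """Stream a FASTA file once and write only the records named in `strains`.

    Returns the number of records written.
    """
    wanted = set(strains)
    keep = False
    written = 0
    for line in fasta_handle:
        if line.startswith(">"):
            # Hypothetical convention: the record name is the first token of the header.
            parts = line[1:].split()
            name = parts[0] if parts else ""
            keep = name in wanted
            if keep:
                written += 1
        if keep:
            output_handle.write(line)
    return written


# Two hypothetical per-rule strain lists with one strain in common.
region = io.StringIO("strainA\nstrainB\n")
country = io.StringIO("strainB\nstrainC\n")
merged = merge_strain_lists([read_strain_names(region), read_strain_names(country)])

fasta = io.StringIO(">strainA\nACGT\n>strainB\nGGCC\n>strainC\nTTAA\n>strainD\nCCCC\n")
out = io.StringIO()
count = extract_sequences(fasta, merged, out)
```

Because the per-rule outputs are only strain names, the large FASTA file is read once at the end rather than once per subsample rule. The same `extract_sequences` helper could also be called with the strain list of a single subsampled set when only that set's sequences are needed.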
As noted in 3216c5f, some subsample jobs require priority scores relative to another subsampled set, and these score calculations require the subsampled sequences. That same commit adds a rule to extract only the subsampled sequences for a specific set as needed.
Although the goal of metadata-only filtering is to dramatically speed up this workflow, this PR does not address a major bottleneck in the current workflow: the time augur filter takes to write sequences to disk.
Related issue(s)
Related to nextstrain/augur#679
Testing
I've tested this workflow with the Nextstrain Europe build and also the multiple inputs example data.
CI tests will fail because this PR requires an Augur development branch (code that is not available in the Docker image). To test this PR locally, run Snakemake with the conda mode like so: