WIP: Use metadata-only filtering in subsample jobs #571

huddlej · 2021-03-03T00:18:03Z

Description of proposed changes

Minimize inspection of large FASTA files by subsampling only with metadata and a sequence index and outputting only strain names for each subsample rule. Uses the new metadata-only filter interface proposed for Augur to collect the resulting subsampled strains and output a single FASTA file of all distinct strains.
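As a rough illustration of the idea (not Augur's implementation), a subsample step that reads only metadata and emits strain names could look like the sketch below. The function name `subsample_strains`, the `strain` and `region` column names, and the per-group sampling scheme are assumptions made for this example. The sequence index is left out for brevity.

```python
import csv
import io
import random
from collections import defaultdict


def subsample_strains(metadata_handle, group_by, per_group, seed=0):
    """Return sorted strain names, at most `per_group` per value of `group_by`.

    Only the metadata table is read; no sequence data is touched.
    """
    groups = defaultdict(list)
    for row in csv.DictReader(metadata_handle, delimiter="\t"):
        groups[row[group_by]].append(row["strain"])

    rng = random.Random(seed)  # seeded so repeated runs pick the same strains
    selected = []
    for _, strains in sorted(groups.items()):
        k = min(per_group, len(strains))
        selected.extend(rng.sample(strains, k))
    return sorted(selected)


# Hypothetical metadata table; the real workflow reads a TSV file from disk.
metadata = io.StringIO(
    "strain\tregion\n"
    "strain_1\tEurope\n"
    "strain_2\tEurope\n"
    "strain_3\tAsia\n"
)

# One strain name per line is all a subsample rule needs to output.
for name in subsample_strains(metadata, group_by="region", per_group=1):
    print(name)
```

The output is a plain strain list, so sequences only need to be read once, when the final FASTA is assembled.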

As noted in 3216c5f, some subsample jobs require priority scores for another subsampled set and these score calculations require subsampled sequences. That same commit adds a rule to extract only the subsampled sequences for a specific set as needed.

Although the goal of metadata-only filtering is to dramatically speed up this workflow, this PR does not address a major bottleneck in the current workflow: the time augur filter needs to write sequences to disk.

Related issue(s)

Related to nextstrain/augur#679

Testing

I've tested this workflow with the Nextstrain Europe build and also the multiple inputs example data.

CI tests will fail because this PR requires an Augur development branch (code that is not available in the Docker image). To test this PR locally, run Snakemake with the conda mode like so:

snakemake --use-conda --cores 4 --profile nextstrain_profiles/nextstrain --config active_builds=europe

Augur filter's new metadata-only interface will allow users to pass multiple inputs to the `--exclude` argument, internally deduplicating these strain lists. This new interface eliminates the need for a separate Snakemake rule to cat the exclusion files.
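The deduplication described above can be sketched in a few lines. This is a minimal illustration, not Augur's code: the helper name `read_strain_lists`, the one-strain-per-line file layout, and the handling of `#` comment lines are assumptions.

```python
import tempfile
from pathlib import Path


def read_strain_lists(paths):
    """Merge several strain-list files into one deduplicated set of names."""
    strains = set()
    for path in paths:
        with open(path) as handle:
            for line in handle:
                name = line.strip()
                # Assumption: blank lines and lines starting with "#" are ignored.
                if name and not name.startswith("#"):
                    strains.add(name)
    return strains


# Hypothetical exclusion files that overlap on one strain.
with tempfile.TemporaryDirectory() as tmp:
    first = Path(tmp) / "exclude_a.txt"
    second = Path(tmp) / "exclude_b.txt"
    first.write_text("strain_1\nstrain_2\n")
    second.write_text("# manually curated\nstrain_2\nstrain_3\n")

    excluded = read_strain_lists([first, second])
    print(sorted(excluded))
```

Because the merge happens when the lists are read, the workflow does not need an intermediate concatenated file.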

Replaces FASTA outputs with strain list outputs for the subsample rule such that sequence data are not inspected during most subsampling steps. The exception is subsampling jobs that require a priority score calculation that depends on the FASTA sequences of another subsampled group. To handle this exception, we add a new rule to extract just those subsampled sequences. Finally, we collect subsampled sequences into a single deduplicated FASTA output using augur filter's new interface with the `--exclude-all` flag and multiple input support for `--include`. Note that this commit also updates the conda environment to use a GitHub branch instead of an official augur release.
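The final collection step amounts to taking the union of the per-subsample strain lists and making one pass over the full FASTA. The sketch below shows that logic under stated assumptions: `extract_sequences` is a hypothetical helper, record names are taken as the first whitespace-delimited token of each header, and none of this reflects how augur filter is implemented internally.

```python
import io


def extract_sequences(fasta_in, wanted, fasta_out):
    """Copy records named in `wanted` from fasta_in to fasta_out, once each.

    The input is streamed line by line, so the full file is never held in memory.
    Returns the set of strain names that were written.
    """
    written = set()
    keep = False
    for line in fasta_in:
        if line.startswith(">"):
            fields = line[1:].split()
            name = fields[0] if fields else ""
            # Skip strains nobody asked for and repeated records of the same strain.
            keep = name in wanted and name not in written
            if keep:
                written.add(name)
        if keep:
            fasta_out.write(line)
    return written


# Hypothetical strain lists from two subsample rules; strain_2 appears in both.
subsamples = [{"strain_1", "strain_2"}, {"strain_2", "strain_3"}]
wanted = set().union(*subsamples)

all_sequences = io.StringIO(
    ">strain_1\nACGT\n>strain_2\nGGCC\n>strain_4\nTTTT\n>strain_3\nAAAA\n"
)
combined = io.StringIO()
extract_sequences(all_sequences, wanted, combined)
print(combined.getvalue(), end="")
```

Taking the union before touching the FASTA is what keeps this to a single read of the large input, regardless of how many subsample rules contributed strains.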

huddlej · 2021-05-24T15:21:51Z

This PR will be superseded by work on a related PR that adds a new form of priority calculation.

huddlej mentioned this pull request Mar 3, 2021

Stream combined sequences to disk instead of loading into memory #572

Merged

huddlej added 3 commits April 14, 2021 13:31

Fix a typo in the masked file path on S3

70707be

huddlej force-pushed the metadata-only branch from 3216c5f to 102a0e2 Compare April 14, 2021 22:29

huddlej marked this pull request as ready for review April 14, 2021 22:30

huddlej changed the title ~~Use metadata-only filtering in subsample jobs~~ WIP: Use metadata-only filtering in subsample jobs Apr 14, 2021

huddlej closed this May 24, 2021

huddlej deleted the metadata-only branch May 24, 2021 15:21
