Skip to content

Commit

Permalink
Add section on multi-pass subsampling
Browse files Browse the repository at this point in the history
  • Loading branch information
victorlin committed Mar 5, 2024
1 parent 67dde2b commit 2d7f96d
Showing 1 changed file with 152 additions and 0 deletions.
152 changes: 152 additions & 0 deletions docs/usage/cli/filter.rst
Original file line number Diff line number Diff line change
Expand Up @@ -73,3 +73,155 @@ The filter command allows you to select a specific number of sequences from spec
--sequences-per-group 1 \
--output-sequences subsampled_sequences.fasta \
--output-metadata subsampled_metadata.tsv
Subsampling beyond the capabilities of ``augur filter``
-------------------------------------------------------

There are some subsampling strategies in which a single call to ``augur filter``
does not suffice. One such strategy is "tiered subsampling". In this strategy,
mutually exclusive sets of filters, each representing a "tier", are sampled with
different subsampling rules. This is commonly used to create geographic tiers.
Consider this subsampling scheme:

Sample 100 sequences from Washington state and 50 sequences from the rest of the United States.

This cannot be done in a single call to ``augur filter``. Instead, it can be
decomposed into multiple schemes, each handled by a single call to ``augur
filter``. Additionally, there is an extra step to combine the intermediate
samples.

1. Sample 100 sequences from Washington state.
2. Sample 50 sequences from the rest of the United States.
3. Combine the samples.

Calling ``augur filter`` multiple times
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A basic approach is to run the ``augur filter`` commands directly. This works
well for ad-hoc analyses.

.. code-block:: bash
# 1. Sample 100 sequences from Washington state
augur filter \
--sequences sequences.fasta \
--metadata metadata.tsv \
--query "state == 'WA'" \
--subsample-max-sequences 100 \
--output-strains sample_strains_state.txt
# 2. Sample 50 sequences from the rest of the United States
augur filter \
--sequences sequences.fasta \
--metadata metadata.tsv \
--query "state != 'WA' & country == 'USA'" \
--subsample-max-sequences 50 \
--output-strains sample_strains_country.txt
# 3. Combine using augur filter
augur filter \
--sequences sequences.fasta \
--metadata metadata.tsv \
--exclude-all \
--include sample_strains_state.txt \
sample_strains_country.txt \
--output-sequences subsampled_sequences.fasta \
--output-metadata subsampled_metadata.tsv
Each intermediate sample is represented by a strain list file obtained from
``--output-strains``. The final step uses ``augur filter`` with ``--exclude-all``
and ``--include`` to sample the data based on the intermediate strain list
files. If the same strain appears in both files, ``augur filter`` will only
write it once in each of the final outputs.

Generalizing subsampling in a workflow
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The approach above can be cumbersome with more intermediate samples. To
generalize this process and allow for more flexibility, a workflow management
system can be used. The following examples use `Snakemake`_.

1. Add a section in the `config file`_.

.. code-block:: yaml
subsampling:
state: --query "state == 'WA'" --subsample-max-sequences 100
country: --query "state != 'WA' & country == 'USA'" --subsample-max-sequences 50
2. Add two rules in a `Snakefile`_. If you are building a standard Nextstrain
workflow, the output files should be used as input to sequence alignment. See
:doc:`docs.nextstrain.org:learn/parts` to learn more about the placement of
this step within a workflow.

.. code-block:: python
# 1. Sample 100 sequences from Washington state
# 2. Sample 50 sequences from the rest of the United States
rule intermediate_sample:
input:
metadata = "data/metadata.tsv",
output:
strains = "results/sample_strains_{sample_name}.txt",
params:
augur_filter_args = lambda wildcards: config.get("subsampling", {}).get(wildcards.sample_name, "")
shell:
"""
augur filter \
--metadata {input.metadata} \
{params.augur_filter_args} \
--output-strains {output.strains}
"""
# 3. Combine using augur filter
rule combine_intermediate_samples:
input:
sequences = "data/sequences.fasta",
metadata = "data/metadata.tsv",
intermediate_sample_strains = expand("results/sample_strains_{sample_name}.txt", sample_name=list(config.get("subsampling", {}).keys()))
output:
sequences = "results/subsampled_sequences.fasta",
metadata = "results/subsampled_metadata.tsv",
shell:
"""
augur filter \
--sequences {input.sequences} \
--metadata {input.metadata} \
--exclude-all \
--include {input.intermediate_sample_strains} \
--output-sequences {output.sequences} \
--output-metadata {output.metadata}
"""
3. Run Snakemake targeting the second rule.

.. code-block:: bash
snakemake combine_intermediate_samples
Explanation:

- The configuration section consists of one entry per intermediate sample in the
format ``sample_name: <augur filter arguments>``.
- The first rule is run once per intermediate sample using `wildcards`_ and an
`input function`_. The output of each run is the sampled strain list.
- The second rule uses `expand()`_ to define input as all the intermediate
sampled strain lists, which are passed directly to ``--include`` as done in
the previous example.

It is easy to add or remove intermediate samples. The configuration above can be
updated to add another tier in between state and country:

.. code-block:: yaml
subsampling:
state: --query "state == 'WA'" --subsample-max-sequences 100
neighboring_states: --query "state in {'CA', 'ID', 'OR', 'NV'}" --subsample-max-sequences 75
country: --query "country == 'USA' & state not in {'WA', 'CA', 'ID', 'OR', 'NV'}" --subsample-max-sequences 50
.. _Snakemake: https://snakemake.readthedocs.io/en/stable/index.html
.. _config file: https://snakemake.readthedocs.io/en/stable/snakefiles/configuration.html#snakefiles-standard-configuration
.. _Snakefile: https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html
.. _wildcards: https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#wildcards
.. _input function: https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#snakefiles-input-functions
.. _expand(): https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#the-expand-function

0 comments on commit 2d7f96d

Please sign in to comment.