Skip to content

Commit

Permalink
Add section on multi-pass subsampling
Browse files Browse the repository at this point in the history
  • Loading branch information
victorlin committed Feb 24, 2024
1 parent 5ce5902 commit 4137df9
Showing 1 changed file with 181 additions and 0 deletions.
181 changes: 181 additions & 0 deletions docs/usage/cli/filter.rst
Original file line number Diff line number Diff line change
Expand Up @@ -71,3 +71,184 @@ The filter command allows you to select a specific number of sequences from spec
--output filtered.fasta
This subsampling and filtering will reduce the number of sequences in the tutorial data set from 34 to 24.

Subsampling beyond the capabilities of ``augur filter``
-------------------------------------------------------

There are some subsampling strategies in which a single call to ``augur filter``
does not suffice. One such strategy is "tiered subsampling". In this strategy,
mutually exclusive sets of filters, each representing a "tier", are sampled with
different subsampling rules. This is commonly used to create geographic tiers.
Consider this subsampling scheme:

Sample 20 sequences from North America and 50 sequences from Europe.

This cannot be done in a single call to ``augur filter``. Instead, it can be
decomposed into multiple schemes, each handled by a single call to ``augur
filter``. Additionally, there is an extra step to combine the intermediate
samples.

1. Sample 20 sequences from North America.
2. Sample 50 sequences from Europe.
3. Combine the samples.

Calling ``augur filter`` multiple times
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Create each intermediate sample using ``augur filter``, then join them together
to create the final subsample.

.. code-block:: bash
# 1. Sample 20 sequences from North America
augur filter \
--sequences sequences.fasta \
--metadata metadata.tsv \
--query "region == 'North America'" \
--subsample-max-sequences 20 \
--output-sequences north_america_sample.fasta \
--output-metadata north_america_sample.tsv
# 2. Sample 50 sequences from Europe
augur filter \
--sequences sequences.fasta \
--metadata metadata.tsv \
--query "region == 'Europe'" \
--subsample-max-sequences 50 \
--output-sequences europe_sample.fasta \
--output-metadata europe_sample.tsv
# 3.1 Combine FASTA files
cat north_america_sample.fasta europe_sample.fasta > combined.fasta
# 3.2 Combine metadata files, using tail to skip the extra header row
cat north_america_sample.tsv > combined.tsv
tail -n+2 europe_sample.tsv >> combined.tsv
The above approach works fine, but if no sequence filters are used, there is a better approach:

.. code-block:: bash
# 1. Sample 20 sequences from North America
augur filter \
--metadata metadata.tsv \
--query "region == 'North America'" \
--subsample-max-sequences 20 \
--output-strains north_america_sample_strains.txt
# 2. Sample 50 sequences from Europe
augur filter \
--metadata metadata.tsv \
--query "region == 'Europe'" \
--subsample-max-sequences 50 \
--output-metadata europe_sample_strains.txt
# 3. Combine using augur filter
augur filter \
--sequences sequences.fasta \
--metadata metadata.tsv \
--exclude-all \
--include north_america_sample_strains.txt \
europe_sample_strains.txt \
--output-sequences combined.fasta \
--output-metadata combined.tsv
Differences:

- In the first two steps, ``--input-sequences`` is removed.
- In the first two steps, ``--output-sequences`` and ``--output-metadata`` are
replaced by ``--output-strains``.
- The third step of combining the two intermediate samples is done using ``augur
filter`` itself. This uses ``augur filter`` to sample the data based on an
exact list of strains using a combination of ``--exclude-all`` and ``--include``.

For large datasets, this has a computational benefit. Reading through a large
FASTA file can take a long time. Instead of reading through it twice in the
first two steps, read through it once in the final step.

Stitching calls together with Snakemake
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The approach above can be cumbersome with more intermediate samples. There is a
more flexible approach using `Snakemake`_.

1. Add a section in the `config file`_.

.. code-block:: yaml
subsampling:
north_america: --query "region == 'North America'" --subsample-max-sequences 20
europe: --query "region == 'Europe'" --subsample-max-sequences 50
2. Add two rules in a `Snakefile`_.

.. code-block:: python
# 1. Sample 20 sequences from North America
# 2. Sample 50 sequences from Europe
rule intermediate_sample:
input:
metadata = "data/metadata.tsv",
output:
strains = "results/{sample_name}_sample_strains.txt",
params:
augur_filter_args = lambda wildcards: config.get("subsampling", {}).get(wildcards.sample_name, "")
shell:
"""
augur filter \
--metadata {input.metadata} \
{params.augur_filter_args} \
--output-strains {output.strains}
"""
# 3. Combine using augur filter
rule combine_intermediate_samples:
input:
sequences = "data/sequences.fasta",
metadata = "data/metadata.tsv",
intermediate_sample_strains = expand("results/{sample_name}_sample_strains.txt", sample_name=list(config.get("subsampling", {}).keys()))
output:
sequences = "results/combined.fasta",
metadata = "results/combined.tsv",
shell:
"""
augur filter \
--sequences {input.sequences} \
--metadata {input.metadata} \
--exclude-all \
--include {input.intermediate_sample_strains} \
--output-sequences {output.sequences} \
--output-metadata {output.metadata}
"""
3. Run Snakemake targeting the second rule.

.. code-block:: bash
snakemake --forcerun combine_intermediate_samples
Explanation:

- The configuration section consists of one entry per intermediate sample in the
format ``sample_name: <augur filter arguments>``.
- The first rule is run once per intermediate sample using `wildcards`_ and an
`input function`_. The output of each run is the sampled strain list.
- The second rule uses `expand()`_ to define input as all the intermediate
sampled strain lists, which are passed directly to ``--include`` as done in
the previous example.

It is easy to add or remove intermediate samples. Example:

.. code-block:: diff
subsampling:
- north_america: --query "region == 'North America'" --subsample-max-sequences 20
europe: --query "region == 'Europe'" --subsample-max-sequences 50
+ asia: --query "region == 'Asia'" --subsample-max-sequences 30
.. _Snakemake: https://snakemake.readthedocs.io/en/stable/index.html
.. _config file: https://snakemake.readthedocs.io/en/stable/snakefiles/configuration.html#snakefiles-standard-configuration
.. _Snakefile: https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html
.. _wildcards: https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#wildcards
.. _input function: https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#snakefiles-input-functions
.. _expand(): https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#the-expand-function

0 comments on commit 4137df9

Please sign in to comment.