Add section on multi-pass subsampling

nextstrain · Feb 24, 2024 · 4137df9 · 4137df9
1 parent 5ce5902
commit 4137df9
Showing 1 changed file with 181 additions and 0 deletions.
diff --git a/docs/usage/cli/filter.rst b/docs/usage/cli/filter.rst
@@ -71,3 +71,184 @@ The filter command allows you to select a specific number of sequences from spec
     --output filtered.fasta
 
 This subsampling and filtering will reduce the number of sequences in the tutorial data set from 34 to 24.
+
+Subsampling beyond the capabilities of ``augur filter``
+-------------------------------------------------------
+
+There are some subsampling strategies in which a single call to ``augur filter``
+does not suffice. One such strategy is "tiered subsampling". In this strategy,
+mutually exclusive sets of filters, each representing a "tier", are sampled with
+different subsampling rules. This is commonly used to create geographic tiers.
+Consider this subsampling scheme:
+
+    Sample 20 sequences from North America and 50 sequences from Europe.
+
+This cannot be done in a single call to ``augur filter``. Instead, it can be
+decomposed into multiple schemes, each handled by a single call to ``augur
+filter``. Additionally, there is an extra step to combine the intermediate
+samples.
+
+    1. Sample 20 sequences from North America.
+    2. Sample 50 sequences from Europe.
+    3. Combine the samples.
+
+Calling ``augur filter`` multiple times
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Create each intermediate sample using ``augur filter``, then join them together
+to create the final subsample.
+
+.. code-block:: bash
+
+    # 1. Sample 20 sequences from North America
+	augur filter \
+      --sequences sequences.fasta \
+      --metadata metadata.tsv \
+      --query "region == 'North America'" \
+      --subsample-max-sequences 20 \
+      --output-sequences north_america_sample.fasta \
+      --output-metadata north_america_sample.tsv
+
+    # 2. Sample 50 sequences from Europe
+	augur filter \
+      --sequences sequences.fasta \
+      --metadata metadata.tsv \
+      --query "region == 'Europe'" \
+      --subsample-max-sequences 50 \
+      --output-sequences europe_sample.fasta \
+      --output-metadata europe_sample.tsv
+
+    # 3.1 Combine FASTA files
+    cat north_america_sample.fasta europe_sample.fasta > combined.fasta
+
+    # 3.2 Combine metadata files, using tail to skip the extra header row
+    cat north_america_sample.tsv > combined.tsv
+    tail -n+2 europe_sample.tsv >> combined.tsv
+
+The above approach works fine, but if no sequence filters are used, there is a better approach:
+
+.. code-block:: bash
+
+    # 1. Sample 20 sequences from North America
+	augur filter \
+      --metadata metadata.tsv \
+      --query "region == 'North America'" \
+      --subsample-max-sequences 20 \
+      --output-strains north_america_sample_strains.txt
+
+    # 2. Sample 50 sequences from Europe
+	augur filter \
+      --metadata metadata.tsv \
+      --query "region == 'Europe'" \
+      --subsample-max-sequences 50 \
+      --output-metadata europe_sample_strains.txt
+
+    # 3. Combine using augur filter
+	augur filter \
+      --sequences sequences.fasta \
+      --metadata metadata.tsv \
+      --exclude-all \
+      --include north_america_sample_strains.txt \
+                europe_sample_strains.txt \
+      --output-sequences combined.fasta \
+      --output-metadata combined.tsv
+
+Differences:
+
+- In the first two steps, ``--input-sequences`` is removed.
+- In the first two steps, ``--output-sequences`` and ``--output-metadata`` are
+  replaced by ``--output-strains``.
+- The third step of combining the two intermediate samples is done using ``augur
+  filter`` itself. This uses ``augur filter`` to sample the data based on an
+  exact list of strains using a combination of ``--exclude-all`` and ``--include``.
+
+For large datasets, this has a computational benefit. Reading through a large
+FASTA file can take a long time. Instead of reading through it twice in the
+first two steps, read through it once in the final step.
+
+Stitching calls together with Snakemake
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The approach above can be cumbersome with more intermediate samples. There is a
+more flexible approach using `Snakemake`_.
+
+1. Add a section in the `config file`_.
+
+  .. code-block:: yaml
+
+    subsampling:
+      north_america: --query "region == 'North America'" --subsample-max-sequences 20
+      europe: --query "region == 'Europe'" --subsample-max-sequences 50
+
+2. Add two rules in a `Snakefile`_.
+
+  .. code-block:: python
+
+    # 1. Sample 20 sequences from North America
+    # 2. Sample 50 sequences from Europe
+    rule intermediate_sample:
+        input:
+            metadata = "data/metadata.tsv",
+        output:
+            strains = "results/{sample_name}_sample_strains.txt",
+        params:
+            augur_filter_args = lambda wildcards: config.get("subsampling", {}).get(wildcards.sample_name, "")
+        shell:
+            """
+            augur filter \
+                --metadata {input.metadata} \
+                {params.augur_filter_args} \
+                --output-strains {output.strains}
+            """
+
+    # 3. Combine using augur filter
+    rule combine_intermediate_samples:
+        input:
+            sequences = "data/sequences.fasta",
+            metadata = "data/metadata.tsv",
+            intermediate_sample_strains = expand("results/{sample_name}_sample_strains.txt", sample_name=list(config.get("subsampling", {}).keys()))
+        output:
+            sequences = "results/combined.fasta",
+            metadata = "results/combined.tsv",
+        shell:
+            """
+            augur filter \
+                --sequences {input.sequences} \
+                --metadata {input.metadata} \
+                --exclude-all \
+                --include {input.intermediate_sample_strains} \
+                --output-sequences {output.sequences} \
+                --output-metadata {output.metadata}
+            """
+
+3. Run Snakemake targeting the second rule.
+
+  .. code-block:: bash
+
+    snakemake --forcerun combine_intermediate_samples
+
+Explanation:
+
+- The configuration section consists of one entry per intermediate sample in the
+  format ``sample_name: <augur filter arguments>``.
+- The first rule is run once per intermediate sample using `wildcards`_ and an
+  `input function`_. The output of each run is the sampled strain list.
+- The second rule uses `expand()`_ to define input as all the intermediate
+  sampled strain lists, which are passed directly to ``--include`` as done in
+  the previous example.
+
+It is easy to add or remove intermediate samples. Example:
+
+.. code-block:: diff
+
+  subsampling:
+  -  north_america: --query "region == 'North America'" --subsample-max-sequences 20
+     europe: --query "region == 'Europe'" --subsample-max-sequences 50
+  +  asia: --query "region == 'Asia'" --subsample-max-sequences 30
+
+.. _Snakemake: https://snakemake.readthedocs.io/en/stable/index.html
+.. _config file: https://snakemake.readthedocs.io/en/stable/snakefiles/configuration.html#snakefiles-standard-configuration
+.. _Snakefile: https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html
+.. _wildcards: https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#wildcards
+.. _input function: https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#snakefiles-input-functions
+.. _expand(): https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#the-expand-function