Add section on multi-pass subsampling

nextstrain · Mar 5, 2024 · 2d7f96d · 2d7f96d
1 parent 67dde2b
commit 2d7f96d
Showing 1 changed file with 152 additions and 0 deletions.
diff --git a/docs/usage/cli/filter.rst b/docs/usage/cli/filter.rst
@@ -73,3 +73,155 @@ The filter command allows you to select a specific number of sequences from spec
     --sequences-per-group 1 \
     --output-sequences subsampled_sequences.fasta \
     --output-metadata subsampled_metadata.tsv
+
+Subsampling beyond the capabilities of ``augur filter``
+-------------------------------------------------------
+
+There are some subsampling strategies in which a single call to ``augur filter``
+does not suffice. One such strategy is "tiered subsampling". In this strategy,
+mutually exclusive sets of filters, each representing a "tier", are sampled with
+different subsampling rules. This is commonly used to create geographic tiers.
+Consider this subsampling scheme:
+
+    Sample 100 sequences from Washington state and 50 sequences from the rest of the United States.
+
+This cannot be done in a single call to ``augur filter``. Instead, it can be
+decomposed into multiple schemes, each handled by a single call to ``augur
+filter``. Additionally, there is an extra step to combine the intermediate
+samples.
+
+    1. Sample 100 sequences from Washington state.
+    2. Sample 50 sequences from the rest of the United States.
+    3. Combine the samples.
+
+Calling ``augur filter`` multiple times
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+A basic approach is to run the ``augur filter`` commands directly. This works
+well for ad-hoc analyses.
+
+.. code-block:: bash
+
+  # 1. Sample 100 sequences from Washington state
+  augur filter \
+    --sequences sequences.fasta \
+    --metadata metadata.tsv \
+    --query "state == 'WA'" \
+    --subsample-max-sequences 100 \
+    --output-strains sample_strains_state.txt
+
+  # 2. Sample 50 sequences from the rest of the United States
+  augur filter \
+    --sequences sequences.fasta \
+    --metadata metadata.tsv \
+    --query "state != 'WA' & country == 'USA'" \
+    --subsample-max-sequences 50 \
+    --output-strains sample_strains_country.txt
+
+  # 3. Combine using augur filter
+  augur filter \
+    --sequences sequences.fasta \
+    --metadata metadata.tsv \
+    --exclude-all \
+    --include sample_strains_state.txt \
+              sample_strains_country.txt \
+    --output-sequences subsampled_sequences.fasta \
+    --output-metadata subsampled_metadata.tsv
+
+Each intermediate sample is represented by a strain list file obtained from
+``--output-strains``. The final step uses ``augur filter`` with ``--exclude-all``
+and ``--include`` to sample the data based on the intermediate strain list
+files. If the same strain appears in both files, ``augur filter`` will only
+write it once in each of the final outputs.
+
+Generalizing subsampling in a workflow
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The approach above can be cumbersome with more intermediate samples. To
+generalize this process and allow for more flexibility, a workflow management
+system can be used. The following examples use `Snakemake`_.
+
+1. Add a section in the `config file`_.
+
+  .. code-block:: yaml
+
+    subsampling:
+      state: --query "state == 'WA'" --subsample-max-sequences 100
+      country: --query "state != 'WA' & country == 'USA'" --subsample-max-sequences 50
+
+2. Add two rules in a `Snakefile`_. If you are building a standard Nextstrain
+   workflow, the output files should be used as input to sequence alignment. See
+   :doc:`docs.nextstrain.org:learn/parts` to learn more about the placement of
+   this step within a workflow.
+
+  .. code-block:: python
+
+    # 1. Sample 100 sequences from Washington state
+    # 2. Sample 50 sequences from the rest of the United States
+    rule intermediate_sample:
+        input:
+            metadata = "data/metadata.tsv",
+        output:
+            strains = "results/sample_strains_{sample_name}.txt",
+        params:
+            augur_filter_args = lambda wildcards: config.get("subsampling", {}).get(wildcards.sample_name, "")
+        shell:
+            """
+            augur filter \
+                --metadata {input.metadata} \
+                {params.augur_filter_args} \
+                --output-strains {output.strains}
+            """
+
+    # 3. Combine using augur filter
+    rule combine_intermediate_samples:
+        input:
+            sequences = "data/sequences.fasta",
+            metadata = "data/metadata.tsv",
+            intermediate_sample_strains = expand("results/sample_strains_{sample_name}.txt", sample_name=list(config.get("subsampling", {}).keys()))
+        output:
+            sequences = "results/subsampled_sequences.fasta",
+            metadata = "results/subsampled_metadata.tsv",
+        shell:
+            """
+            augur filter \
+                --sequences {input.sequences} \
+                --metadata {input.metadata} \
+                --exclude-all \
+                --include {input.intermediate_sample_strains} \
+                --output-sequences {output.sequences} \
+                --output-metadata {output.metadata}
+            """
+
+3. Run Snakemake targeting the second rule.
+
+  .. code-block:: bash
+
+    snakemake combine_intermediate_samples
+
+Explanation:
+
+- The configuration section consists of one entry per intermediate sample in the
+  format ``sample_name: <augur filter arguments>``.
+- The first rule is run once per intermediate sample using `wildcards`_ and an
+  `input function`_. The output of each run is the sampled strain list.
+- The second rule uses `expand()`_ to define input as all the intermediate
+  sampled strain lists, which are passed directly to ``--include`` as done in
+  the previous example.
+
+It is easy to add or remove intermediate samples. The configuration above can be
+updated to add another tier in between state and country:
+
+  .. code-block:: yaml
+
+    subsampling:
+      state: --query "state == 'WA'" --subsample-max-sequences 100
+      neighboring_states: --query "state in {'CA', 'ID', 'OR', 'NV'}" --subsample-max-sequences 75
+      country: --query "country == 'USA' & state not in {'WA', 'CA', 'ID', 'OR', 'NV'}" --subsample-max-sequences 50
+
+.. _Snakemake: https://snakemake.readthedocs.io/en/stable/index.html
+.. _config file: https://snakemake.readthedocs.io/en/stable/snakefiles/configuration.html#snakefiles-standard-configuration
+.. _Snakefile: https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html
+.. _wildcards: https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#wildcards
+.. _input function: https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#snakefiles-input-functions
+.. _expand(): https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#the-expand-function