Merge pull request #198: Move filtering/subsampling docs into this pr…

…oject
nextstrain · May 1, 2024 · 6464665 · 6464665
2 parents e026d5a + 1f77c01
commit 6464665
Show file tree

Hide file tree

Showing 4 changed files with 211 additions and 5 deletions.
diff --git a/.gitignore b/.gitignore
@@ -10,4 +10,3 @@ src/guides/bioinformatics/translate_ref.md
 src/guides/bioinformatics/vcf_input.md
 src/guides/communicate/create-pdf.md
 src/guides/communicate/narratives-intro.md
-src/snippets/filtering-and-subsampling.rst
diff --git a/src/conf.py b/src/conf.py
@@ -82,8 +82,6 @@
     'augur': ('https://docs.nextstrain.org/projects/augur/page/', None),
     'auspice': ('https://docs.nextstrain.org/projects/auspice/page/', None),
     'cli': ('https://docs.nextstrain.org/projects/cli/page/', None),
-    # For externally sourced snippets that reference this docs project
-    'docs.nextstrain.org': ('https://docs.nextstrain.org/page/', None),
     'nextclade': ('https://docs.nextstrain.org/projects/nextclade/page/', None),
     'ncov': ('https://docs.nextstrain.org/projects/ncov/page/', None)
 }
diff --git a/src/fetch-docs.py b/src/fetch-docs.py
@@ -22,7 +22,6 @@
     f'{augur_url}faq/fasta_input.md': 'guides/bioinformatics/fasta_input.md',
     f'{augur_url}faq/seq_traits.md': 'guides/bioinformatics/seq_traits.md',
     f'{augur_url}examples/examples.rst': 'guides/bioinformatics/examples.rst',
-    f'{augur_url}usage/cli/snippets/filtering-and-subsampling.rst': 'snippets/filtering-and-subsampling.rst',
 }
 
 if __name__ == '__main__':

diff --git a/src/guides/bioinformatics/filtering-and-subsampling.rst b/src/guides/bioinformatics/filtering-and-subsampling.rst
@@ -8,4 +8,214 @@ sample data.
 .. contents:: Table of Contents
    :local:
 
-.. include:: ../../snippets/filtering-and-subsampling.rst
+Filtering
+=========
+
+The filter command allows you to select various subsets of your input data for
+different types of analysis. A simple example use of this command would be
+
+.. code-block:: bash
+
+   augur filter \
+     --sequences data/sequences.fasta \
+     --metadata data/metadata.tsv \
+     --min-date 2012 \
+     --output-sequences filtered_sequences.fasta \
+     --output-metadata filtered_metadata.tsv
+
+This command will select all sequences with collection date in 2012 or later.
+The filter command has a large number of options that allow flexible filtering
+for many common situations. One such use-case is the exclusion of sequences that
+are known to be outliers (e.g. because of sequencing errors, cell-culture
+adaptation, ...). These can be specified in a separate text file (e.g.
+``exclude.txt``):
+
+.. code-block::
+
+   BRA/2016/FC_DQ75D1
+   COL/FLR_00034/2015
+   ...
+
+To drop such strains, you can pass the filename to ``--exclude``:
+
+.. code-block:: bash
+
+   augur filter \
+     --sequences data/sequences.fasta \
+     --metadata data/metadata.tsv \
+     --min-date 2012 \
+     --exclude exclude.txt \
+     --output-sequences filtered_sequences.fasta \
+     --output-metadata filtered_metadata.tsv
+
+Subsampling within ``augur filter``
+===================================
+
+Another common filtering operation is subsetting of data to a achieve a more
+even spatio-temporal distribution or to cut-down data set size to more
+manageable numbers. The filter command allows you to select a specific number of
+sequences from specific groups, for example one sequence per month from each
+country:
+
+.. code-block:: bash
+
+   augur filter \
+     --sequences data/sequences.fasta \
+     --metadata data/metadata.tsv \
+     --min-date 2012 \
+     --exclude exclude.txt \
+     --group-by country year month \
+     --sequences-per-group 1 \
+     --output-sequences subsampled_sequences.fasta \
+     --output-metadata subsampled_metadata.tsv
+
+Subsampling using multiple ``augur filter`` commands
+====================================================
+
+There are some subsampling strategies in which a single call to ``augur filter``
+does not suffice. One such strategy is "tiered subsampling". In this strategy,
+mutually exclusive sets of filters, each representing a "tier", are sampled with
+different subsampling rules. This is commonly used to create geographic tiers.
+Consider this subsampling scheme:
+
+   Sample 100 sequences from Washington state and 50 sequences from the rest of the United States.
+
+This cannot be done in a single call to ``augur filter``. Instead, it can be
+decomposed into multiple schemes, each handled by a single call to ``augur
+filter``. Additionally, there is an extra step to combine the intermediate
+samples.
+
+   1. Sample 100 sequences from Washington state.
+   2. Sample 50 sequences from the rest of the United States.
+   3. Combine the samples.
+
+Calling ``augur filter`` multiple times
+---------------------------------------
+
+A basic approach is to run the ``augur filter`` commands directly. This works
+well for ad-hoc analyses.
+
+.. code-block:: bash
+
+   # 1. Sample 100 sequences from Washington state
+   augur filter \
+     --sequences sequences.fasta \
+     --metadata metadata.tsv \
+     --query "state == 'WA'" \
+     --subsample-max-sequences 100 \
+     --output-strains sample_strains_state.txt
+ 
+   # 2. Sample 50 sequences from the rest of the United States
+   augur filter \
+     --sequences sequences.fasta \
+     --metadata metadata.tsv \
+     --query "state != 'WA' & country == 'USA'" \
+     --subsample-max-sequences 50 \
+     --output-strains sample_strains_country.txt
+ 
+   # 3. Combine using augur filter
+   augur filter \
+     --sequences sequences.fasta \
+     --metadata metadata.tsv \
+     --exclude-all \
+     --include sample_strains_state.txt \
+               sample_strains_country.txt \
+     --output-sequences subsampled_sequences.fasta \
+     --output-metadata subsampled_metadata.tsv
+
+Each intermediate sample is represented by a strain list file obtained from
+``--output-strains``. The final step uses ``augur filter`` with ``--exclude-all``
+and ``--include`` to sample the data based on the intermediate strain list
+files. If the same strain appears in both files, ``augur filter`` will only
+write it once in each of the final outputs.
+
+Generalizing subsampling in a workflow
+--------------------------------------
+
+The approach above can be cumbersome with more intermediate samples. To
+generalize this process and allow for more flexibility, a workflow management
+system can be used. The following examples use `Snakemake`_.
+
+1. Add a section in the `config file`_.
+
+   .. code-block:: yaml
+
+      subsampling:
+        state: --query "state == 'WA'" --subsample-max-sequences 100
+        country: --query "state != 'WA' & country == 'USA'" --subsample-max-sequences 50
+
+2. Add two rules in a `Snakefile`_. If you are building a standard Nextstrain
+   workflow, the output files should be used as input to sequence alignment. See
+   :doc:`../../learn/parts` to learn more about the placement of
+   this step within a workflow.
+
+   .. code-block:: python
+
+      # 1. Sample 100 sequences from Washington state
+      # 2. Sample 50 sequences from the rest of the United States
+      rule intermediate_sample:
+          input:
+              metadata = "data/metadata.tsv",
+          output:
+              strains = "results/sample_strains_{sample_name}.txt",
+          params:
+              augur_filter_args = lambda wildcards: config.get("subsampling", {}).get(wildcards.sample_name, "")
+          shell:
+              """
+              augur filter \
+                  --metadata {input.metadata} \
+                  {params.augur_filter_args} \
+                  --output-strains {output.strains}
+              """
+      # 3. Combine using augur filter
+      rule combine_intermediate_samples:
+          input:
+              sequences = "data/sequences.fasta",
+              metadata = "data/metadata.tsv",
+              intermediate_sample_strains = expand("results/sample_strains_{sample_name}.txt", sample_name=list(config.get("subsampling", {}).keys()))
+          output:
+              sequences = "results/subsampled_sequences.fasta",
+              metadata = "results/subsampled_metadata.tsv",
+          shell:
+              """
+              augur filter \
+                  --sequences {input.sequences} \
+                  --metadata {input.metadata} \
+                  --exclude-all \
+                  --include {input.intermediate_sample_strains} \
+                  --output-sequences {output.sequences} \
+                  --output-metadata {output.metadata}
+              """
+
+3. Run Snakemake targeting the second rule.
+
+   .. code-block:: bash
+
+      snakemake combine_intermediate_samples
+
+Explanation:
+
+- The configuration section consists of one entry per intermediate sample in the
+  format ``sample_name: <augur filter arguments>``.
+- The first rule is run once per intermediate sample using `wildcards`_ and an
+  `input function`_. The output of each run is the sampled strain list.
+- The second rule uses `expand()`_ to define input as all the intermediate
+  sampled strain lists, which are passed directly to ``--include`` as done in
+  the previous example.
+
+It is easy to add or remove intermediate samples. The configuration above can be
+updated to add another tier in between state and country:
+
+  .. code-block:: yaml
+
+   subsampling:
+     state: --query "state == 'WA'" --subsample-max-sequences 100
+     neighboring_states: --query "state in {'CA', 'ID', 'OR', 'NV'}" --subsample-max-sequences 75
+     country: --query "country == 'USA' & state not in {'WA', 'CA', 'ID', 'OR', 'NV'}" --subsample-max-sequences 50
+
+.. _Snakemake: https://snakemake.readthedocs.io/en/stable/index.html
+.. _config file: https://snakemake.readthedocs.io/en/stable/snakefiles/configuration.html#snakefiles-standard-configuration
+.. _Snakefile: https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html
+.. _wildcards: https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#wildcards
+.. _input function: https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#snakefiles-input-functions
+.. _expand(): https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#the-expand-function