diff --git a/docs/usage/cli/filter.rst b/docs/usage/cli/filter.rst
index 9947d5dcd..c771f05d1 100644
--- a/docs/usage/cli/filter.rst
+++ b/docs/usage/cli/filter.rst
@@ -18,210 +18,4 @@ Guides
 Below are some examples of using ``augur filter`` to sample data.
 
-Filtering
----------
-
-The filter command allows you to select various subsets of your input data for different types of analysis.
-A simple example use of this command would be
-
-.. code-block:: bash
-
-  augur filter \
-    --sequences data/sequences.fasta \
-    --metadata data/metadata.tsv \
-    --min-date 2012 \
-    --output-sequences filtered_sequences.fasta \
-    --output-metadata filtered_metadata.tsv
-
-This command will select all sequences with collection date in 2012 or later.
-The filter command has a large number of options that allow flexible filtering for many common situations.
-One such use-case is the exclusion of sequences that are known to be outliers (e.g. because of sequencing errors, cell-culture adaptation, ...).
-These can be specified in a separate text file (e.g. ``exclude.txt``):
-
-.. code-block::
-
-  BRA/2016/FC_DQ75D1
-  COL/FLR_00034/2015
-  ...
-
-To drop such strains, you can pass the filename to ``--exclude``:
-
-.. code-block:: bash
-
-  augur filter \
-    --sequences data/sequences.fasta \
-    --metadata data/metadata.tsv \
-    --min-date 2012 \
-    --exclude exclude.txt \
-    --output-sequences filtered_sequences.fasta \
-    --output-metadata filtered_metadata.tsv
-
-Subsampling within ``augur filter``
------------------------------------
-
-Another common filtering operation is subsetting of data to a achieve a more even spatio-temporal distribution or to cut-down data set size to more manageable numbers.
-The filter command allows you to select a specific number of sequences from specific groups, for example one sequence per month from each country:
-
-.. code-block:: bash
-
-  augur filter \
-    --sequences data/sequences.fasta \
-    --metadata data/metadata.tsv \
-    --min-date 2012 \
-    --exclude exclude.txt \
-    --group-by country year month \
-    --sequences-per-group 1 \
-    --output-sequences subsampled_sequences.fasta \
-    --output-metadata subsampled_metadata.tsv
-
-Subsampling using multiple ``augur filter`` commands
-----------------------------------------------------
-
-There are some subsampling strategies in which a single call to ``augur filter``
-does not suffice. One such strategy is "tiered subsampling". In this strategy,
-mutually exclusive sets of filters, each representing a "tier", are sampled with
-different subsampling rules. This is commonly used to create geographic tiers.
-Consider this subsampling scheme:
-
-   Sample 100 sequences from Washington state and 50 sequences from the rest of the United States.
-
-This cannot be done in a single call to ``augur filter``. Instead, it can be
-decomposed into multiple schemes, each handled by a single call to ``augur
-filter``. Additionally, there is an extra step to combine the intermediate
-samples.
-
-   1. Sample 100 sequences from Washington state.
-   2. Sample 50 sequences from the rest of the United States.
-   3. Combine the samples.
-
-Calling ``augur filter`` multiple times
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-A basic approach is to run the ``augur filter`` commands directly. This works
-well for ad-hoc analyses.
-
-.. code-block:: bash
-
-  # 1. Sample 100 sequences from Washington state
-  augur filter \
-    --sequences sequences.fasta \
-    --metadata metadata.tsv \
-    --query "state == 'WA'" \
-    --subsample-max-sequences 100 \
-    --output-strains sample_strains_state.txt
-
-  # 2. Sample 50 sequences from the rest of the United States
-  augur filter \
-    --sequences sequences.fasta \
-    --metadata metadata.tsv \
-    --query "state != 'WA' & country == 'USA'" \
-    --subsample-max-sequences 50 \
-    --output-strains sample_strains_country.txt
-
-  # 3. Combine using augur filter
-  augur filter \
-    --sequences sequences.fasta \
-    --metadata metadata.tsv \
-    --exclude-all \
-    --include sample_strains_state.txt \
-              sample_strains_country.txt \
-    --output-sequences subsampled_sequences.fasta \
-    --output-metadata subsampled_metadata.tsv
-
-Each intermediate sample is represented by a strain list file obtained from
-``--output-strains``. The final step uses ``augur filter`` with ``--exclude-all``
-and ``--include`` to sample the data based on the intermediate strain list
-files. If the same strain appears in both files, ``augur filter`` will only
-write it once in each of the final outputs.
-
-Generalizing subsampling in a workflow
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-The approach above can be cumbersome with more intermediate samples. To
-generalize this process and allow for more flexibility, a workflow management
-system can be used. The following examples use `Snakemake`_.
-
-1. Add a section in the `config file`_.
-
-   .. code-block:: yaml
-
-     subsampling:
-       state: --query "state == 'WA'" --subsample-max-sequences 100
-       country: --query "state != 'WA' & country == 'USA'" --subsample-max-sequences 50
-
-2. Add two rules in a `Snakefile`_. If you are building a standard Nextstrain
-   workflow, the output files should be used as input to sequence alignment. See
-   :doc:`docs.nextstrain.org:learn/parts` to learn more about the placement of
-   this step within a workflow.
-
-   .. code-block:: python
-
-     # 1. Sample 100 sequences from Washington state
-     # 2. Sample 50 sequences from the rest of the United States
-     rule intermediate_sample:
-         input:
-             metadata = "data/metadata.tsv",
-         output:
-             strains = "results/sample_strains_{sample_name}.txt",
-         params:
-             augur_filter_args = lambda wildcards: config.get("subsampling", {}).get(wildcards.sample_name, "")
-         shell:
-             """
-             augur filter \
-                 --metadata {input.metadata} \
-                 {params.augur_filter_args} \
-                 --output-strains {output.strains}
-             """
-
-     # 3. Combine using augur filter
-     rule combine_intermediate_samples:
-         input:
-             sequences = "data/sequences.fasta",
-             metadata = "data/metadata.tsv",
-             intermediate_sample_strains = expand("results/sample_strains_{sample_name}.txt", sample_name=list(config.get("subsampling", {}).keys()))
-         output:
-             sequences = "results/subsampled_sequences.fasta",
-             metadata = "results/subsampled_metadata.tsv",
-         shell:
-             """
-             augur filter \
-                 --sequences {input.sequences} \
-                 --metadata {input.metadata} \
-                 --exclude-all \
-                 --include {input.intermediate_sample_strains} \
-                 --output-sequences {output.sequences} \
-                 --output-metadata {output.metadata}
-             """
-
-3. Run Snakemake targeting the second rule.
-
-   .. code-block:: bash
-
-     snakemake combine_intermediate_samples
-
-Explanation:
-
-- The configuration section consists of one entry per intermediate sample in the
-  format ``sample_name: <augur filter arguments>``.
-- The first rule is run once per intermediate sample using `wildcards`_ and an
-  `input function`_. The output of each run is the sampled strain list.
-- The second rule uses `expand()`_ to define input as all the intermediate
-  sampled strain lists, which are passed directly to ``--include`` as done in
-  the previous example.
-
-It is easy to add or remove intermediate samples. The configuration above can be
-updated to add another tier in between state and country:
-
-   .. code-block:: yaml
-
-     subsampling:
-       state: --query "state == 'WA'" --subsample-max-sequences 100
-       neighboring_states: --query "state in {'CA', 'ID', 'OR', 'NV'}" --subsample-max-sequences 75
-       country: --query "country == 'USA' & state not in {'WA', 'CA', 'ID', 'OR', 'NV'}" --subsample-max-sequences 50
-
-.. _Snakemake: https://snakemake.readthedocs.io/en/stable/index.html
-.. _config file: https://snakemake.readthedocs.io/en/stable/snakefiles/configuration.html#snakefiles-standard-configuration
-.. _Snakefile: https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html
-.. _wildcards: https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#wildcards
-.. _input function: https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#snakefiles-input-functions
-.. _expand(): https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#the-expand-function
+.. include:: snippets/filtering-and-subsampling.rst
diff --git a/docs/usage/cli/snippets/filtering-and-subsampling.rst b/docs/usage/cli/snippets/filtering-and-subsampling.rst
new file mode 100644
index 000000000..b37df0aa5
--- /dev/null
+++ b/docs/usage/cli/snippets/filtering-and-subsampling.rst
@@ -0,0 +1,207 @@
+Filtering
+---------
+
+The filter command allows you to select various subsets of your input data for different types of analysis.
+A simple example use of this command would be
+
+.. code-block:: bash
+
+  augur filter \
+    --sequences data/sequences.fasta \
+    --metadata data/metadata.tsv \
+    --min-date 2012 \
+    --output-sequences filtered_sequences.fasta \
+    --output-metadata filtered_metadata.tsv
+
+This command will select all sequences with collection date in 2012 or later.
+The filter command has a large number of options that allow flexible filtering for many common situations.
+One such use-case is the exclusion of sequences that are known to be outliers (e.g. because of sequencing errors, cell-culture adaptation, ...).
+These can be specified in a separate text file (e.g. ``exclude.txt``):
+
+.. code-block::
+
+  BRA/2016/FC_DQ75D1
+  COL/FLR_00034/2015
+  ...
+
+To drop such strains, you can pass the filename to ``--exclude``:
+
+.. code-block:: bash
+
+  augur filter \
+    --sequences data/sequences.fasta \
+    --metadata data/metadata.tsv \
+    --min-date 2012 \
+    --exclude exclude.txt \
+    --output-sequences filtered_sequences.fasta \
+    --output-metadata filtered_metadata.tsv
+
+Subsampling within ``augur filter``
+-----------------------------------
+
+Another common filtering operation is subsetting of data to achieve a more even spatio-temporal distribution or to cut down the data set size to more manageable numbers.
+The filter command allows you to select a specific number of sequences from specific groups, for example one sequence per month from each country:
+
+.. code-block:: bash
+
+  augur filter \
+    --sequences data/sequences.fasta \
+    --metadata data/metadata.tsv \
+    --min-date 2012 \
+    --exclude exclude.txt \
+    --group-by country year month \
+    --sequences-per-group 1 \
+    --output-sequences subsampled_sequences.fasta \
+    --output-metadata subsampled_metadata.tsv
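+
+If you would rather cap the total number of sequences than fix a count per
+group, ``--group-by`` can also be combined with ``--subsample-max-sequences``.
+The example below is only a sketch; the target of 500 sequences is an arbitrary
+value to adjust for your own analysis:
+
+.. code-block:: bash
+
+  augur filter \
+    --sequences data/sequences.fasta \
+    --metadata data/metadata.tsv \
+    --min-date 2012 \
+    --exclude exclude.txt \
+    --group-by country year month \
+    --subsample-max-sequences 500 \
+    --output-sequences subsampled_sequences.fasta \
+    --output-metadata subsampled_metadata.tsv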
+
+Subsampling using multiple ``augur filter`` commands
+----------------------------------------------------
+
+There are some subsampling strategies in which a single call to ``augur filter``
+does not suffice. One such strategy is "tiered subsampling". In this strategy,
+mutually exclusive sets of filters, each representing a "tier", are sampled with
+different subsampling rules. This is commonly used to create geographic tiers.
+Consider this subsampling scheme:
+
+   Sample 100 sequences from Washington state and 50 sequences from the rest of the United States.
+
+This cannot be done in a single call to ``augur filter``. Instead, it can be
+decomposed into multiple schemes, each handled by a single call to ``augur
+filter``. Additionally, there is an extra step to combine the intermediate
+samples.
+
+   1. Sample 100 sequences from Washington state.
+   2. Sample 50 sequences from the rest of the United States.
+   3. Combine the samples.
+
+Calling ``augur filter`` multiple times
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+A basic approach is to run the ``augur filter`` commands directly. This works
+well for ad-hoc analyses.
+
+.. code-block:: bash
+
+  # 1. Sample 100 sequences from Washington state
+  augur filter \
+    --sequences sequences.fasta \
+    --metadata metadata.tsv \
+    --query "state == 'WA'" \
+    --subsample-max-sequences 100 \
+    --output-strains sample_strains_state.txt
+
+  # 2. Sample 50 sequences from the rest of the United States
+  augur filter \
+    --sequences sequences.fasta \
+    --metadata metadata.tsv \
+    --query "state != 'WA' & country == 'USA'" \
+    --subsample-max-sequences 50 \
+    --output-strains sample_strains_country.txt
+
+  # 3. Combine using augur filter
+  augur filter \
+    --sequences sequences.fasta \
+    --metadata metadata.tsv \
+    --exclude-all \
+    --include sample_strains_state.txt \
+              sample_strains_country.txt \
+    --output-sequences subsampled_sequences.fasta \
+    --output-metadata subsampled_metadata.tsv
+
+Each intermediate sample is represented by a strain list file obtained from
+``--output-strains``. The final step uses ``augur filter`` with ``--exclude-all``
+and ``--include`` to sample the data based on the intermediate strain list
+files. If the same strain appears in both files, ``augur filter`` will only
+write it once in each of the final outputs.
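+
+Note that the two intermediate samples only need the metadata: when the only
+requested output is ``--output-strains``, the ``--sequences`` input can be
+omitted; the sequences only need to be read in the final combining step. As a
+sketch, step 1 above could be reduced to:
+
+.. code-block:: bash
+
+  augur filter \
+    --metadata metadata.tsv \
+    --query "state == 'WA'" \
+    --subsample-max-sequences 100 \
+    --output-strains sample_strains_state.txt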
+
+Generalizing subsampling in a workflow
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The approach above can be cumbersome with more intermediate samples. To
+generalize this process and allow for more flexibility, a workflow management
+system can be used. The following examples use `Snakemake`_.
+
+1. Add a section in the `config file`_.
+
+   .. code-block:: yaml
+
+     subsampling:
+       state: --query "state == 'WA'" --subsample-max-sequences 100
+       country: --query "state != 'WA' & country == 'USA'" --subsample-max-sequences 50
+
+2. Add two rules in a `Snakefile`_. If you are building a standard Nextstrain
+   workflow, the output files should be used as input to sequence alignment. See
+   :doc:`docs.nextstrain.org:learn/parts` to learn more about the placement of
+   this step within a workflow.
+
+   .. code-block:: python
+
+     # 1. Sample 100 sequences from Washington state
+     # 2. Sample 50 sequences from the rest of the United States
+     rule intermediate_sample:
+         input:
+             metadata = "data/metadata.tsv",
+         output:
+             strains = "results/sample_strains_{sample_name}.txt",
+         params:
+             augur_filter_args = lambda wildcards: config.get("subsampling", {}).get(wildcards.sample_name, "")
+         shell:
+             """
+             augur filter \
+                 --metadata {input.metadata} \
+                 {params.augur_filter_args} \
+                 --output-strains {output.strains}
+             """
+
+     # 3. Combine using augur filter
+     rule combine_intermediate_samples:
+         input:
+             sequences = "data/sequences.fasta",
+             metadata = "data/metadata.tsv",
+             intermediate_sample_strains = expand("results/sample_strains_{sample_name}.txt", sample_name=list(config.get("subsampling", {}).keys()))
+         output:
+             sequences = "results/subsampled_sequences.fasta",
+             metadata = "results/subsampled_metadata.tsv",
+         shell:
+             """
+             augur filter \
+                 --sequences {input.sequences} \
+                 --metadata {input.metadata} \
+                 --exclude-all \
+                 --include {input.intermediate_sample_strains} \
+                 --output-sequences {output.sequences} \
+                 --output-metadata {output.metadata}
+             """
+
+3. Run Snakemake targeting the second rule.
+
+   .. code-block:: bash
+
+     snakemake combine_intermediate_samples
+
+Explanation:
+
+- The configuration section consists of one entry per intermediate sample in the
+  format ``sample_name: <augur filter arguments>``.
+- The first rule is run once per intermediate sample using `wildcards`_ and an
+  `input function`_. The output of each run is the sampled strain list.
+- The second rule uses `expand()`_ to define input as all the intermediate
+  sampled strain lists, which are passed directly to ``--include`` as done in
+  the previous example.
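+
+The two rules assume that the ``subsampling`` section has been loaded into
+Snakemake's ``config`` object, for example via a ``configfile:`` directive at
+the top of the Snakefile or via ``--configfile`` on the command line. The path
+``config.yaml`` below is only a placeholder for wherever your workflow keeps
+its configuration:
+
+.. code-block:: bash
+
+  snakemake --configfile config.yaml combine_intermediate_samples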
+
+It is easy to add or remove intermediate samples. The configuration above can be
+updated to add another tier in between state and country:
+
+   .. code-block:: yaml
+
+     subsampling:
+       state: --query "state == 'WA'" --subsample-max-sequences 100
+       neighboring_states: --query "state in {'CA', 'ID', 'OR', 'NV'}" --subsample-max-sequences 75
+       country: --query "country == 'USA' & state not in {'WA', 'CA', 'ID', 'OR', 'NV'}" --subsample-max-sequences 50
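+
+Because the intermediate strain lists all follow the
+``results/sample_strains_{sample_name}.txt`` pattern, the new tier requires no
+changes to the Snakefile; rerunning the combining rule is enough. If the final
+outputs already exist from a previous run, you may need to force re-execution,
+for example:
+
+.. code-block:: bash
+
+  snakemake --forceall combine_intermediate_samples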
+
+.. _Snakemake: https://snakemake.readthedocs.io/en/stable/index.html
+.. _config file: https://snakemake.readthedocs.io/en/stable/snakefiles/configuration.html#snakefiles-standard-configuration
+.. _Snakefile: https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html
+.. _wildcards: https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#wildcards
+.. _input function: https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#snakefiles-input-functions
+.. _expand(): https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#the-expand-function