Set probabilistic sampling as default subsampling behavior #659

Merged on Jan 20, 2021 (1 commit)
5 changes: 4 additions & 1 deletion augur/filter.py
@@ -100,7 +100,9 @@ def register_arguments(parser):
subsample_group.add_argument('--sequences-per-group', type=int, help="subsample to no more than this number of sequences per category")
subsample_group.add_argument('--subsample-max-sequences', type=int, help="subsample to no more than this number of sequences")
parser.add_argument('--group-by', nargs='+', help="categories with respect to subsample; two virtual fields, \"month\" and \"year\", are supported if they don't already exist as real fields but a \"date\" field does exist")
- parser.add_argument('--probabilistic-sampling', action='store_true', help="Sample probabilitically from groups -- useful when there are more groups than requested sequences")
+ probabilistic_sampling_group = parser.add_mutually_exclusive_group()
+ probabilistic_sampling_group.add_argument('--probabilistic-sampling', action='store_true', help="Enable probabilistic sampling during subsampling. This is useful when there are more groups than requested sequences. This option only applies when `--subsample-max-sequences` is provided.")
+ probabilistic_sampling_group.add_argument('--no-probabilistic-sampling', action='store_false', dest='probabilistic_sampling')
parser.add_argument('--subsample-seed', help="random number generator seed to allow reproducible sub-sampling (with same input data). Can be number or string.")
parser.add_argument('--exclude-where', nargs='+',
help="Exclude samples matching these conditions. Ex: \"host=rat\" or \"host!=rat\". Multiple values are processed as OR (matching any of those specified will be excluded), not AND")
@@ -111,6 +113,7 @@ def register_arguments(parser):
parser.add_argument('--query', help="Filter samples by attribute. Uses Pandas Dataframe querying, see https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#indexing-query for syntax.")
parser.add_argument('--output', '-o', help="output file", required=True)

parser.set_defaults(probabilistic_sampling=True)

def run(args):
'''
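
The flag handling in the hunks above can be tried in isolation. The following is a minimal sketch, assuming a stand-alone parser; `build_parser` and the `filter-sketch` program name are hypothetical and not part of augur. It shows how two flags that share one `dest` inside a mutually exclusive group, combined with a parser-level `set_defaults`, give a default-on option that can still be switched off.

```python
import argparse


def build_parser():
    # Hypothetical stand-alone parser; only the flag pattern mirrors the diff.
    parser = argparse.ArgumentParser(prog="filter-sketch")
    parser.add_argument("--subsample-max-sequences", type=int)

    # Both flags write to the same destination, so at most one may be given.
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--probabilistic-sampling", action="store_true",
                       dest="probabilistic_sampling")
    group.add_argument("--no-probabilistic-sampling", action="store_false",
                       dest="probabilistic_sampling")

    # Parser-level defaults override the per-argument defaults that
    # store_true/store_false would otherwise set.
    parser.set_defaults(probabilistic_sampling=True)
    return parser


if __name__ == "__main__":
    parser = build_parser()
    print(parser.parse_args([]).probabilistic_sampling)
    print(parser.parse_args(["--no-probabilistic-sampling"]).probabilistic_sampling)
```

Passing both flags at once makes argparse exit with a usage error, which is the behavior the mutually exclusive group is there to provide.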
1 change: 1 addition & 0 deletions tests/builds/tb/Snakefile
@@ -49,6 +49,7 @@ rule filter:
--exclude {input.exclude} \
--group-by {params.group_by} \
--sequences-per-group {params.sequences_per_group} \
--no-probabilistic-sampling
"""

rule mask:
2 changes: 1 addition & 1 deletion tests/builds/various_export_settings/base.snakefile
@@ -50,6 +50,7 @@ rule filter:
--output {output.sequences} \
--group-by {params.group_by} \
--sequences-per-group {params.sequences_per_group} \
--no-probabilistic-sampling \
--min-date {params.min_date}
"""

@@ -112,4 +113,3 @@ rule refine:
--date-inference {params.date_inference} \
--clock-filter-iqd {params.clock_filter_iqd}
"""

3 changes: 2 additions & 1 deletion tests/builds/zika.t
@@ -30,6 +30,7 @@ Filter sequences by a minimum date and an exclusion list and only keep one seque
> --group-by country year month \
> --sequences-per-group 1 \
> --subsample-seed 314159 \
> --no-probabilistic-sampling \
> --min-date 2012 > /dev/null

$ diff -u "results/filtered.fasta" "$TMP/out/filtered.fasta"
@@ -178,4 +179,4 @@ Export JSON files as v2 auspice outputs.

Switch back to the original directory where testing started.

- $ popd > /dev/null
+ $ popd > /dev/null
279 changes: 0 additions & 279 deletions tests/builds/zika/Snakefile

This file was deleted.

20 changes: 17 additions & 3 deletions tests/functional/filter.t
@@ -13,13 +13,14 @@ With 10 groups to subsample from, this should produce one sequence per group.
> --group-by country year month \
> --subsample-max-sequences 10 \
> --subsample-seed 314159 \
> --no-probabilistic-sampling \
> --output "$TMP/filtered.fasta" > /dev/null
$ grep ">" "$TMP/filtered.fasta" | wc -l
- 10
+ \s*10 (re)
$ rm -f "$TMP/filtered.fasta"

Try to filter with subsampling when there are more available groups than requested sequences.
- This should fail.
+ This should fail, as probabilistic sampling is explicitly disabled.

$ ${AUGUR} filter \
> --sequences filter/sequences.fasta \
@@ -28,12 +29,13 @@ This should fail.
> --group-by country year month \
> --subsample-max-sequences 5 \
> --subsample-seed 314159 \
> --no-probabilistic-sampling \
> --output "$TMP/filtered.fasta"
ERROR: Asked to provide at most 5 sequences, but there are 10 groups.
[1]
$ rm -f "$TMP/filtered.fasta"

- Use probabilistic subsampling to handle the case when there are more available groups than requested sequences.
+ Explicitly use probabilistic subsampling to handle the case when there are more available groups than requested sequences.

$ ${AUGUR} filter \
> --sequences filter/sequences.fasta \
@@ -45,3 +47,15 @@ Use probabilistic subsampling to handle the case when there are more available g
> --probabilistic-sampling \
> --output "$TMP/filtered.fasta" > /dev/null
$ rm -f "$TMP/filtered.fasta"

Using the default probabilistic subsampling, should work the same as the previous case.

$ ${AUGUR} filter \
> --sequences filter/sequences.fasta \
> --metadata filter/metadata.tsv \
> --min-date 2012 \
> --group-by country year month \
> --subsample-max-sequences 5 \
> --subsample-seed 314159 \
> --output "$TMP/filtered.fasta" > /dev/null
$ rm -f "$TMP/filtered.fasta"
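
The functional tests above contrast two behaviors when there are more groups than requested sequences: with probabilistic sampling disabled the command fails, and with it enabled (now the default) the command succeeds. The sketch below illustrates that contrast only. It is not augur's implementation: the function name `sequences_per_group` is hypothetical, and the probabilistic branch uses simple stochastic rounding as a stand-in for whatever distribution augur actually draws from.

```python
import math
import random


def sequences_per_group(group_names, max_sequences, probabilistic=True, seed=None):
    """Return a {group: count} allocation (illustrative, not augur's code)."""
    n_groups = len(group_names)
    if n_groups == 0:
        return {}

    if not probabilistic:
        # Deterministic mode cannot give every group a whole sequence
        # when there are more groups than requested sequences.
        if max_sequences < n_groups:
            raise ValueError(
                f"Asked to provide at most {max_sequences} sequences, "
                f"but there are {n_groups} groups."
            )
        per_group = max_sequences // n_groups
        return {name: per_group for name in group_names}

    # Assumed stand-in technique: stochastic rounding of the fractional
    # per-group target, so the expected total equals max_sequences.
    rng = random.Random(seed)  # accepts a number or a string as seed
    mean = max_sequences / n_groups
    base = math.floor(mean)
    frac = mean - base
    return {
        name: base + (1 if rng.random() < frac else 0)
        for name in group_names
    }


if __name__ == "__main__":
    groups = [f"group{i}" for i in range(10)]
    print(sequences_per_group(groups, 5, seed=314159))
    try:
        sequences_per_group(groups, 5, probabilistic=False)
    except ValueError as error:
        print(f"ERROR: {error}")
```

In the probabilistic branch the total number of sampled sequences varies from run to run unless a seed is fixed, which is why the tests above pass `--subsample-seed`.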