filter: Reduce over-sampling in partial months with --group-by month #960

Closed
victorlin opened this issue May 31, 2022 · 3 comments · Fixed by #1067

victorlin commented May 31, 2022

Context

@trvrb from nextstrain/ncov#957:

With timespans this narrow there is some unavoidable funny interaction with how augur filter subsamples based on --vpm, i.e. viruses per month. We commonly have situations where, if the current date is say May 15, we end up with:

  • min date of March 15
  • desire by augur filter to equally sample viruses from March, April and May categories

so that March and May each have 2 weeks in which to sample X viruses while April has 4 weeks to sample X viruses. This results in March and May being more densely sampled, in terms of viruses per day, than April.

This effect will be more pronounced in scenarios where the current date is, say, May 28, so that X viruses are sampled over just 3 days in March but over 30 days in April.

To fully address this we'd need to extend augur filter to have the option of per-week sampling categories in addition to per-month sampling categories. Or perhaps some continuous specification. However, I don't think this is too big of an issue in terms of the current PR and it's something we can refine once Augur is updated.
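To make the density difference concrete, a quick back-of-the-envelope sketch (plain Python, not augur code; the per-group quota X = 10 and the window dates are illustrative assumptions taken from the May 15 scenario above):

from datetime import date

# Assumed per-month quota; augur filter would derive this from
# --sequences-per-group or --subsample-max-sequences.
X = 10

# Sampling window from the quote: current date May 15, min date March 15.
windows = {
    "2022-03": (date(2022, 3, 15), date(2022, 3, 31)),
    "2022-04": (date(2022, 4, 1),  date(2022, 4, 30)),
    "2022-05": (date(2022, 5, 1),  date(2022, 5, 15)),
}

for month, (start, end) in windows.items():
    days = (end - start).days + 1
    print(f"{month}: {X} viruses over {days} days = {X / days:.2f} viruses/day")
# 2022-03: 10 viruses over 17 days = 0.59 viruses/day
# 2022-04: 10 viruses over 30 days = 0.33 viruses/day
# 2022-05: 10 viruses over 15 days = 0.67 viruses/day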

Example

cat > metadata.tsv << ~~
strain	date
SEQ1	2022-03-21
SEQ2	2022-03-22
SEQ3	2022-03-23
SEQ4	2022-04-01
SEQ5	2022-04-02
SEQ6	2022-04-03
SEQ7	2022-05-01
SEQ8	2022-05-02
SEQ9	2022-05-03
SEQ10	2022-05-04
~~

augur filter \
--metadata metadata.tsv \
--min-date 2022-03-15 \
--max-date 2022-05-15 \
--group-by year month \
--subsample-max-sequences 8 \
--subsample-seed 0 \
--output-metadata out.tsv
# Sampling at 2 per group.
# 4 strains were dropped during filtering
# 	4 of these were dropped because of subsampling criteria
# 6 strains passed all filters

cat out.tsv | sort -k 2
# SEQ1	2022-03-21
# SEQ2	2022-03-22
# SEQ4	2022-04-01
# SEQ5	2022-04-02
# SEQ7	2022-05-01
# SEQ9	2022-05-03
# strain	date

When requesting --subsample-max-sequences, this samples evenly from the 3 groups 2022-03, 2022-04, and 2022-05. However, note that --min-date and --max-date restrict the sampling window to half of 2022-03, all of 2022-04, and half of 2022-05. An ideal strategy would sample proportionally to each group's share of the sampling window (e.g. a 2/4/2 split).
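A sketch of what window-proportional allocation could look like (illustrative only, not how augur filter currently works; days_in_window is a hypothetical helper and the budget of 8 matches --subsample-max-sequences above):

from datetime import date

def days_in_window(month_start, month_end, window_start, window_end):
    # Number of days of the month that fall inside the sampling window.
    start = max(month_start, window_start)
    end = min(month_end, window_end)
    return max((end - start).days + 1, 0)

window = (date(2022, 3, 15), date(2022, 5, 15))   # --min-date / --max-date
months = {
    "2022-03": (date(2022, 3, 1), date(2022, 3, 31)),
    "2022-04": (date(2022, 4, 1), date(2022, 4, 30)),
    "2022-05": (date(2022, 5, 1), date(2022, 5, 31)),
}

budget = 8                                        # --subsample-max-sequences
overlap = {m: days_in_window(s, e, *window) for m, (s, e) in months.items()}
total = sum(overlap.values())
quotas = {m: budget * d / total for m, d in overlap.items()}
print(quotas)
# {'2022-03': 2.19..., '2022-04': 3.87..., '2022-05': 1.93...}
# i.e. roughly the 2/4/2 split, before rounding fractional quotas to integers.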

victorlin added the enhancement and proposal labels on May 31, 2022
victorlin self-assigned this on May 31, 2022

victorlin commented May 31, 2022

have the option of per-week sampling categories in addition to per-month sampling categories.

I don't think --group-by week is right:

  • It's more difficult to extract that info from YYYY-MM-DD.
  • There will be the same problem of over-sampling with "partial weeks" if using something like --min-date 1M which translates to 4W and some change.

Or perhaps some continuous specification.

This seems right to me. It is fairly straightforward to enable --group-by day so we can have --group-by ... year month day for the "continuous" approach. Run time might be impacted since this creates ~30x more groups compared to --group-by ... year month. Are there any other drawbacks to this approach?
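For a rough sense of the group-count blow-up, a sketch of the two grouping keys over the example metadata above (illustrative only; this is not augur filter's actual grouping code):

import csv

def group_key(date_string, by_day=False):
    year, month, day = date_string.split("-")
    return (year, month, day) if by_day else (year, month)

with open("metadata.tsv") as f:
    rows = list(csv.DictReader(f, delimiter="\t"))

month_groups = {group_key(row["date"]) for row in rows}
day_groups = {group_key(row["date"], by_day=True) for row in rows}
print(len(month_groups), len(day_groups))
# 3 10 -- on the tiny example; with dense real-world data, day grouping
# produces up to ~30-31x as many groups as month grouping.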

trvrb commented Jun 1, 2022

I definitely take your point on --group-by week, but there are some funny interactions here. In the current system we're often mashing together geography and time into our sampling categories, so we end up with effectively:

  • UK Apr 2022
  • UK May 2022
  • Spain Apr 2022
  • Spain May 2022
  • Africa Apr 2022
  • Africa May 2022
    etc...

for current Europe-focused ncov builds. With a 6-month focus we have 6 months x 46 countries = 276 categories. If this were days, we'd have 180 days x 46 countries = 8280 categories. I believe (but could be confused) that by picking randomly among the 8280 we'd be biasing towards temporal diversity and away from geographic diversity relative to the 276-category scenario. I.e. with ~3000 tips in the 276-category scenario you'd have ~11 per country per month pretty systematically. But in the 8280-category scenario, I'd think that stochastically you might end up with different counts per country, as each category would be picked ~1/3 of the time. (I might be thinking about this wrong; I feel like I'd want to test to confirm.)
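The arithmetic here can be sketched quickly, along with a toy simulation of the stochastic effect (not augur's actual subsampling code; it assumes every category is non-empty and contributes at most one sequence):

import random
from collections import Counter

random.seed(0)
countries, months, days, tips = 46, 6, 180, 3000

print(months * countries, days * countries)  # 276 month categories, 8280 day categories
print(tips / (months * countries))           # ~10.9 per country-month category (~65 per country), systematically
print(tips / (days * countries))             # ~0.36: each country-day category picked about 1/3 of the time

# Toy simulation of the day-level case: pick `tips` of the 8280 country-day
# categories uniformly at random and tally sequences per country.
picked = random.sample([(c, d) for c in range(countries) for d in range(days)], tips)
per_country = Counter(c for c, _ in picked)
print(min(per_country.values()), max(per_country.values()))
# roughly 50-80 per country in a typical run, versus a steady ~65 under month categories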

corneliusroemer commented

Group by day is not good, because daily sequencing volume varies a lot whereas weekly volume does not. There's not much collection on Saturdays, Sundays, etc.

Weekly is the right way to go for now - definitely better than just monthly.

Sorry I only see this now.
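For reference, a week-level grouping key is easy to derive from a YYYY-MM-DD string with the standard library (a sketch of one possible key using ISO weeks; not necessarily the scheme adopted in #1067):

from datetime import date

def week_key(date_string):
    # (ISO year, ISO week number) grouping key from a YYYY-MM-DD string.
    year, month, day = map(int, date_string.split("-"))
    iso = date(year, month, day).isocalendar()
    return (iso[0], iso[1])

print(week_key("2022-03-21"))  # (2022, 12)
print(week_key("2022-05-02"))  # (2022, 18)

Using the ISO year rather than the calendar year keeps the week that straddles the new year in a single group instead of splitting it across two.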
