filter: Reduce over-sampling in partial months with --group-by month #960

Closed
victorlin opened this issue May 31, 2022 · 3 comments · Fixed by #1067

victorlin commented May 31, 2022

Context

@trvrb from nextstrain/ncov#957:

With timespans this narrow there is some unavoidable funny interaction with how augur filter subsamples based on --vpm, i.e. viruses per month. We commonly have situations where, if the current date is say May 15, we end up with:

  • min date of March 15
  • desire by augur filter to equally sample viruses from March, April and May categories

so that March and May each have 2 weeks in which to sample X viruses while April has 4 weeks to sample X viruses. This results in March and May being more densely sampled, in terms of viruses per day, than April.

This effect will be more pronounced in scenarios where the current date is, say, May 28, so that X viruses are sampled over just 3 days in March but over 30 days in April.

To fully address this we'd need to extend augur filter to have the option of per-week sampling categories in addition to per-month sampling categories. Or perhaps some continuous specification. However, I don't think this is too big of an issue in terms of the current PR and it's something we can refine once Augur is updated.
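To make the density difference concrete, a quick back-of-the-envelope sketch (plain Python, not augur code; the per-group quota X = 10 and the window dates are illustrative assumptions taken from the May 15 scenario above):

from datetime import date

# Assumed per-month quota; augur filter would derive this from
# --sequences-per-group or --subsample-max-sequences.
X = 10

# Sampling window from the quote: current date May 15, min date March 15.
windows = {
    "2022-03": (date(2022, 3, 15), date(2022, 3, 31)),
    "2022-04": (date(2022, 4, 1),  date(2022, 4, 30)),
    "2022-05": (date(2022, 5, 1),  date(2022, 5, 15)),
}

for month, (start, end) in windows.items():
    days = (end - start).days + 1
    print(f"{month}: {X} viruses over {days} days = {X / days:.2f} viruses/day")
# 2022-03: 10 viruses over 17 days = 0.59 viruses/day
# 2022-04: 10 viruses over 30 days = 0.33 viruses/day
# 2022-05: 10 viruses over 15 days = 0.67 viruses/day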

Example

cat > metadata.tsv << ~~
strain	date
SEQ1	2022-03-21
SEQ2	2022-03-22
SEQ3	2022-03-23
SEQ4	2022-04-01
SEQ5	2022-04-02
SEQ6	2022-04-03
SEQ7	2022-05-01
SEQ8	2022-05-02
SEQ9	2022-05-03
SEQ10	2022-05-04
~~

augur filter \
--metadata metadata.tsv \
--min-date 2022-03-15 \
--max-date 2022-05-15 \
--group-by year month \
--subsample-max-sequences 8 \
--subsample-seed 0 \
--output-metadata out.tsv
# Sampling at 2 per group.
# 4 strains were dropped during filtering
# 	4 of these were dropped because of subsampling criteria
# 6 strains passed all filters

cat out.tsv | sort -k 2
# SEQ1	2022-03-21
# SEQ2	2022-03-22
# SEQ4	2022-04-01
# SEQ5	2022-04-02
# SEQ7	2022-05-01
# SEQ9	2022-05-03
# strain	date

When requesting --subsample-max-sequences, this samples evenly from the 3 groups 2022-03, 2022-04, and 2022-05. However, note that --min-date and --max-date restrict the sampling window to half of 2022-03, all of 2022-04, and half of 2022-05. An ideal strategy would sample proportionally to each group's share of the sampling window (e.g. a 2/4/2 split).
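A sketch of what window-proportional allocation could look like (illustrative only, not how augur filter currently works; days_in_window is a hypothetical helper and the budget of 8 matches --subsample-max-sequences above):

from datetime import date

def days_in_window(month_start, month_end, window_start, window_end):
    # Number of days of the month that fall inside the sampling window.
    start = max(month_start, window_start)
    end = min(month_end, window_end)
    return max((end - start).days + 1, 0)

window = (date(2022, 3, 15), date(2022, 5, 15))   # --min-date / --max-date
months = {
    "2022-03": (date(2022, 3, 1), date(2022, 3, 31)),
    "2022-04": (date(2022, 4, 1), date(2022, 4, 30)),
    "2022-05": (date(2022, 5, 1), date(2022, 5, 31)),
}

budget = 8                                        # --subsample-max-sequences
overlap = {m: days_in_window(s, e, *window) for m, (s, e) in months.items()}
total = sum(overlap.values())
quotas = {m: budget * d / total for m, d in overlap.items()}
print(quotas)
# {'2022-03': 2.19..., '2022-04': 3.87..., '2022-05': 1.93...}
# i.e. roughly the 2/4/2 split, before rounding fractional quotas to integers.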

victorlin added the enhancement and proposal labels on May 31, 2022
victorlin self-assigned this on May 31, 2022

victorlin commented May 31, 2022

have the option of per-week sampling categories in addition to per-month sampling categories.

I don't think --group-by week is right:

  • It's more difficult to extract that info from YYYY-MM-DD.
  • There will be the same problem of over-sampling with "partial weeks" if using something like --min-date 1M which translates to 4W and some change.

Or perhaps some continuous specification.

This seems right to me. It is fairly straightforward to enable --group-by day so we can have --group-by ... year month day for the "continuous" approach. Run time might be impacted since this creates ~30x more groups compared to --group-by ... year month. Are there any other drawbacks to this approach?
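For a rough sense of the group-count blow-up, a sketch of the two grouping keys over the example metadata above (illustrative only; this is not augur filter's actual grouping code):

import csv

def group_key(date_string, by_day=False):
    year, month, day = date_string.split("-")
    return (year, month, day) if by_day else (year, month)

with open("metadata.tsv") as f:
    rows = list(csv.DictReader(f, delimiter="\t"))

month_groups = {group_key(row["date"]) for row in rows}
day_groups = {group_key(row["date"], by_day=True) for row in rows}
print(len(month_groups), len(day_groups))
# 3 10 -- on the tiny example; with dense real-world data, day grouping
# produces up to ~30-31x as many groups as month grouping.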

trvrb commented Jun 1, 2022

I definitely take your point on --group-by week, but there are some funny interactions here. In the current system we're often mashing together geography and time into our sampling categories, so we end up with effectively:

  • UK Apr 2022
  • UK May 2022
  • Spain Apr 2022
  • Spain May 2022
  • Africa Apr 2022
  • Africa May 2022
    etc...

for current Europe-focused ncov builds. With a 6-month focus we have 6 months x 46 countries = 276 categories. If this were days, we'd have 180 days x 46 countries = 8280 categories. I believe (but could be confused) that by picking randomly among the 8280 we'd be biasing towards temporal diversity and away from geographic diversity relative to the 276-category scenario. I.e. with ~3000 tips in the 276-category scenario you'd have ~11 per country per month pretty systematically. But in the 8280-category scenario, I'd think that stochastically you might end up with different counts per country, as each category would be picked ~1/3 of the time. (I might be thinking about this wrong; I feel like I'd want to test to confirm.)
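The arithmetic here can be sketched quickly, along with a toy simulation of the stochastic effect (not augur's actual subsampling code; it assumes every category is non-empty and contributes at most one sequence):

import random
from collections import Counter

random.seed(0)
countries, months, days, tips = 46, 6, 180, 3000

print(months * countries, days * countries)  # 276 month categories, 8280 day categories
print(tips / (months * countries))           # ~10.9 per country-month category (~65 per country), systematically
print(tips / (days * countries))             # ~0.36: each country-day category picked about 1/3 of the time

# Toy simulation of the day-level case: pick `tips` of the 8280 country-day
# categories uniformly at random and tally sequences per country.
picked = random.sample([(c, d) for c in range(countries) for d in range(days)], tips)
per_country = Counter(c for c, _ in picked)
print(min(per_country.values()), max(per_country.values()))
# roughly 50-80 per country in a typical run, versus a steady ~65 under month categories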

corneliusroemer commented

Group by day is not good, because daily sequencing volume varies a lot whereas weekly volume does not. There's not much collection on Saturdays, Sundays, etc.

Weekly is the right way to go for now - definitely better than just monthly.

Sorry I only see this now.
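For reference, a week-level grouping key is easy to derive from a YYYY-MM-DD string with the standard library (a sketch of one possible key using ISO weeks; not necessarily the scheme adopted in #1067):

from datetime import date

def week_key(date_string):
    # (ISO year, ISO week number) grouping key from a YYYY-MM-DD string.
    year, month, day = map(int, date_string.split("-"))
    iso = date(year, month, day).isocalendar()
    return (iso[0], iso[1])

print(week_key("2022-03-21"))  # (2022, 12)
print(week_key("2022-05-02"))  # (2022, 18)

Using the ISO year rather than the calendar year keeps the week that straddles the new year in a single group instead of splitting it across two.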
