
Feat/probabilistic sampling #629

Merged: 4 commits from feat/probabilistic_sampling into master on Dec 28, 2020
Conversation

@rneher (Member) commented Nov 10, 2020

Description of proposed changes

Allow probabilistic sampling of sequences when there are many more sampling categories than target sequences. This PR samples from each category with a probability chosen so that, on average, the requested number of sequences is selected.

Thank you for contributing to Nextstrain!

@rneher rneher requested a review from huddlej November 10, 2020 16:57
@huddlej huddlej added this to the Feature release 10.1.0 milestone Nov 12, 2020
@huddlej (Contributor) left a comment

This PR relates to a couple of bigger issues for Augur and ncov builds, so I’ve tried to summarize the context of the PR as I understand it and also provide some alternate approaches for the ncov builds specifically. I also included some thoughts about the specific implementation.

Context and summary

The context for this feature was that we were trying to subsample early North American strains in ncov builds, grouping by division, year, and month, while allowing a maximum of 300 sequences. There are more combinations of divisions, years, and months (>400) than the number of sequences allowed, so the current subsampling algorithm fails because it cannot pick fewer than 1 sequence per group.

This PR’s feature allows users to specify a maximum number of sequences that is less than the total number of groups, by allowing a fractional number of sequences per group and sampling each group's count from a Poisson distribution with that fraction as its mean. This lets randomly chosen groups have zero strains selected, while the expected total number of strains sampled equals the maximum number of sequences allowed.
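To make the mechanism concrete, here is a minimal sketch of the idea (not the code in this PR; the function name and the use of numpy are illustrative assumptions):

import numpy as np

def sketch_poisson_sampling(max_sequences, num_groups, seed=None):
    # Hypothetical illustration of probabilistic sampling: draw each group's
    # sample size from a Poisson distribution whose mean is the fractional
    # number of sequences per group, so many groups draw zero sequences while
    # the expected total across all groups equals max_sequences.
    rng = np.random.default_rng(seed)
    mean_per_group = max_sequences / num_groups  # e.g., 300 / 415 ≈ 0.72
    return rng.poisson(mean_per_group, size=num_groups)

# For example, sketch_poisson_sampling(300, 415).sum() is roughly 300 on average.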

Thoughts about subsampling

Our current subsampling approach treats each group (e.g., a combination of division, year, and month) as an equally important, discrete bin, but not all of these bins are equally represented and not all may be equally important.

For example, of the 415+ division/year/month groups, most bins do not have data. We can see this in a punch card plot of the number of samples per year/month and division (a subset of this bigger image is included below). When we group by division, year, and month, we try to sample evenly from all of these bins even though many are nearly empty (e.g., Sinaloa, Montana, Alabama). In this PR, the probabilistic subsampling is as likely to drop these nearly empty bins as it is to drop the larger bins (Washington, California, etc.).

[Figure: punch card plot of the number of samples per division and year/month for North America (subset of a larger image)]

Since geographic representation is already so biased, what if we preferred sampling evenly over time instead of over both time and space? This would mean setting the group_by field to just year and month. Subsampling would then randomly select at most 300 sequences spread across the months, with a higher probability of sampling from geographic regions that are already well represented.

We used a similar approach for the SPHERES USA builds. To minimize biased geographic sampling, we also prioritized contextual samples based on their genetic similarity to the focal set. In this way, samples from poorly represented states that are similar to the focal set would get included.

I’m proposing this change to the subsampling approach as an alternative to the feature in this PR, but that doesn’t mean we won’t eventually need something like this feature.

Thoughts about the implementation of probabilistic sampling

The code in this implementation works for me and, after staring at it for a little bit, it makes a lot of sense. This seems like a great solution to the problem. I did notice that most of the important lines of _calculate_sequences_per_group now use one-line boolean tests for allow_fractional. This constant checking of the output type makes the intermediate states of the code harder to track in my head.

The _calculate_sequences_per_group function now also sometimes returns a float and sometimes an integer (even though the type hint says it will always return a float). It might be easier to understand this code if we wrote the fractional calculation as a separate function and called it as needed from filter.py.
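As a rough illustration of that split, the two functions might look something like the following sketch (the function names match those discussed here, but the bodies and parameter names are simplified assumptions, not the eventual implementation):

def _calculate_sequences_per_group(target_max_value, group_sizes):
    # Sketch of the deterministic interface: always return a whole number of
    # sequences per group such that the total stays within target_max_value.
    if target_max_value < len(group_sizes):
        raise ValueError("Cannot deterministically pick fewer than one sequence per group.")
    sequences_per_group = 1
    max_group_size = max(group_sizes)
    while (sequences_per_group < max_group_size
           and sum(min(size, sequences_per_group + 1) for size in group_sizes) <= target_max_value):
        sequences_per_group += 1
    return sequences_per_group

def _calculate_fractional_sequences_per_group(target_max_value, group_sizes):
    # Sketch of the fractional interface: always return a float that can serve
    # as the Poisson mean per group, so the expected total is roughly
    # target_max_value (a real implementation might adjust for small groups).
    return target_max_value / len(group_sizes)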

If we feel like this type of functionality is exceeding the domain of augur filter, we could consider refactoring this code into a new augur subsample command (in a separate PR). We could also consider using existing alternative software like QIIME’s genome-sampler which also tries to evenly sample across time and genetic diversity.

I would be interested in opinions from others on the team about all of the above.

Code used to generate data for the plot above

The data I plotted above are available as a gist. Here is the code I used to generate the CSV file:

import datetime
import pandas as pd

# Load the full metadata and drop records with ambiguous dates (containing "X").
df = pd.read_csv("data/metadata.tsv", sep="\t")
df = df[~df["date"].str.contains("X")].copy()
df["date"] = pd.to_datetime(df["date"])

df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month

# Keep only North American samples.
df = df[df["region"] == "North America"].copy()

# Count strains per division/year/month group and assign each group the first
# day of its month as a representative date for plotting.
counts = df.groupby(["division", "year", "month"])["strain"].count().reset_index()
counts["date"] = counts.apply(lambda row: datetime.date(row["year"], row["month"], 1), axis=1)

counts.to_csv("north_america_counts_by_division_year_and_month.csv", index=False)
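For completeness, a minimal sketch of how the resulting CSV could be rendered as a punch-card-style plot (the matplotlib usage and output file name are assumptions, not the code used for the figure above):

import matplotlib.pyplot as plt
import pandas as pd

# Load the per-division/year/month counts produced by the script above.
counts = pd.read_csv("north_america_counts_by_division_year_and_month.csv", parse_dates=["date"])

# Punch-card-style scatter: one row per division, one column per year/month,
# with marker area proportional to the number of samples in that bin.
fig, ax = plt.subplots(figsize=(10, 14))
ax.scatter(counts["date"], counts["division"], s=counts["strain"], alpha=0.5)
ax.set_xlabel("Year/month")
ax.set_ylabel("Division")
fig.savefig("north_america_punchcard.png", bbox_inches="tight")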

@rneher (Member, Author) commented Nov 14, 2020

Thanks for the comments and summary, @huddlej! A few quick remarks:

@emmahodcroft (Member) commented:

I'd like to bump this as we ran into this again with the builds yesterday. I agree that we may want to revisit a more general discussion of how subsampling could happen, but until then, having some kind of fix so that we don't run into the issue again would be great.
It seems like, if we are OK with taking this 'as-is' for the moment to near-replicate the current behaviour, this is almost ready, perhaps barring some code clarifications suggested by John?

(But then I'm going to contradict myself and give a little feedback on the genetic distance + time filtering:)
I also think that while subsampling on just genetic distance + time works well for the state builds, in some regional contexts it may not be what we want for all parts of the regional builds. For example, what failed last night was the 'early sample' for Oceania. The regional builds are now almost mini-global builds, and I think we're also interested in having good 'background', not just sequences that happen to be closest to the earliest ones from that region - probably a bit of both is good. However, I agree that for even more focal builds like states and even countries, we may benefit from switching to just genetic distance + time.

@huddlej (Contributor) commented Dec 10, 2020

That's really interesting how the regional builds turn into mini-global builds with the genetic distance and time filtering approach. It makes sense that we want to have as much flexibility as possible to define our subsampling approaches for all of these different scenarios.

I'm still catching up with Nextstrain PRs, but I can prototype the function refactoring that I mentioned above this Friday to see whether it improves readability.

Splits the original `_calculate_sequences_per_group` function, which supported both deterministic and probabilistic sampling, into two separate functions to clarify the function interfaces and improve the readability of the code. Updates tests for the new `_calculate_fractional_sequences_per_group` function, moves the Snakemake functional test into a cram test, and adds additional functional tests for augur filter with subsampling.
codecov bot commented Dec 14, 2020

Codecov Report

Merging #629 (7e4eb13) into master (98c0ea3) will increase coverage by 0.07%.
The diff coverage is 46.15%.


@@            Coverage Diff             @@
##           master     #629      +/-   ##
==========================================
+ Coverage   28.68%   28.76%   +0.07%     
==========================================
  Files          39       39              
  Lines        5389     5409      +20     
  Branches     1326     1332       +6     
==========================================
+ Hits         1546     1556      +10     
- Misses       3785     3795      +10     
  Partials       58       58              
Impacted Files Coverage Δ
augur/filter.py 44.57% <46.15%> (+0.33%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 98c0ea3...7e4eb13.

@huddlej (Contributor) commented Dec 14, 2020

@rneher and @emmahodcroft, I just pushed commit e4d520c that refactors the logic for calculating sequences per group into separate functions. We could do more to make this interface more generic, but I'm happy with at least having these functions consistently return the expected types and having some minimal tests to confirm they work.

@rneher (Member, Author) commented Dec 28, 2020

This is working well for me locally.

@rneher rneher merged commit fb4d61b into master Dec 28, 2020
@huddlej huddlej deleted the feat/probabilistic_sampling branch December 28, 2020 19:44
@trvrb (Member) commented Dec 31, 2020

^ I think that, more often than not, prioritizing genetically similar background strains is confusing for regional builds. I think it makes sense when trying to do targeted genomic epi for, say, a cluster in Washington State, but it doesn't make sense for the big-picture surveillance builds that we're doing at the regional level. I'd like to do a direct comparison of flat background vs. prioritized background to investigate this.

@emmahodcroft (Member) commented Jan 1, 2021

I think it would be good to do a direct comparison and see. The idea here is that connections between the focal region and another region are more likely to be maintained, which can be very important - whereas random sampling might not maintain these. But this is the intention, and I'd be happy for a check on whether it's functioning as well as intended! It might be good to pick out a few situations we can check specifically.

@rneher (Member, Author) commented Jan 1, 2021

I don't think this is the right place for this discussion. This (closed) PR doesn't have anything to do with the priorities we use in ncov.
