
Feat/probabilistic sampling #629

Merged: 4 commits from feat/probabilistic_sampling into master on Dec 28, 2020
Conversation

@rneher (Member) commented Nov 10, 2020

Description of proposed changes

Allow probabilistic sampling of sequences when there are many more sampling categories than target sequences. This PR samples from each category with a probability chosen so that, on average, the requested number of sequences is selected.

Thank you for contributing to Nextstrain!

@rneher rneher requested a review from huddlej November 10, 2020 16:57
@huddlej huddlej added this to the Feature release 10.1.0 milestone Nov 12, 2020
@huddlej (Contributor) left a comment

This PR relates to a couple of bigger issues for Augur and ncov builds, so I’ve tried to summarize the context of the PR as I understand it and also provide some alternate approaches for the ncov builds specifically. I also included some thoughts about the specific implementation.

Context and summary

The context for this feature was that we were trying to subsample early North American strains in ncov builds, grouping by division, year, and month, while allowing a maximum of 300 sequences. There are more combinations of divisions, years, and months (>400) than the number of sequences allowed, so the current subsampling algorithm fails because it cannot pick fewer than 1 sequence per group.

This PR’s feature allows users to specify a maximum number of sequences that is less than the total number of groups, by allowing a fractional number of sequences per group and sampling each group's count from a Poisson distribution with that fraction as its mean. This lets randomly chosen groups have zero strains selected, while the expected total number of strains sampled equals the maximum number of sequences allowed.
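To make the mechanism concrete, here is a minimal sketch of the idea (not the code in this PR; the function name and the use of numpy are illustrative assumptions):

import numpy as np

def sketch_poisson_sampling(max_sequences, num_groups, seed=None):
    # Hypothetical illustration of probabilistic sampling: draw each group's
    # sample size from a Poisson distribution whose mean is the fractional
    # number of sequences per group, so many groups draw zero sequences while
    # the expected total across all groups equals max_sequences.
    rng = np.random.default_rng(seed)
    mean_per_group = max_sequences / num_groups  # e.g., 300 / 415 ≈ 0.72
    return rng.poisson(mean_per_group, size=num_groups)

# For example, sketch_poisson_sampling(300, 415).sum() is roughly 300 on average.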

Thoughts about subsampling

Our current subsampling approach treats each group (e.g., a combination of division, year, and month) as an equally important, discrete bin, but not all of these bins are equally represented and not all may be equally important.

For example, of the 415+ division/year/month groups, most bins do not have data. We can see this in a punch card plot of the number of samples per year/month and division (a subset of this bigger image is included below). When we group by division, year, and month, we try to sample evenly from all of these bins even though many are nearly empty (e.g., Sinaloa, Montana, Alabama). In this PR, the probabilistic subsampling is as likely to drop these nearly empty bins as it is to drop the larger bins (Washington, California, etc.).

[Figure: punch card plot of the number of samples per division and year/month for North America (subset of a larger image)]

Since geographic representation is already so biased, what if we preferred sampling evenly over time instead of over both time and space? This would mean setting the group_by field to just year and month. Subsampling would then randomly select at most 300 sequences spread across the months, with a higher probability of sampling from geographic regions that are already well represented.

We used a similar approach for the SPHERES USA builds. To minimize biased geographic sampling, we also prioritized contextual samples based on their genetic similarity to the focal set. In this way, samples from poorly represented states that are similar to the focal set would get included.

I’m proposing this change to the subsampling approach as an alternative to the feature in this PR, but that doesn’t mean we won’t eventually need something like this feature.

Thoughts about the implementation of probabilistic sampling

The code in this implementation works for me and, after staring at it for a little bit, it makes a lot of sense. This seems like a great solution to the problem. I did notice that most of the important lines of _calculate_sequences_per_group now use one-line boolean tests for allow_fractional. This constant checking of the output type makes the intermediate states of the code harder to track in my head.

The _calculate_sequences_per_group function now also sometimes returns a float and sometimes an integer (even though the type hint says it will always return a float). It might be easier to understand this code if we wrote the fractional calculation as a separate function and called it as needed from filter.py.
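As a rough illustration of that split, the two functions might look something like the following sketch (the function names match those discussed here, but the bodies and parameter names are simplified assumptions, not the eventual implementation):

def _calculate_sequences_per_group(target_max_value, group_sizes):
    # Sketch of the deterministic interface: always return a whole number of
    # sequences per group such that the total stays within target_max_value.
    if target_max_value < len(group_sizes):
        raise ValueError("Cannot deterministically pick fewer than one sequence per group.")
    sequences_per_group = 1
    max_group_size = max(group_sizes)
    while (sequences_per_group < max_group_size
           and sum(min(size, sequences_per_group + 1) for size in group_sizes) <= target_max_value):
        sequences_per_group += 1
    return sequences_per_group

def _calculate_fractional_sequences_per_group(target_max_value, group_sizes):
    # Sketch of the fractional interface: always return a float that can serve
    # as the Poisson mean per group, so the expected total is roughly
    # target_max_value (a real implementation might adjust for small groups).
    return target_max_value / len(group_sizes)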

If we feel like this type of functionality is exceeding the domain of augur filter, we could consider refactoring this code into a new augur subsample command (in a separate PR). We could also consider using existing alternative software like QIIME’s genome-sampler which also tries to evenly sample across time and genetic diversity.

I would be interested in opinions from others on the team about all of the above.

Code used to generate data for the plot above

The data I plotted above are available as a gist. Here is the code I used to generate the CSV file:

import datetime
import pandas as pd

# Load the full metadata and drop records with ambiguous dates (containing "X").
df = pd.read_csv("data/metadata.tsv", sep="\t")
df = df[~df["date"].str.contains("X")].copy()
df["date"] = pd.to_datetime(df["date"])

df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month

# Keep only North American samples.
df = df[df["region"] == "North America"].copy()

# Count strains per division/year/month group and assign each group the first
# day of its month as a representative date for plotting.
counts = df.groupby(["division", "year", "month"])["strain"].count().reset_index()
counts["date"] = counts.apply(lambda row: datetime.date(row["year"], row["month"], 1), axis=1)

counts.to_csv("north_america_counts_by_division_year_and_month.csv", index=False)
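For completeness, a minimal sketch of how the resulting CSV could be rendered as a punch-card-style plot (the matplotlib usage and output file name are assumptions, not the code used for the figure above):

import matplotlib.pyplot as plt
import pandas as pd

# Load the per-division/year/month counts produced by the script above.
counts = pd.read_csv("north_america_counts_by_division_year_and_month.csv", parse_dates=["date"])

# Punch-card-style scatter: one row per division, one column per year/month,
# with marker area proportional to the number of samples in that bin.
fig, ax = plt.subplots(figsize=(10, 14))
ax.scatter(counts["date"], counts["division"], s=counts["strain"], alpha=0.5)
ax.set_xlabel("Year/month")
ax.set_ylabel("Division")
fig.savefig("north_america_punchcard.png", bbox_inches="tight")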

@rneher (Member, Author) commented Nov 14, 2020

Thanks for the comments and summary, @huddlej! A few quick remarks:

@emmahodcroft (Member) commented:

I'd like to bump this as we ran into this again with the builds yesterday. I agree that we may want to revisit a more general discussion of how subsampling could happen, but until then, having some kind of fix so that we don't run into the issue again would be great.
It seems like, if we are OK with taking this 'as-is' for the moment to near-replicate the current behaviour, this is almost ready, perhaps barring some code clarifications suggested by John?

(But then I'm going to contradict myself and give a little feedback on the genetic distance + time filtering:)
I also think that while subsampling on just genetic distance + time works well for the state builds, in some regional contexts it may not be what we want for all parts of the regional builds. For example, what failed last night was the 'early sample' for Oceania. The regional builds are now almost mini-global builds, and I think we're also interested in having good 'background', not just sequences that happen to be closest to the earliest ones from that region - probably a bit of both is good. However, I agree that for even more focal builds like states and even countries, we may benefit from switching to just genetic distance + time.

@huddlej (Contributor) commented Dec 10, 2020

That's really interesting how the regional builds turn into mini-global builds with the genetic distance and time filtering approach. It makes sense that we want to have as much flexibility as possible to define our subsampling approaches for all of these different scenarios.

I'm still catching up with Nextstrain PRs, but I can prototype the function refactoring that I mentioned above this Friday to see whether it improves readability.

Splits the original `_calculate_sequences_per_group` function, which supported both deterministic and probabilistic sampling, into two separate functions to clarify the function interfaces and improve the readability of the code. Updates tests for the new `_calculate_fractional_sequences_per_group` function, moves the Snakemake functional test into a cram test, and adds additional functional tests for augur filter with subsampling.
codecov bot commented Dec 14, 2020

Codecov Report

Merging #629 (7e4eb13) into master (98c0ea3) will increase coverage by 0.07%.
The diff coverage is 46.15%.


@@            Coverage Diff             @@
##           master     #629      +/-   ##
==========================================
+ Coverage   28.68%   28.76%   +0.07%     
==========================================
  Files          39       39              
  Lines        5389     5409      +20     
  Branches     1326     1332       +6     
==========================================
+ Hits         1546     1556      +10     
- Misses       3785     3795      +10     
  Partials       58       58              
Impacted Files Coverage Δ
augur/filter.py 44.57% <46.15%> (+0.33%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 98c0ea3...7e4eb13.

@huddlej (Contributor) commented Dec 14, 2020

@rneher and @emmahodcroft, I just pushed commit e4d520c that refactors the logic for calculating sequences per group into separate functions. We could do more to make this interface more generic, but I'm happy with at least having these functions consistently return the expected types and having some minimal tests to confirm they work.

@rneher (Member, Author) commented Dec 28, 2020

This is working well for me locally.

@rneher rneher merged commit fb4d61b into master Dec 28, 2020
@huddlej huddlej deleted the feat/probabilistic_sampling branch December 28, 2020 19:44
@trvrb (Member) commented Dec 31, 2020

^ I think that, more often than not, prioritizing genetically similar background strains is confusing for regional builds. I think it makes sense when trying to do targeted genomic epi for, say, a cluster in Washington State, but it doesn't make sense for the big-picture surveillance builds that we're doing at the regional level. I'd like to do a direct comparison of flat background vs. prioritized background to investigate this.

@emmahodcroft (Member) commented Jan 1, 2021

I think it would be good to do a direct comparison and see. The idea here is that connections between the focal region and another region are more likely to be maintained, which can be very important - whereas random sampling might not maintain these. But this is the intention, and I'd be happy for a check on whether it's functioning as well as intended! It might be good to pick out a few situations we can check specifically.

@rneher (Member, Author) commented Jan 1, 2021

I don't think this is the right place for this discussion. This (closed) PR doesn't have anything to do with the priorities we use in ncov.
