
Simplify probabilistic sampling calculation #1588

Closed

victorlin opened this issue Aug 20, 2024 · 2 comments · Fixed by #1599
Labels: proposal (Proposals that warrant further discussion)

Comments

@victorlin
Member

victorlin commented Aug 20, 2024

Context

The probabilistic sampling calculation currently takes the number of sequences available per group as input.

This information doesn't appear to be necessary: the output depends only on the number of groups, not on the size of each group. Example:

from augur.filter.subsample import _calculate_fractional_sequences_per_group

_calculate_fractional_sequences_per_group(3, [1,2,3,4])
_calculate_fractional_sequences_per_group(3, [1,1,1,1])
_calculate_fractional_sequences_per_group(3, [100,100,100,100])
_calculate_fractional_sequences_per_group(3, [1,1,1,1000])
# All of the above return 0.726570078125

Taking a closer look at the implementation:

while (hi / lo) > 1.1:

This seems to be an overly complex way of approximating the fraction $\frac{\text{max sequences}}{\text{number of groups}}$. When the tolerance is tightened to

while (hi / lo) > 1.000001:

the calls above all return $0.7499999898397923 \approx \frac{3}{4}$.

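For context, here is a minimal reconstruction of the bisection (simplified, with approximate names; not the exact augur source) that shows why the group sizes drop out of the result once the candidate fraction is below 1:

def fractional_sequences_per_group(target_max_value, group_sizes, tol=1.1):
    # Approximate reconstruction for illustration, not the actual augur code.
    lo, hi = 1e-5, float(target_max_value)
    while (hi / lo) > tol:
        mid = (lo + hi) / 2
        # Once mid < 1 and every group has at least one sequence,
        # min(mid, size) is always mid, so the total reduces to
        # mid * len(group_sizes) and the group sizes never matter.
        total = sum(min(mid, size) for size in group_sizes)
        if total <= target_max_value:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

fractional_sequences_per_group(3, [1, 2, 3, 4])                # 0.726570078125, as above
fractional_sequences_per_group(3, [1, 2, 3, 4], tol=1.000001)  # ~0.75 = 3/4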
Impact

The deviation from the exact fraction can trigger the assertion error here

assert target_group_size < 1.0

when --subsample-max-sequences is slightly lower than the number of groups, as in #1598:

_calculate_fractional_sequences_per_group(400, [1,]*406)
# Returns 1.0254, but the exact calculation gives 0.9852

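With the exact fraction instead, the same case stays below 1.0 and the assertion holds:

# Exact calculation for the #1598 case: 400 desired sequences, 406 groups.
target_group_size = 400 / 406  # 0.98522..., safely below 1.0
assert target_group_size < 1.0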
Proposal

Replace _calculate_fractional_sequences_per_group() with the exact fraction target_max_value / len(group_sizes).

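A rough sketch of what the replacement could look like (variable names are illustrative; the actual call site may differ):

# Illustrative sketch of the proposal, not the actual augur code:
# compute the fraction directly instead of approximating it by bisection.
def fractional_sequences_per_group(target_max_value, group_sizes):
    return target_max_value / len(group_sizes)

fractional_sequences_per_group(3, [1, 2, 3, 4])  # 0.75
fractional_sequences_per_group(400, [1] * 406)   # 0.98522...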
@victorlin victorlin added the proposal Proposals that warrant further discussion label Aug 20, 2024
@corneliusroemer
Member

I think the reason is that one can't just pick desired_total/groups: some groups might have fewer sequences than that, so we wouldn't end up with the desired total.

@victorlin
Member Author

victorlin commented Aug 23, 2024

@corneliusroemer re: #1599 (comment) (copying here to discuss design decisions):

There's a reason the original calculation is the way it is. I don't think that's a bug or unnecessary.

Imagine there are 90 groups with 1 sequence and 10 groups with 1000. We want to sample 1000 sequences.

The original calculation would pick around 91 sequences per group.

Yours now picks 10 per group.

The original resulted in around 1000 sampled sequences.

Yours now results in only 190.

That scenario does not trigger probabilistic sampling and will not reach this code path. Execution only gets here when the number of groups exceeds the desired total number of sequences.

Your example has 100 groups, so this would only happen when sampling fewer than 100 sequences. At that point the number of sequences available in each group does not matter: either 0 or 1 sequences should be picked per group, as illustrated below.

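To make the "either 0 or 1 per group" point concrete: as far as I understand the implementation, probabilistic sampling draws each group's quota from a Poisson distribution, so a fractional target keeps the expected total on target regardless of group sizes. A quick illustration:

import numpy as np

# Illustration only: 4 groups, 3 desired sequences, fractional target 3/4.
# Each group's quota is a Poisson draw, so most groups get 0 or 1 and the
# expected total is 4 * 0.75 = 3, independent of how large each group is.
rng = np.random.default_rng(0)
draws = rng.poisson(3 / 4, size=4)
print(draws, draws.sum())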