
Simplify probabilistic sampling calculation #1588

Closed

victorlin opened this issue Aug 20, 2024 · 2 comments · Fixed by #1599
Labels: proposal (Proposals that warrant further discussion)

Comments

@victorlin
Member

victorlin commented Aug 20, 2024

Context

The probabilistic sampling calculation currently takes the number of sequences available per group as input.

This information doesn't appear to be necessary: the output depends only on the number of groups, not on the size of each group. Example:

from augur.filter.subsample import _calculate_fractional_sequences_per_group

_calculate_fractional_sequences_per_group(3, [1,2,3,4])
_calculate_fractional_sequences_per_group(3, [1,1,1,1])
_calculate_fractional_sequences_per_group(3, [100,100,100,100])
_calculate_fractional_sequences_per_group(3, [1,1,1,1000])
# All of the above return 0.726570078125

Taking a closer look at the implementation:

while (hi / lo) > 1.1:

This seems to be an overly complex way of approximating the fraction $\frac{\text{max sequences}}{\text{number of groups}}$. When the tolerance is tightened to

while (hi / lo) > 1.000001:

the calls above all return $0.7499999898397923 \approx \frac{3}{4}$.

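For context, here is a minimal reconstruction of the bisection (simplified, with approximate names; not the exact augur source) that shows why the group sizes drop out of the result once the candidate fraction is below 1:

def fractional_sequences_per_group(target_max_value, group_sizes, tol=1.1):
    # Approximate reconstruction for illustration, not the actual augur code.
    lo, hi = 1e-5, float(target_max_value)
    while (hi / lo) > tol:
        mid = (lo + hi) / 2
        # Once mid < 1 and every group has at least one sequence,
        # min(mid, size) is always mid, so the total reduces to
        # mid * len(group_sizes) and the group sizes never matter.
        total = sum(min(mid, size) for size in group_sizes)
        if total <= target_max_value:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

fractional_sequences_per_group(3, [1, 2, 3, 4])                # 0.726570078125, as above
fractional_sequences_per_group(3, [1, 2, 3, 4], tol=1.000001)  # ~0.75 = 3/4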
Impact

The deviation from the exact fraction can trigger the assertion error here

assert target_group_size < 1.0

when --subsample-max-sequences is slightly lower than the number of groups, as in #1598:

_calculate_fractional_sequences_per_group(400, [1,]*406)
# Returns 1.0254, but the exact calculation gives 0.9852

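With the exact fraction instead, the same case stays below 1.0 and the assertion holds:

# Exact calculation for the #1598 case: 400 desired sequences, 406 groups.
target_group_size = 400 / 406  # 0.98522..., safely below 1.0
assert target_group_size < 1.0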
Proposal

Replace _calculate_fractional_sequences_per_group() with the exact fraction target_max_value / len(group_sizes).

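A rough sketch of what the replacement could look like (variable names are illustrative; the actual call site may differ):

# Illustrative sketch of the proposal, not the actual augur code:
# compute the fraction directly instead of approximating it by bisection.
def fractional_sequences_per_group(target_max_value, group_sizes):
    return target_max_value / len(group_sizes)

fractional_sequences_per_group(3, [1, 2, 3, 4])  # 0.75
fractional_sequences_per_group(400, [1] * 406)   # 0.98522...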
@victorlin victorlin added the proposal Proposals that warrant further discussion label Aug 20, 2024
@corneliusroemer
Member

I think the reason is that one can't just pick desired_total/groups: some groups might have fewer sequences than that, so we wouldn't end up with the desired total.

@victorlin
Member Author

victorlin commented Aug 23, 2024

@corneliusroemer re: #1599 (comment) (copying here to discuss design decisions):

There's a reason the original calculation is the way it is. I don't think that's a bug or unnecessary.

Imagine there are 90 groups with 1 sequence and 10 groups with 1000. We want to sample 1000 sequences.

The original calculation would pick around 91 sequences per group.

Yours now picks 10 per group.

The original resulted in around 1000 sampled sequences.

Yours now results in only 190.

That scenario does not trigger probabilistic sampling and will not reach this code path. Execution only gets here when the number of groups exceeds the desired total number of sequences.

Your example has 100 groups, so this would only happen when sampling fewer than 100 sequences. At that point the number of sequences available in each group does not matter: either 0 or 1 sequences should be picked per group, as illustrated below.

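To make the "either 0 or 1 per group" point concrete: as far as I understand the implementation, probabilistic sampling draws each group's quota from a Poisson distribution, so a fractional target keeps the expected total on target regardless of group sizes. A quick illustration:

import numpy as np

# Illustration only: 4 groups, 3 desired sequences, fractional target 3/4.
# Each group's quota is a Poisson draw, so most groups get 0 or 1 and the
# expected total is 4 * 0.75 = 3, independent of how large each group is.
rng = np.random.default_rng(0)
draws = rng.poisson(3 / 4, size=4)
print(draws, draws.sum())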