Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix probabilistic subsampling for small values #759

Merged
merged 1 commit into from
Aug 13, 2021

Conversation

huddlej
Copy link
Contributor

@huddlej huddlej commented Aug 13, 2021

Description of proposed changes

Fixes a regression from the rewrite of augur filter where probabilistic subsampling for small values of --subsample-max-sequences could randomly select zero strains and randomly cause our CI tests to fail. Prior to the rewrite of augur filter and introduction of priority queues, we fixed this issue by repeatedly attempting to calculate sequences per group that summed to an integer value greater than zero. However, the way I implemented random queue sizes inside the PriorityQueue class in the rewrite prevented me from using a similar "multiple attempts" approach.

This commit redesigns the way we create priority queues. In the case where we already know the number of sequences per group in the first pass, we create an appropriately-sized priority queue for each group as we encounter it. There is no possibility that the sum of these queue sizes could be zero.

In the case where we need to calculate the number of sequences per group from the first pass, we already know all possible groups and can create their priority queues in bulk. The new create_queues_by_group function allows us to create fixed-sized or randomly-sized queues and also make multiple attempts when queue sizes sum to zero. As a result, the PriorityQueue class is much simpler (it requires no logic about random max sizes) and we can actually test the fixed and random behaviors more carefully with doctests for create_queues_by_group.

Testing

  • Add new doctests for the new create_queues_by_group function
  • Pass CI suite
  • Test with ncov workflow

Fixes a regression from the rewrite of augur filter where probabilistic
subsampling for small values of `--subsample-max-sequences` could
randomly select zero strains and randomly cause our CI tests to fail.
Prior to the rewrite of augur filter and introduction of priority
queues, we fixed this issue by repeatedly attempting to calculate
sequences per group that summed to an integer value greater than zero.
However, the way I implemented random queue sizes inside the `PriorityQueue`
class in the rewrite prevented me from using a similar "multiple
attempts" approach.

This commit redesigns the way we create priority queues. In the case
where we already know the number of sequences per group in the first
pass, we create an appropriately-sized priority queue for each group as
we encounter it. There is no possibility that the sum of these queue
sizes could be zero.

In the case where we need to calculate the number of sequences per group
from the first pass, we already know all possible groups and can create
their priority queues in bulk. The new `create_queues_by_group` function
allows us to create fixed-sized or randomly-sized queues and also make
multiple attempts when queue sizes sum to zero. As a result, the
`PriorityQueue` class is much simpler (it requires no logic about random
max sizes) and we can actually test the fixed and random behaviors more
carefully with doctests for `create_queues_by_group`.
@codecov
Copy link

codecov bot commented Aug 13, 2021

Codecov Report

Merging #759 (5db04fc) into master (ae80716) will increase coverage by 0.10%.
The diff coverage is 80.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #759      +/-   ##
==========================================
+ Coverage   33.55%   33.66%   +0.10%     
==========================================
  Files          41       41              
  Lines        5871     5882      +11     
  Branches     1419     1423       +4     
==========================================
+ Hits         1970     1980      +10     
- Misses       3816     3817       +1     
  Partials       85       85              
Impacted Files Coverage Δ
augur/filter.py 68.02% <80.00%> (+0.49%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ae80716...5db04fc. Read the comment docs.

@huddlej huddlej merged commit 52a41c3 into master Aug 13, 2021
@huddlej huddlej deleted the fix-probabilistic-sampling branch August 13, 2021 23:12
@huddlej huddlej added this to the Patch release 12.1.1 milestone Aug 13, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant