filter: Rewrite priority queue logic with pandas functions #809

- remove `class PriorityQueue`
- use `prioritized_metadata` DataFrame in place of `queues_per_group`
- repurpose `create_queues_per_group` to `create_sizes_per_group`
- other logical refactoring:
    - use global dummy group key and value
        - key is `list`: `pd.DataFrame.groupby` does not take a tuple as grouping key, also our `--group-by` is stored as list already.
        - value is `tuple`: `get_groups_for_subsampling` currently returns group values in this form.
    - use `records_per_group` for `_dummy`
    - replace conditional logic of `records_per_group is not None` with `group_by`
- add functional tests
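As a rough illustration of the idea behind these changes (not the code in this PR), the sketch below keeps the highest-priority records per group using only pandas operations on one DataFrame. The function name `select_top_priority`, the `priority` column, the sample data, and the `_dummy` column name are assumptions made for this example; only the list-valued group key and tuple-valued group values follow the description above.

```python
import pandas as pd

# Hypothetical dummy group: the key is a list (usable with groupby),
# the value is a tuple (the form group values take in this sketch).
DUMMY_GROUP_BY = ["_dummy"]
DUMMY_GROUP_VALUE = ("_dummy",)


def select_top_priority(metadata, group_by, sizes_per_group):
    """Keep the highest-priority records in each group.

    metadata: DataFrame with a unique index and a numeric "priority" column.
    group_by: list of column names, or an empty list for no grouping.
    sizes_per_group: dict mapping group value tuples to a maximum size.
    """
    metadata = metadata.copy()
    if not group_by:
        # Without grouping, place every record in one dummy group.
        group_by = DUMMY_GROUP_BY
        metadata[group_by[0]] = DUMMY_GROUP_VALUE[0]

    # Rank records inside each group from highest to lowest priority.
    ranked = metadata.sort_values("priority", ascending=False)
    rank_in_group = ranked.groupby(group_by).cumcount()

    # Look up the allowed size for the group each record belongs to.
    keys = ranked[group_by].itertuples(index=False, name=None)
    allowed = pd.Series(
        [sizes_per_group[key] for key in keys],
        index=ranked.index,
    )
    return ranked[rank_in_group < allowed]


metadata = pd.DataFrame(
    {
        "strain": ["A", "B", "C", "D", "E"],
        "region": ["Asia", "Asia", "Asia", "Europe", "Europe"],
        "priority": [0.9, 0.1, 0.5, 0.3, 0.8],
    }
).set_index("strain")

grouped = select_top_priority(
    metadata, ["region"], {("Asia",): 2, ("Europe",): 1}
)
ungrouped = select_top_priority(metadata, [], {DUMMY_GROUP_VALUE: 2})
```

Here `grouped` holds the two highest-priority Asia records and the single highest-priority Europe record, while `ungrouped` holds the two highest-priority records overall. Because the grouped and ungrouped cases share one code path, the sketch branches on `group_by` rather than on whether a per-group size was given.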

…g logic

This test currently fails with a pandas-specific index error.

Instead of calculating a new (year, month) tuple when users group by month, add a "year" key to the list of group fields. This fixes a pandas indexing bug: calling `nlargest` on a SeriesGroupBy object whose month key is a (year, month) tuple causes pandas to treat the single month key as a MultiIndex that should be a list. Although this commit is motivated by that pandas issue, this way of disambiguating year and month is also simpler and more idiomatic pandas, and it would not have been possible in the original augur filter implementation (before we switched to pandas for metadata parsing).
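A minimal sketch of this approach, assuming complete ISO dates in a `date` column and a numeric `priority` column, might look like the following. The helper name `expand_month_grouping` and the sample data are hypothetical and are not taken from the augur code.

```python
import pandas as pd


def expand_month_grouping(metadata, group_by):
    """Add scalar "year" and "month" columns for date-based grouping.

    When "month" is requested without "year", "year" is appended to the
    group fields so that the same month in different years stays distinct.
    Assumes every value in the "date" column is a complete ISO date.
    """
    metadata = metadata.copy()
    group_by = list(group_by)
    dates = pd.to_datetime(metadata["date"])

    if "year" in group_by or "month" in group_by:
        metadata["year"] = dates.dt.year
    if "month" in group_by:
        # Each month is a single value, not a (year, month) tuple.
        metadata["month"] = dates.dt.month
        if "year" not in group_by:
            group_by.append("year")
    return metadata, group_by


metadata = pd.DataFrame(
    {
        "strain": ["A", "B", "C", "D"],
        "date": ["2020-01-05", "2020-01-20", "2021-01-11", "2021-02-01"],
        "priority": [0.2, 0.7, 0.4, 0.9],
    }
).set_index("strain")

expanded, group_by = expand_month_grouping(metadata, ["month"])

# Every group key is now a plain scalar column, so nlargest on the
# SeriesGroupBy object works without tuple-valued keys.
top = expanded.groupby(group_by)["priority"].nlargest(1)
```

With this sample data, January 2020 and January 2021 form separate groups because `group_by` becomes `["month", "year"]`.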

Simplifies unit tests and doctests by expecting a single value for each month instead of a tuple.

Commits on Dec 10, 2021

Commits on Dec 11, 2021

Commits on Dec 17, 2021