filter: Rewrite priority queue logic with pandas functions #809

Draft: wants to merge 9 commits into base `master`

Commits on Dec 10, 2021

  1. 1b106d5
  2. rewrite PriorityQueue logic with pandas functions

    - remove `class PriorityQueue`
    - use `prioritized_metadata` DataFrame in place of `queues_per_group`
    - repurpose `create_queues_per_group` to `create_sizes_per_group`
    - other logical refactoring:
        - use global dummy group key and value
            - key is a `list`: `pd.DataFrame.groupby` does not take a tuple as a grouping key, and our `--group-by` is already stored as a list.
            - value is a `tuple`: `get_groups_for_subsampling` currently returns group values in this form.
        - use records_per_group for _dummy
            - replace conditional logic of `records_per_group is not None` with `group_by`
    - add functional tests
    victorlin committed Dec 10, 2021
    dc0eda2
  3. 4e3e155
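
The "rewrite PriorityQueue logic with pandas functions" commit above describes replacing per-group queues with a single `prioritized_metadata` DataFrame. A minimal sketch of that idea, not the PR's actual code, might look like the following. The function name `top_priority_strains`, the column names, and the `_dummy` column name are assumptions made for illustration; only `groupby` and `nlargest` are real pandas calls.

```python
import pandas as pd

# Hypothetical dummy group key: a list, matching how `--group-by` is stored.
DUMMY_GROUP_KEY = ["_dummy"]


def top_priority_strains(metadata, priorities, group_by, records_per_group):
    """Return the strain names with the highest priority in each group.

    `metadata` is indexed by strain name; `priorities` maps strain -> score.
    """
    prioritized_metadata = metadata.copy()
    # Assigning a Series aligns the scores on the strain index.
    prioritized_metadata["priority"] = pd.Series(priorities)

    if not group_by:
        # No grouping requested: place every record in one dummy group so the
        # same groupby code path handles both cases.
        prioritized_metadata["_dummy"] = "_dummy"
        group_by = DUMMY_GROUP_KEY

    # Keep the N largest priorities within each group. The result is indexed
    # by the group columns followed by the original strain index.
    top = prioritized_metadata.groupby(group_by)["priority"].nlargest(records_per_group)
    return sorted(top.index.get_level_values(prioritized_metadata.index.name))


metadata = pd.DataFrame(
    {
        "strain": ["A", "B", "C", "D", "E"],
        "region": ["Asia", "Asia", "Asia", "Europe", "Europe"],
    }
).set_index("strain")
priorities = {"A": 0.9, "B": 0.1, "C": 0.5, "D": 0.3, "E": 0.8}

print(top_priority_strains(metadata, priorities, ["region"], 1))
print(top_priority_strains(metadata, priorities, [], 2))
```

Routing the ungrouped case through a dummy group means there is one selection path rather than a separate branch for `records_per_group is not None`, which appears to be the simplification the commit message describes.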

Commits on Dec 11, 2021

  1. 9ac13ea
  2. 9cf2264
  3. e01d302

Commits on Dec 17, 2021

  1. Add test for grouping by month alone

    This test currently fails with a pandas-specific index error.
    huddlej committed Dec 17, 2021
    897d00e
  2. Implicitly group by year and month for month group

    Instead of calculating a new (year, month) tuple when users group by
    month, add a "year" key to the list of group fields. This fixes a pandas
    indexing bug where calling `nlargest` on a SeriesGroupBy object that has
    a (year, month) tuple as the month value causes pandas to treat the
    single month key as a MultiIndex that should be a list. Although this
    commit is motivated by that pandas issue, this approach to year/month
    disambiguation is also simpler and more idiomatic pandas. It wouldn't
    have been possible in the original augur filter implementation (before
    we switched to pandas for metadata parsing).
    huddlej committed Dec 17, 2021
    966da1d
  3. Update unit and doc tests to match new month group

    Simplifies unit tests and doctests by expecting a single value for each
    month instead of a tuple.
    huddlej committed Dec 17, 2021
    eea96fb
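
The Dec 17 commits describe grouping by month by implicitly adding a "year" field, so each month value stays a single scalar instead of a (year, month) tuple. The sketch below illustrates that approach under stated assumptions: the helper name `add_date_group_columns` is hypothetical, the `date` column is assumed to hold complete ISO dates (no ambiguous dates such as `2020-XX-XX`), and placing "year" directly before "month" is an illustrative choice rather than something the commit message specifies.

```python
import pandas as pd


def add_date_group_columns(metadata, group_by):
    """Return (metadata with year/month columns, effective group-by list)."""
    group_by = list(group_by)
    if "month" in group_by and "year" not in group_by:
        # Disambiguate months across years by grouping on year as well.
        group_by.insert(group_by.index("month"), "year")

    dates = pd.to_datetime(metadata["date"])
    result = metadata.copy()
    if "year" in group_by:
        result["year"] = dates.dt.year
    if "month" in group_by:
        # A plain integer per record, not a (year, month) tuple.
        result["month"] = dates.dt.month
    return result, group_by


metadata = pd.DataFrame(
    {
        "strain": ["A", "B", "C", "D"],
        "date": ["2020-01-15", "2020-01-20", "2021-01-03", "2021-02-10"],
    }
).set_index("strain")

with_dates, effective_group_by = add_date_group_columns(metadata, ["month"])
sizes = with_dates.groupby(effective_group_by).size()

# Convert to plain Python types for display.
group_sizes = {tuple(int(part) for part in key): int(count) for key, count in sizes.items()}
print(effective_group_by)
print(group_sizes)
```

January 2020 and January 2021 land in different groups because the year is a separate grouping column, while every value in the `month` column remains a scalar.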