Skip to content

Commit

Permalink
Implicitly group by year and month for month group
Browse files Browse the repository at this point in the history
Instead of calculating a new (year, month) tuple when users group by
month, add a "year" key to the list of group fields. This fixes a pandas
indexing bug where calling `nlargest` on a SeriesGroupBy object that has
a year and month tuple key for month causes pandas to think the single
month key is a MultiIndex that should be a list. Although this commit is
motivated to fix this pandas issue, this implementation of the
year/month disambiguation is simpler and a more idiomatic pandas
implementation that wouldn't have been possible in the original augur
filter implementation (before we switched to pandas for metadata
parsing).
  • Loading branch information
huddlej committed Dec 17, 2021
1 parent 897d00e commit 966da1d
Showing 1 changed file with 6 additions and 2 deletions.
8 changes: 6 additions & 2 deletions augur/filter.py
Original file line number Diff line number Diff line change
Expand Up @@ -991,8 +991,7 @@ def expand_date_col(metadata: pd.DataFrame, group_by_set: set) -> Tuple[pd.DataF
"filter": "skip_group_by_with_ambiguous_month",
"kwargs": "",
})
# month = (year, month)
metadata_new['month'] = list(zip(metadata_new['year'], metadata_new['month']))

# TODO: support group by day
return metadata_new, skipped_strains

Expand Down Expand Up @@ -1279,6 +1278,11 @@ def run(args):
random_generator = np.random.default_rng(args.subsample_seed)
priorities = defaultdict(random_generator.random)

# When grouping by month, we implicitly group by year and month to avoid
# grouping across years meaninglessly.
if len(group_by) == 1 and group_by[0] == "month":
group_by = ["year", "month"]

# Setup metadata output. We track whether any records have been written to
# disk yet through the following variables, to control whether we write the
# metadata's header and open a new file for writing.
Expand Down

0 comments on commit 966da1d

Please sign in to comment.