Implicitly group by year and month for month group

Instead of calculating a new (year, month) tuple when users group by month, add a "year" key to the list of group fields. This fixes a pandas indexing bug where calling `nlargest` on a SeriesGroupBy object that has a year and month tuple key for month causes pandas to think the single month key is a MultiIndex that should be a list. Although this commit is motivated to fix this pandas issue, this implementation of the year/month disambiguation is simpler and a more idiomatic pandas implementation that wouldn't have been possible in the original augur filter implementation (before we switched to pandas for metadata parsing).
nextstrain · Dec 17, 2021 · 966da1d · 966da1d
1 parent 897d00e
commit 966da1d
Showing 1 changed file with 6 additions and 2 deletions.
diff --git a/augur/filter.py b/augur/filter.py
@@ -991,8 +991,7 @@ def expand_date_col(metadata: pd.DataFrame, group_by_set: set) -> Tuple[pd.DataF
                 "filter": "skip_group_by_with_ambiguous_month",
                 "kwargs": "",
             })
-        # month = (year, month)
-        metadata_new['month'] = list(zip(metadata_new['year'], metadata_new['month']))
+
     # TODO: support group by day
     return metadata_new, skipped_strains
 
@@ -1279,6 +1278,11 @@ def run(args):
             random_generator = np.random.default_rng(args.subsample_seed)
             priorities = defaultdict(random_generator.random)
 
+        # When grouping by month, we implicitly group by year and month to avoid
+        # grouping across years meaninglessly.
+        if len(group_by) == 1 and group_by[0] == "month":
+            group_by = ["year", "month"]
+
     # Setup metadata output. We track whether any records have been written to
     # disk yet through the following variables, to control whether we write the
     # metadata's header and open a new file for writing.