Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

filter: Fix groupby with incomplete dates #808

Merged
merged 2 commits into from
Dec 10, 2021

Conversation

victorlin
Copy link
Member

@victorlin victorlin commented Dec 10, 2021

This PR fixes #807.

Summary

augur/augur/filter.py

Lines 916 to 918 in 1adfed8

date_cols = ['year', 'month', 'day']
df_dates = (metadata['date'].str.split('-', n=2, expand=True)
.set_axis(date_cols, axis=1))

set_axis here raises ValueError when the entire metadata chunk has incomplete dates (e.g. 2020), since the single column returned by pandas.Series.str.split can't be mapped to all 3 date_cols.

This PR addresses the issue with:

  1. Call set_axis with the correct amount of column names based on pandas.Series.str.split.
  2. Create missing columns, filling with pandas.NA.

New code:

augur/augur/filter.py

Lines 916 to 921 in a6cdb3e

date_cols = ['year', 'month', 'day']
df_dates = metadata['date'].str.split('-', n=2, expand=True)
df_dates = df_dates.set_axis(date_cols[:len(df_dates.columns)], axis=1)
missing_date_cols = set(date_cols) - set(df_dates.columns)
for col in missing_date_cols:
df_dates[col] = pd.NA

Testing

  • Added broken tests and fixed accordingly.

@codecov
Copy link

codecov bot commented Dec 10, 2021

Codecov Report

Merging #808 (a6cdb3e) into master (1adfed8) will increase coverage by 0.04%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #808      +/-   ##
==========================================
+ Coverage   33.76%   33.81%   +0.04%     
==========================================
  Files          41       41              
  Lines        5896     5900       +4     
  Branches     1464     1465       +1     
==========================================
+ Hits         1991     1995       +4     
  Misses       3822     3822              
  Partials       83       83              
Impacted Files Coverage Δ
augur/filter.py 68.16% <100.00%> (+0.24%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1adfed8...a6cdb3e. Read the comment docs.

@victorlin
Copy link
Member Author

victorlin commented Dec 10, 2021

@huddlej from #807 (comment):

Add unit test to confirm that this error occurs with missing month/day values

Done: 1e88686

Check for the number elements from the split and only assign as many columns (date_cols) as there are elements in the split output.

I did this to fix set_axis and added an additional step to create missing columns with NA value. That way, the subsequent ambiguous dates logic can be applied.

Also, the results may differ when some data have year/month/day values and others in the chunk don't. Would be worth testing this case, too.

This is already tested here:

def test_filter_groupby_skip_missing_month(self, valid_metadata: pd.DataFrame):
groups = ['country', 'year', 'month']
metadata = valid_metadata.copy()
metadata.at["SEQ_2", "date"] = "2020"
strains = metadata.index.tolist()
group_by_strain, skipped_strains = get_groups_for_subsampling(strains, metadata, group_by=groups)
assert group_by_strain == {
'SEQ_1': ('A', 2020, (2020, 1)),
'SEQ_3': ('B', 2020, (2020, 3)),
'SEQ_4': ('B', 2020, (2020, 4)),
'SEQ_5': ('B', 2020, (2020, 5))
}
assert skipped_strains == [{'strain': 'SEQ_2', 'filter': 'skip_group_by_with_ambiguous_month', 'kwargs': ''}]

Copy link
Contributor

@huddlej huddlej left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great and fixes the issue in the original dataset where I found it. Thank you for the quick work on this, @victorlin! If you can merge this, I'll release it today in 13.0.5.

@victorlin victorlin merged commit 333a36f into master Dec 10, 2021
@victorlin victorlin deleted the victorlin/filter/fix-groupby-missing-date branch December 10, 2021 19:00
@huddlej huddlej added this to the Patch release 13.0.5 milestone Dec 10, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

filter: Group by fails when grouping by date when data have missing month/day in the date field
2 participants