-
Notifications
You must be signed in to change notification settings - Fork 129
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
in read_metadata update pandas series object name to reflect name col… #564
Conversation
…umn value from metadata
Codecov Report
@@ Coverage Diff @@
## master #564 +/- ##
==========================================
+ Coverage 23.70% 23.95% +0.24%
==========================================
Files 32 32
Lines 5159 5160 +1
Branches 1300 1300
==========================================
+ Hits 1223 1236 +13
+ Misses 3885 3876 -9
+ Partials 51 48 -3
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks, @akshaysu12!
I am wondering whether a better solution would be to loop through keys ["strain", "name"]
, deduplicate the code block, and do all queries to the data frame with named keys instead of attributes (i.e. ["name"] instead of ".name").
I am referring to code following:
Line 93 in ab9cc74
if hasattr(val, "strain"): |
Yes, I agree a small refactor makes sense here. Thanks for the review! |
Thank you, @akshaysu12! Although a refactor would make the code slightly more readable, I would be happy to merge this as it is. Your tests actually reveal several other larger issues with the
Unless you've already put work into the refactor and are about to push (or @rneher feels strongly about the refactor), I'd recommend holding off. I'll create separate issues for the |
…umn value from metadata
Description of proposed changes
Util function read_metadata should throw a value error if the column titled "name" has duplicate values.
The pandas series object was using it's own default name attribute to check for duplicates. This name was being set to the index of the row instead of the value under the name column.
Related issue(s)
Fixes #563 - Metadata file with duplicate values in "name" column does not throw an error.
Testing
What steps should be taken to test the changes you've proposed?
Create a metadata file with column titled name (instead of a column titled strain)
Add rows to the metadata file such that the name column has duplicate values (rows 0,1 have same value for name). I modified data/metadata.tsv in the zika tutorial.
Run augur filter using the metadata file created.
ex: augur filter --sequences data/sequences.fasta --metadata data/metadata_name.tsv --exclude config/dropped_strains.txt --output results/filtered.fasta --group-by country year month --sequences-per-group 20
See that error is now thrown.
If you added or changed behavior in the codebase, did you update the tests, or do you need help with this?
Unit tests for read_metadata have been added/expanded in test_utils.py
Thank you for contributing to Nextstrain!