in read_metadata update pandas series object name to reflect name col… #564

akshaysu12 · 2020-06-16T20:16:27Z

…umn value from metadata

Description of proposed changes

The utility function read_metadata should raise a ValueError if the column titled "name" contains duplicate values.

The duplicate check was reading the pandas Series object's own name attribute, which pandas sets to the row's index label, instead of the value stored in the name column.
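A minimal sketch of the pitfall described above, using a made-up two-row table rather than the real metadata file. The column values are hypothetical; only standard pandas behavior is shown: each row yielded by iterrows is a Series whose name attribute holds the row's index label.

```python
import pandas as pd

# Hypothetical metadata with a duplicated value in the "name" column.
df = pd.DataFrame({"name": ["virus_A", "virus_A"], "country": ["Brazil", "Peru"]})

# Attribute access returns the Series' own name, i.e. the row's index label.
attr_values = [row.name for _, row in df.iterrows()]

# Key access returns the value stored in the "name" column.
key_values = [row["name"] for _, row in df.iterrows()]

print(attr_values)  # row labels, which are all distinct
print(key_values)   # column values, which contain the duplicate
```

Because the row labels are always unique here, a duplicate check based on the attribute never fires, while the same check based on key access does.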

Related issue(s)

Fixes #563 - Metadata file with duplicate values in "name" column does not throw an error.

Testing

What steps should be taken to test the changes you've proposed?

Create a metadata file with a column titled name (instead of a column titled strain).
Add rows to the metadata file such that the name column has duplicate values (for example, rows 0 and 1 have the same value for name). I modified data/metadata.tsv in the zika tutorial.
Run augur filter using the metadata file created.
Example: augur filter --sequences data/sequences.fasta --metadata data/metadata_name.tsv --exclude config/dropped_strains.txt --output results/filtered.fasta --group-by country year month --sequences-per-group 20
Confirm that an error is now raised.
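The first two steps above could be scripted roughly as follows. This is only a sketch: the helper name make_duplicate_name_table is hypothetical, and it assumes the input table has a strain column and at least two rows.

```python
import pandas as pd

def make_duplicate_name_table(metadata):
    """Rename 'strain' to 'name' and copy row 0's name into row 1.

    Hypothetical helper for building a test fixture; not part of augur.
    """
    table = metadata.rename(columns={"strain": "name"}).reset_index(drop=True)
    table.loc[1, "name"] = table.loc[0, "name"]
    return table

# Possible usage with the paths mentioned in the steps above:
# table = make_duplicate_name_table(pd.read_csv("data/metadata.tsv", sep="\t"))
# table.to_csv("data/metadata_name.tsv", sep="\t", index=False)
```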

If you added or changed behavior in the codebase, did you update the tests, or do you need help with this?

Unit tests for read_metadata have been added/expanded in test_utils.py

Thank you for contributing to Nextstrain!

codecov · 2020-06-16T20:21:44Z

Codecov Report

Merging #564 into master will increase coverage by 0.24%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #564      +/-   ##
==========================================
+ Coverage   23.70%   23.95%   +0.24%     
==========================================
  Files          32       32              
  Lines        5159     5160       +1     
  Branches     1300     1300              
==========================================
+ Hits         1223     1236      +13     
+ Misses       3885     3876       -9     
+ Partials       51       48       -3

Impacted Files	Coverage Δ
augur/utils.py	`30.97% <100.00%> (+2.89%)`	⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 0ead368...ab9cc74.

rneher

thanks, @akshaysu12!

I am wondering whether a better solution would be to loop through keys ["strain", "name"], deduplicate the code block, and do all queries to the data frame with named keys instead of attributes (i.e. ["name"] instead of ".name").

I am referring to code following:

augur/augur/utils.py

Line 93 in ab9cc74

if hasattr(val, "strain"):
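One way the suggested refactor might look, as a sketch only. The function name index_metadata and its parameters are hypothetical, and this is not the code in augur/utils.py: it picks the first identifier column that exists and uses key access throughout, so "name" always refers to the column rather than the Series attribute.

```python
import pandas as pd

def index_metadata(metadata, id_columns=("strain", "name")):
    """Return a dict of row records keyed by the first id column present.

    Hypothetical helper illustrating the review suggestion.
    """
    # Use the first identifier column that exists in the table.
    key = next((col for col in id_columns if col in metadata.columns), None)
    if key is None:
        raise ValueError(
            "metadata needs one of these columns: " + ", ".join(id_columns)
        )

    records = {}
    for _, row in metadata.iterrows():
        value = row[key]  # key access: the column value, not Series.name
        if value in records:
            raise ValueError("Duplicate {} '{}'".format(key, value))
        records[value] = row.to_dict()
    return records
```

With a single code path for both column names, the duplicate check cannot diverge between the strain and name cases.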

akshaysu12 · 2020-06-25T16:08:33Z

Yes, I agree a small refactor makes sense here. Thanks for the review!

huddlej · 2020-06-25T22:43:36Z

Thank you, @akshaysu12! Although a refactor would make the code slightly more readable, I would be happy to merge this as it is.

Your tests actually reveal several other larger issues with the read_metadata function including:

empty, invalid, or missing filenames should raise an error instead of returning an empty result set
an existing file that lacks a strain or name column should also raise an error instead of returning an empty set
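The two points above could be handled along these lines. This is a hedged sketch, not the planned fix: load_metadata_table is a hypothetical name, and both the exception types and the rule for choosing the separator from the file extension are assumptions.

```python
import os
import pandas as pd

def load_metadata_table(fname):
    """Load a metadata table, raising instead of returning an empty result.

    Hypothetical loader; exception types and separator rule are assumptions.
    """
    if not fname:
        raise ValueError("no metadata filename given")
    if not os.path.isfile(fname):
        raise FileNotFoundError("metadata file not found: {}".format(fname))

    # Assumed convention: tab-separated for .tsv files, comma-separated otherwise.
    sep = "\t" if fname.endswith(".tsv") else ","
    table = pd.read_csv(fname, sep=sep, skipinitialspace=True).fillna("")

    if "strain" not in table.columns and "name" not in table.columns:
        raise ValueError("metadata file needs a 'strain' or 'name' column")
    return table
```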

Unless you've already put work into the refactor and are about to push (or @rneher feels strongly about the refactor), I'd recommend holding off. I'll create separate issues for the read_metadata function that use your new unit tests as documentation of the unexpected behavior.

akshaysu12 · 2020-06-26T01:08:41Z

@huddlej I haven't started working on the refactor. Unless @rneher has any objections I'd be happy to see the code go through as is. Thank you for taking a look!

in read_metadata update pandas series object name to reflect name col…

cfa555b

…umn value from metadata

Merge branch 'master' into FixDuplicateName

ab9cc74

rneher reviewed Jun 25, 2020

huddlej merged commit 7afb691 into nextstrain:master Jun 26, 2020

huddlej mentioned this pull request Jun 26, 2020

read_metadata lacks standard error handling #576

Closed
