Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

in read_metadata update pandas series object name to reflect name col… #564

Merged
merged 2 commits into from
Jun 26, 2020

Conversation

akshaysu12
Copy link
Contributor

…umn value from metadata

Description of proposed changes

Util function read_metadata should throw a value error if the column titled "name" has duplicate values.

The pandas series object was using it's own default name attribute to check for duplicates. This name was being set to the index of the row instead of the value under the name column.

Related issue(s)

Fixes #563 - Metadata file with duplicate values in "name" column does not throw an error.

Testing

What steps should be taken to test the changes you've proposed?

  1. Create a metadata file with column titled name (instead of a column titled strain)

  2. Add rows to the metadata file such that the name column has duplicate values (rows 0,1 have same value for name). I modified data/metadata.tsv in the zika tutorial.

  3. Run augur filter using the metadata file created.
    ex: augur filter --sequences data/sequences.fasta --metadata data/metadata_name.tsv --exclude config/dropped_strains.txt --output results/filtered.fasta --group-by country year month --sequences-per-group 20

  4. See that error is now thrown.

If you added or changed behavior in the codebase, did you update the tests, or do you need help with this?

Unit tests for read_metadata have been added/expanded in test_utils.py

Thank you for contributing to Nextstrain!

@codecov
Copy link

codecov bot commented Jun 16, 2020

Codecov Report

Merging #564 into master will increase coverage by 0.24%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #564      +/-   ##
==========================================
+ Coverage   23.70%   23.95%   +0.24%     
==========================================
  Files          32       32              
  Lines        5159     5160       +1     
  Branches     1300     1300              
==========================================
+ Hits         1223     1236      +13     
+ Misses       3885     3876       -9     
+ Partials       51       48       -3     
Impacted Files Coverage Δ
augur/utils.py 30.97% <100.00%> (+2.89%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0ead368...ab9cc74. Read the comment docs.

Copy link
Member

@rneher rneher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, @akshaysu12!

I am wondering whether a better solution would be to loop through keys ["strain", "name"], deduplicate the code block, and do all queries to the data frame with named keys instead of attributes (i.e. ["name"] instead of ".name").

I am referring to code following:

if hasattr(val, "strain"):

@akshaysu12
Copy link
Contributor Author

Yes, I agree a small refactor makes sense here. Thanks for the review!

@huddlej
Copy link
Contributor

huddlej commented Jun 25, 2020

Thank you, @akshaysu12! Although a refactor would make the code slightly more readable, I would be happy to merge this as it is.

Your tests actually reveal several other larger issues with the read_metadata function including:

  • empty, invalid, or missing filenames should raise an error and not return an empty result set
  • an existing file that doesn't have a strain or name column should also return an error instead of an empty set

Unless you've already put work into the refactor and are about to push (or @rneher feels strongly about the refactor), I'd recommend holding off. I'll create separate issues for the read_metadata function that use your new unit tests as documentation of the unexpected behavior.

@akshaysu12
Copy link
Contributor Author

@huddlej I haven't started working on the refactor. Unless @rneher has any objections I'd be happy to see the code go through as is. Thank you for taking a look!

@huddlej huddlej merged commit 7afb691 into nextstrain:master Jun 26, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Metadata file with duplicate values in "name" column does not throw an error.
4 participants