Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update region names in frequency weights to match metadata #43

Merged
merged 1 commit into from
Dec 17, 2019

Conversation

huddlej
Copy link
Contributor

@huddlej huddlej commented Dec 16, 2019

Resolves an issue with KDE frequency weighting where the regions used to weight
frequencies were represented differently in the metadata and the weights
JSON. For example, the metadata contains prettified regions like "North America"
while the weights JSON referred to this region as "north america". This issue
was revealed after introducing code to augur frequencies that eliminated all
unrepresented weight attributes from consideration. When no weights were
represented by the given tips (because of the casing difference shown above),
weighted frequency estimation failed with a ValueError.

[1] nextstrain/augur#420

Resolves an issue with KDE frequency weighting where the regions used to weight
frequencies were represented differently in the metadata and the weights
JSON. For example, the metadata contains prettified regions like "North America"
while the weights JSON referred to this region as "north america". This issue
was revealed after introducing code to augur frequencies that eliminated all
unrepresented weight attributes from consideration. When no weights were
represented by the given tips (because of the casing difference shown above),
weighted frequency estimation failed with a ValueError.

[1] nextstrain/augur#420
@huddlej huddlej requested a review from trvrb December 16, 2019 20:34
@rneher
Copy link
Member

rneher commented Dec 17, 2019

How does this interact with the region spec in the Snakefile:
https://github.com/nextstrain/seasonal-flu/blob/master/Snakefile_base#L8

@rneher
Copy link
Member

rneher commented Dec 17, 2019

there are a few more places where the old region names still stick around. These might cause some issues with the mutation frequency calculation. I can go through this this weekend.

@trvrb
Copy link
Member

trvrb commented Dec 17, 2019

@huddlej ---

Something else is going on here and I don't immediately know what that is. If you run the 12y test build from Travis CI you'll see that all the frequency vectors are zeros. This is also the case for full "live" builds.

I confirmed that augur frequencies --method kde is producing zeros even with updated config/frequency_weights_by_region.json where these entries match entries in the region column in metadata.

@trvrb
Copy link
Member

trvrb commented Dec 17, 2019

Actually, I'm pretty sure this is due to nextstrain/augur#420. If we go back to last week, https://nextstrain.org/flu/seasonal/h3n2/ha/3y was built with Augur v6.0.0 and frequencies are non-zero (although I understand that region weighting may be messed up).

@huddlej
Copy link
Contributor Author

huddlej commented Dec 17, 2019

Yes, I see the issue now. PR 420 introduced more stringent filtering on the weight attribute values (case-sensitive), but there is a line of code after that filtering that lowercases the weight attributes.

I've pushed a fix to augur in PR 426 that removes the lowercasing. I can confirm that this produces non-zero frequencies but it also looks like augur PR 420 didn't completely fix the original problem since these frequencies across regions don't sum to 100%:

image

This has been a problem since we originally implemented the weighted frequencies but it doesn't appear as an issue when there are enough samples per region in the tree across all timepoints.

@trvrb trvrb merged commit c04876a into master Dec 17, 2019
@trvrb trvrb deleted the fix-region-weights branch December 17, 2019 19:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants