Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Count insertion/deletion events once in pairwise distances #698

Merged
merged 3 commits into from
Apr 13, 2021

Conversation

huddlej
Copy link
Contributor

@huddlej huddlej commented Mar 17, 2021

Description of proposed changes

Counts insertion/deletion (indel) events between sequences as single events instead of counting each gap character independently. Adds logic to support user-defined weights for gaps between specific characters, aggregating weights across all potential mismatches in a sequence-specific distance map, and aggregating weights across all sites in an indel event.

Related issue(s)

Fixes #692

Testing

Adds doctests for the new expected behavior.

@huddlej huddlej self-assigned this Mar 17, 2021
@huddlej
Copy link
Contributor Author

huddlej commented Mar 17, 2021

@benjaminotter No worries if you're too busy to check this out, but I thought you might be interested in this Augur module that calculates pairwise distances between sequences. If you find this interesting, you may want to take a shot at issue #693 which adds a similar kind of functionality to augur distance.

@codecov
Copy link

codecov bot commented Mar 17, 2021

Codecov Report

Merging #698 (e392e09) into master (c663c95) will increase coverage by 0.23%.
The diff coverage is 100.00%.

❗ Current head e392e09 differs from pull request most recent head 00fe015. Consider uploading reports for the commit 00fe015 to get more accurate results
Impacted file tree graph

@@            Coverage Diff             @@
##           master     #698      +/-   ##
==========================================
+ Coverage   31.28%   31.51%   +0.23%     
==========================================
  Files          41       41              
  Lines        5655     5674      +19     
  Branches     1367     1373       +6     
==========================================
+ Hits         1769     1788      +19     
  Misses       3812     3812              
  Partials       74       74              
Impacted Files Coverage Δ
augur/distance.py 45.89% <100.00%> (+8.09%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c663c95...00fe015. Read the comment docs.

@huddlej huddlej added this to the Next release milestone Mar 31, 2021
@huddlej
Copy link
Contributor Author

huddlej commented Apr 12, 2021

@benjaminotter Would you be up for trying to resolve the merge conflicts in this PR? The code this PR modifies overlaps with the code you added to support ignored characters. Tests already exist, so when you get the conflict resolved, you can do the usual ./run_tests.sh to confirm that everything works as expected.

@benjaminotter
Copy link
Contributor

@huddlej, the latest commit resolves the merge conflicts. I used the github interface to do so, which merged master into this branch. I then ran the tests locally and all tests passed.

Counts insertion/deletion (indel) events between sequences as single
events instead of counting each gap character independently. Adds logic
to support user-defined weights for gaps between specific characters,
aggregating weights across all potential mismatches in a
sequence-specific distance map, and aggregating weights across all sites
in an indel event.
Adds a test and code for the test to handle an edge case where the
default weight is greater than any of the sequence-specific weights and
the event is an indel.
@huddlej
Copy link
Contributor Author

huddlej commented Apr 13, 2021

Thank you for making these changes and testing them out, @benjaminotter! I'm sorry I forgot to describe the process we use to resolve these types of conflicts. We generally avoid merge commits from master back into a development branch in favor of rebasing the development branch onto master and resolving conflicts during that rebase. This is primarily a Bedford Lab convention (I suspect Richard does not care as much about this, but it's worth checking anyway) that we use to maintain a "clean" git history (so when you look at a graph of the git history, you generally only see development branches like this merged into master with few other merges).

Your edits here were very helpful in confirming the changes that needed to be made, though, and I've used them to perform a rebase onto master and reconfirm that tests passed locally. I'll merge this once it completes the CI.

If you are interested in learning more about how we use git rebase, the git book has a nice chapter on rewriting history and we can also have a quick video chat where I can demo an example of how I use rebase. I didn't get any training in this area of software development during my computer science degree, but it's been super helpful to learn and empowers you to do all kinds of quick git surgery that you'd normally avoid.

@huddlej huddlej merged commit 727b0e4 into master Apr 13, 2021
@huddlej huddlej deleted the count-indels-once branch April 13, 2021 19:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

distance: Count indels as single events
2 participants