ignore specific characters in all distance calculations #707

Merged
merged 2 commits into from
Mar 31, 2021

benjaminotter
Contributor

Description of proposed changes

Adds support for ignoring specific characters during distance calculation. The ignored characters are defined in the distance map as a list of characters:

{
    "default": 1,
    "ignored_characters": ["N", "-"],
    "map": {}
}

The specification of those characters in the distance map is optional, making it compatible with distance maps that don't specify any characters to ignore.

Related issue(s)

Fixes #693

Testing

Added doctests as suggested in #693:

Ignore specific characters defined in the distance map.

>>> node_a_sequences = {"gene": "ACTGG"}
>>> node_b_sequences = {"gene": "A--GN"}
>>> distance_map = {"default": 1, "ignored_characters":["-"], "map": {}}
>>> get_distance_between_nodes(node_a_sequences, node_b_sequences, distance_map)
1
>>> distance_map = {"default": 1, "ignored_characters":["-", "N"], "map": {}}
>>> get_distance_between_nodes(node_a_sequences, node_b_sequences, distance_map)
0
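
As a rough illustration of the behavior these doctests describe, here is a minimal sketch. It assumes every mismatch costs the "default" weight and does not consult the "map" entry at all. The name `simple_distance` is hypothetical; this is not augur's `get_distance_between_nodes`, only a simplified stand-in for the ignored-character logic.

```python
def simple_distance(node_a_sequences, node_b_sequences, distance_map):
    """Hypothetical, simplified stand-in for the real distance function.

    Counts mismatching sites per gene, weighting each by the map's
    "default" value and skipping any site where either sequence has
    an ignored character. Site-specific weights in "map" are ignored.
    """
    # The key is optional, so fall back to an empty collection.
    ignored = set(distance_map.get("ignored_characters", []))
    default_weight = distance_map["default"]

    distance = 0
    for gene, seq_a in node_a_sequences.items():
        seq_b = node_b_sequences[gene]
        for a, b in zip(seq_a, seq_b):
            if a != b and a not in ignored and b not in ignored:
                distance += default_weight
    return distance


node_a_sequences = {"gene": "ACTGG"}
node_b_sequences = {"gene": "A--GN"}

# Only "-" ignored: the G/N mismatch at the last site still counts.
print(simple_distance(
    node_a_sequences, node_b_sequences,
    {"default": 1, "ignored_characters": ["-"], "map": {}},
))

# Both "-" and "N" ignored: no site contributes to the distance.
print(simple_distance(
    node_a_sequences, node_b_sequences,
    {"default": 1, "ignored_characters": ["-", "N"], "map": {}},
))
```

With a distance map that has no "ignored_characters" key, the same sketch counts all three mismatching sites.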

…tance calculation. Ignored characters can be specified in the distance map as a list of characters.
codecov bot commented Mar 29, 2021

Codecov Report

Merging #707 (328c1ff) into master (5c12fd8) will increase coverage by 0.01%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #707      +/-   ##
==========================================
+ Coverage   30.89%   30.90%   +0.01%     
==========================================
  Files          41       41              
  Lines        5648     5649       +1     
  Branches     1365     1365              
==========================================
+ Hits         1745     1746       +1     
  Misses       3828     3828              
  Partials       75       75              
Impacted Files Coverage Δ
augur/distance.py 37.79% <100.00%> (+0.49%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Contributor

@huddlej huddlej left a comment

Thank you, @benjaminotter! This looks good to me. I made a couple comments about alternate ways to think about this implementation, but this is good to merge.

Comment on lines 262 to 265
if "ignored_characters" in distance_map:
    ignored_characters = distance_map["ignored_characters"]
else:
    ignored_characters = []

This approach is great and very readable, although you can also use the following idiom for slightly less verbose code:

ignored_characters = distance_map.get("ignored_characters", [])
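
To make the equivalence concrete, here is a small sketch comparing the two forms on a distance map with and without the optional key. The helper names `ignored_with_branch` and `ignored_with_get` are hypothetical and exist only for this comparison.

```python
def ignored_with_branch(distance_map):
    """Explicit membership test, as in the original lines 262 to 265."""
    if "ignored_characters" in distance_map:
        return distance_map["ignored_characters"]
    else:
        return []


def ignored_with_get(distance_map):
    """dict.get returns the second argument when the key is missing."""
    return distance_map.get("ignored_characters", [])


with_key = {"default": 1, "ignored_characters": ["N", "-"], "map": {}}
without_key = {"default": 1, "map": {}}

for distance_map in (with_key, without_key):
    # Both helpers agree whether or not the key is present.
    assert ignored_with_branch(distance_map) == ignored_with_get(distance_map)
```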

Comment on lines 271 to 272
if node_a_sequences[gene][site] not in ignored_characters and node_b_sequences[gene][site] not in ignored_characters:
    if node_a_sequences[gene][site] != node_b_sequences[gene][site]:

This looks good and works for me. One minor consideration is that the tests in the first line will almost always evaluate to True, while the test in the second line will be True much less often. In practice the order of these boolean checks won't make the distance calculations much slower, but you could consider an approach like this one, which relies on Python's short-circuit evaluation of logical operators to check for ignored characters only when there is a mismatch:

if (node_a_sequences[gene][site] != node_b_sequences[gene][site] and
    node_a_sequences[gene][site] not in ignored_characters and
    node_b_sequences[gene][site] not in ignored_characters):
        # Do some stuff
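
A small sketch of why the ordering matters: the hypothetical `CountingList` below records how many membership tests are performed, and the two hypothetical functions differ only in the order of the checks. This is an illustration on toy input, not a benchmark of augur itself.

```python
class CountingList(list):
    """Hypothetical list subclass that counts membership tests."""

    def __init__(self, *args):
        super().__init__(*args)
        self.checks = 0

    def __contains__(self, item):
        self.checks += 1
        return super().__contains__(item)


def membership_first(seq_a, seq_b, ignored):
    """Test for ignored characters at every site, then compare."""
    distance = 0
    for a, b in zip(seq_a, seq_b):
        if a not in ignored and b not in ignored:
            if a != b:
                distance += 1
    return distance


def mismatch_first(seq_a, seq_b, ignored):
    """Compare first; `and` short-circuits, so matching sites
    never reach the membership tests."""
    distance = 0
    for a, b in zip(seq_a, seq_b):
        if a != b and a not in ignored and b not in ignored:
            distance += 1
    return distance


ignored_a = CountingList(["-", "N"])
ignored_b = CountingList(["-", "N"])

result_a = membership_first("ACTGG", "A--GN", ignored_a)
result_b = mismatch_first("ACTGG", "A--GN", ignored_b)

# Same distance either way; only the number of membership tests differs.
print(result_a, result_b, ignored_a.checks, ignored_b.checks)
```

Both orderings return the same distance. The mismatch-first version simply skips the membership tests at sites where the two characters already match, which is the common case for closely related sequences.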

@huddlej huddlej marked this pull request as ready for review March 30, 2021 17:50
@benjaminotter
Contributor Author

@huddlej Thank you for the comments. I implemented both improvements in the latest commit.

@huddlej huddlej merged commit f0ff957 into master Mar 31, 2021
@huddlej huddlej deleted the ignore-characters-in-distance branch March 31, 2021 16:17
@huddlej huddlej added this to the Next release milestone Mar 31, 2021
Successfully merging this pull request may close these issues.

distance: Allow users to ignore specific characters in all distance calculations
2 participants