Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

distance: Allow users to ignore specific characters in all distance calculations #693

Closed
huddlej opened this issue Mar 12, 2021 · 2 comments · Fixed by #707
Closed

distance: Allow users to ignore specific characters in all distance calculations #693

huddlej opened this issue Mar 12, 2021 · 2 comments · Fixed by #707
Assignees
Labels
enhancement New feature or request priority: moderate To be resolved after high priority issues

Comments

@huddlej
Copy link
Contributor

huddlej commented Mar 12, 2021

Context

When calculating distances between sequences, augur distance currently counts all mismatches toward the total distance based on weights in the distance map. However, we often want to ignore mismatches between ambiguous characters such as Ns for nucleotide sequences and Xs for amino acid sequences. While users can technically accomplish this behavior by defining a sequence-specific distance map, they would need to define such a map from each ambiguous character to all other possible characters at all sites.

Description

When calculating distances between nodes, users should be able to ignore specific characters without defining a full distance map for those characters.

Possible solutions

One potential solution is to provide a command line argument so users can specify which characters to ignore in distance calculations. This interface could look like:

augur distance \
    --ignore-characters X N -
    --tree ...

We probably want to avoid a situation where we need to maintain lists of ambiguous characters to ignore for different sequence types (even though this list is probably just a couple of characters). Allowing users to specify which characters to ignore makes the interface flexible enough to support other use cases in the future, too.

Another solution would be to define these ignored characters in the distance map like so (for a simple Hamming distance between amino acid or nucleotide sequences that ignores X's and N's):

{
    "name": "Hamming distance",
    "default": 1,
    "ignored_characters": [
        "N",
        "X"
    ],
    "map": {}
}

This implementation keeps important information about the distance calculations close to the map itself and allows different behaviors per map in the same distance command (through multiple inputs to the --map argument).

@huddlej huddlej added enhancement New feature or request priority: moderate To be resolved after high priority issues needs triage Needs triage by a Nextstrain team member labels Mar 12, 2021
@huddlej huddlej removed the needs triage Needs triage by a Nextstrain team member label Mar 26, 2021
@huddlej
Copy link
Contributor Author

huddlej commented Mar 26, 2021

After thinking about the possible solutions proposed above, I've decided that storing the ignored characters configuration in the distance maps is the most flexible solution. Here are some potential tests for this new functionality:

"""
Ignore specific characters defined in the distance map.

>>> node_a_sequences = {"gene": "ACTGG"}
>>> node_b_sequences = {"gene": "A--GN"}
>>> distance_map = {"default": 1, ignored_characters=["-"], "map": {}}
>>> get_distance_between_nodes(node_a_sequences, node_b_sequences, distance_map)
1
>>> distance_map = {"default": 1, ignored_characters=["-", "N"], "map": {}}
>>> get_distance_between_nodes(node_a_sequences, node_b_sequences, distance_map)
0
"""

@benjaminotter, I've assigned this issue to you, as the first of a couple small but important issues we'd like to resolve. Can you add these doctests to the get_distance_between_nodes in augur/distance.py, confirm that they fail by running ./run_tests.sh -k get_distance_between_nodes, then write the Python code necessary to make these tests pass? See the related PR that adds support for indels in the distance command for an example of how we've done this before.

In case you didn't have to do this when you added the index command, you'll need to setup a dev environment for Augur. You can do this with Conda like so:

# Create a Conda environment for Augur development, installing third-party tools.
conda create -n augur -c conda-forge -c bioconda augur
conda activate augur

# Clone the repo.
git clone https://github.com/nextstrain/augur.git
cd augur

# Create your branch to work on this issue.
git checkout -b ignore-characters-in-distance

# Install the development version of the repo.
python3 -m pip install -e .[dev]

# Run just the tests you're working on (test-driven development mode).
# These should fail initially and pass after you add the correct code.
./run_tests.sh -k get_distance_between_nodes

# Run all tests, to make sure there haven't been any regressions elsewhere.
./run_tests.sh

Let me know if you have any questions about this issue, the suggested solution, or the dev environment. Thanks!

@benjaminotter
Copy link
Contributor

Thank you, @huddlej, for the detailed instructions on how to approach this issue.
I'll have a look at this and let you know if there are any questions regarding the process.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request priority: moderate To be resolved after high priority issues
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants