matUtils updates: new subcommand "fix"; improvements to MAT::Tree::move_node and mask --move-nodes #357

Merged Nov 2, 2023 (3 commits)

AngieHinrichs
Contributor

This bundles several improvements to matUtils. Let me know if you would prefer for them to be pulled out into separate PRs.

  1. I added a new subcommand fix to find and fix a recurring pattern that causes trouble for lineage designation: often there is a node whose mutation is simply a reversion of its grandparent's mutation, for example the final node in a path like this:
    ... > A1C > G2T > C3A > T2G
    In that case it would be preferable to move that final node to ... > A1C > C3A, i.e. to make it a child of its great-grandparent carrying only its parent's mutation. As long as the grandparent, parent and final node each have exactly one mutation, the move is parsimony-preserving: the tree then shows two independent occurrences of C3A on the A1C branch, rather than one occurrence of C3A followed immediately by a reversion of the previous mutation in the path. This is motivated by the Pango lineage team's experience with SARS-CoV-2, in which advantageous Spike mutations have occurred multiple times and caused an increase in transmissions.

  2. I found that MAT::Tree::collapse_tree was moving nodes without first checking whether the new parent node already had a child with the same mutation(s) as the incoming node. This could leave the new parent with multiple children carrying the same mutation(s), which is an incorrect structure and makes it hard to annotate lineages correctly. So I updated MAT::Tree::move_node to first search the new parent node's children for the same mutations as the moved node. Both the existing child and the moved node could be either an internal node or a leaf node (sample); all combinations are handled. When both nodes are internal nodes, the children of the moved node become children of the new parent's existing child, and the moved node is then removed. Since the moved node might be removed, MAT::Tree::collapse_tree could no longer use BFS order, so I changed it to start with the deepest nodes and work back towards the root.

  3. mask --move-nodes used to require that the new parent have exactly the same set of mutations as the original parent. I added support for a new parent with a strict subset of the original parent's mutations, adding the original parent's extra mutations to the moved node. [Then I could manually make moves like the new fix subcommand's moves. I implemented this before fix, for testing.] I also found that mask --mask-mutations was doing more copying than necessary and sped it up a bit.
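The reversion pattern from item 1 can be illustrated with a small toy model. This is only a sketch in Python, not the matUtils C++ code: the `Node` class, `parse`, `is_reversion` and `fix_grandparent_reversions` are hypothetical names made up for illustration, and mutations are plain strings such as `G2T`. It also does not check whether the great-grandparent already has a child with the same mutation.

```python
class Node:
    """Toy tree node; a stand-in for illustration, not the MAT::Node class."""

    def __init__(self, mutations, parent=None):
        self.mutations = list(mutations)
        self.children = []
        self.parent = parent
        if parent is not None:
            parent.children.append(self)


def parse(mutation):
    """Split a mutation string like 'G2T' into (ref, position, alt)."""
    return mutation[0], int(mutation[1:-1]), mutation[-1]


def is_reversion(mutation, earlier):
    """True if `mutation` undoes `earlier`: same site, ref and alt swapped."""
    ref, pos, alt = parse(mutation)
    e_ref, e_pos, e_alt = parse(earlier)
    return pos == e_pos and ref == e_alt and alt == e_ref


def fix_grandparent_reversions(root):
    """Re-parent every node matching the pattern; return how many were moved."""
    # Snapshot the nodes first so re-parenting cannot disturb the traversal.
    nodes, stack = [], [root]
    while stack:
        current = stack.pop()
        nodes.append(current)
        stack.extend(current.children)

    moved = 0
    for node in nodes:
        parent = node.parent
        grandparent = parent.parent if parent is not None else None
        great = grandparent.parent if grandparent is not None else None
        if great is None:
            continue
        # The move is only parsimony-preserving when all three nodes
        # carry exactly one mutation each.
        if not all(len(n.mutations) == 1 for n in (node, parent, grandparent)):
            continue
        if not is_reversion(node.mutations[0], grandparent.mutations[0]):
            continue
        # Become a child of the great-grandparent with the parent's mutation.
        parent.children.remove(node)
        node.mutations = list(parent.mutations)
        node.parent = great
        great.children.append(node)
        moved += 1
    return moved


# The path from the description: ... > A1C > G2T > C3A > T2G
root = Node([])
a1c = Node(["A1C"], root)
g2t = Node(["G2T"], a1c)
c3a = Node(["C3A"], g2t)
t2g = Node(["T2G"], c3a)
moved = fix_grandparent_reversions(root)
```

After the call, the final node hangs directly off the A1C node and carries only C3A, so the tree holds ... > A1C > G2T > C3A and ... > A1C > C3A side by side.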

…rent) that already has the same mutation(s) as the node to be moved. If a child with the same mutations is found then merge the moved node with the existing child instead of adding it as a new separate child. Handle all combinations of leaf or internal node type for existing child and moved node. Tree::collapse_tree used a breadth-first order for nodes which caused trouble when child nodes were removed due to merging during moves, and later visited in original BFS order. I changed it to recursively descend the tree and process the deepest nodes first.
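A rough Python sketch of the merge-on-move idea follows. It is a toy model, not the MAT::Tree code: `Node`, `move_node` and `deepest_first` here are hypothetical illustrations. Only the internal/internal case is modelled, since that is the one the description spells out; the leaf cases are deliberately left unimplemented rather than guessed at.

```python
class Node:
    """Toy tree node; a stand-in for illustration, not the MAT::Node class."""

    def __init__(self, name, mutations=(), parent=None):
        self.name = name
        self.mutations = frozenset(mutations)
        self.children = []
        self.parent = parent
        if parent is not None:
            parent.children.append(self)


def move_node(node, new_parent):
    """Move `node` under `new_parent`; return the node that survives.

    If `new_parent` already has a child with the same mutations, the moved
    node's children are handed to that child and the moved node is dropped.
    """
    twin = next(
        (c for c in new_parent.children
         if c is not node and c.mutations == node.mutations),
        None,
    )
    if twin is not None and (not node.children or not twin.children):
        # Assumption: leaf handling is outside the scope of this sketch.
        raise NotImplementedError("leaf merges are not modelled here")

    if node.parent is not None:
        node.parent.children.remove(node)
        node.parent = None

    if twin is None:
        node.parent = new_parent
        new_parent.children.append(node)
        return node

    for child in list(node.children):
        child.parent = twin
        twin.children.append(child)
    node.children.clear()
    return twin


def deepest_first(node):
    """Yield nodes so that every descendant comes before its ancestor."""
    for child in list(node.children):
        if child.parent is node:  # skip children merged away in the meantime
            yield from deepest_first(child)
    yield node


root = Node("root")
a = Node("a", {"A1C"}, root)
x = Node("x", {"C3A"}, a)   # existing child of the new parent
s1 = Node("s1", {"G5T"}, x)
b = Node("b", {"G2T"}, root)
y = Node("y", {"C3A"}, b)   # same mutation as x; will be merged into x
s2 = Node("s2", {"T7C"}, y)

survivor = move_node(y, a)
order = [n.name for n in deepest_first(root)]
```

Here `survivor` is `x`, which now owns both `s1` and `s2`, and `y` no longer appears anywhere in the tree. A node list computed breadth-first before the move would still contain `y`, which is the kind of stale entry that the deepest-first order avoids.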
…for each masked site, instead of copying all mutations that aren't masked, which is almost all mutations in the branch, make a usually-empty list of only the mutations that must be masked and erase those as needed.

2. In moveNodes (--move-nodes), instead of erroring out when the new parent's mutations are not exactly the same as the old parent's, also support the case of a new parent having a strict subset of the old parent's mutations. In that case, add the mutations that the new parent doesn't have to the node and move the node to become a child of the new parent.
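The subset rule for --move-nodes comes down to a set difference. The Python sketch below illustrates the rule as described, not the moveNodes implementation: `moved_node_mutations` is a hypothetical helper, mutations are plain strings, and the ordering of the result is an arbitrary choice. It also ignores the case where an inherited mutation and one of the node's own mutations hit the same site.

```python
def moved_node_mutations(node_muts, old_parent_muts, new_parent_muts):
    """Return the mutations the node should carry under its new parent.

    The new parent must have the same mutations as the old parent, or a
    strict subset of them; anything else is rejected.
    """
    old, new = set(old_parent_muts), set(new_parent_muts)
    if not new <= old:
        raise ValueError("new parent has mutations the old parent lacks")
    # Mutations the node used to inherit from the old parent but would
    # lose under the new parent are pushed down onto the node itself.
    extra = [m for m in old_parent_muts if m not in new]
    return extra + list(node_muts)


subset_case = moved_node_mutations(["T9A"], ["G2T", "C3A"], ["G2T"])
equal_case = moved_node_mutations(["T9A"], ["G2T", "C3A"], ["C3A", "G2T"])
```

In the subset case the node picks up C3A in addition to its own T9A; when both parents have the same mutations, the node is left unchanged.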
…ersion of their grandparent's only mutation, a pattern that causes trouble for several Pango lineages, and moves those nodes to become children of their great-grandparents, having only the parent's mutation.
@yatisht yatisht merged commit b96680c into yatisht:master Nov 2, 2023
3 checks passed