Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

export multiple trees #1450

Open
wants to merge 3 commits into
base: master
Choose a base branch
from
Open

export multiple trees #1450

wants to merge 3 commits into from

Conversation

jameshadfield
Copy link
Member

See commit messages for details

  • Checks pass
  • If making user-facing changes, add a message in CHANGES.md summarizing the changes in this PR

Copy link
Member

@victorlin victorlin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not done reviewing but sending this in before going off to lunch

Comment on lines +23 to +24
Use augur refine to add internal node names
$ ${AUGUR} refine --tree small_tree_raw.nwk --output-tree small_tree.nwk &>/dev/null
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: do this in generate_newick directly. I just tested locally and something like this seems to work:

- echo "($(generate_newick $((n-1))),$prefix$n)"
+ echo "($prefix$n,$(generate_newick $((n-1)))):INTERNAL_$n"

This runs much faster than augur refine, especially for the large tree.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without running refine we fail validation due to the absence of any divergence values on the tree.

I just tested locally and something like this seems to work:

Do you have a patch which passes the tests?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would skip validation since validity should be tested elsewhere. Patch: d1c3997

from collections import Counter
dups = [name for name, count in Counter(node_names).items() if count>1]
raise AugurError(f"{len(dups)} node names occur multiple times in the tree: " +
", ".join([f"'{v}'" for v in dups[0:5]]) + ("..." if len(dups)>5 else ""))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use repr()/!r instead of manual single quotes

Suggested change
", ".join([f"'{v}'" for v in dups[0:5]]) + ("..." if len(dups)>5 else ""))
", ".join([f"{v!r}" for v in dups[0:5]]) + ("..." if len(dups)>5 else ""))

Also suggested elsewhere:

Comment on lines 905 to 900
required.add_argument('--tree','-t', metavar="newick", nargs='+', required=True, help="Phylogenetic tree(s), usually output from `augur refine`")
required.add_argument('--output', metavar="JSON", required=True, help="Output file (typically for visualisation in auspice)")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adopt #1445 here for consistency:

Suggested change
required.add_argument('--tree','-t', metavar="newick", nargs='+', required=True, help="Phylogenetic tree(s), usually output from `augur refine`")
required.add_argument('--output', metavar="JSON", required=True, help="Output file (typically for visualisation in auspice)")
required.add_argument('--tree','-t', metavar="newick", nargs='+', action='extend', required=True, help="Phylogenetic tree(s), usually output from `augur refine`")
required.add_argument('--output', metavar="JSON", required=True, help="Output file (typically for visualisation in auspice)")

Comment on lines 1235 to 1237
trees_json = [convert_tree_to_json_structure(tree.root, node_attrs, node_div(tree, node_attrs)) for tree in trees]
data_json["tree"] = trees_json[0] if len(trees_json)==1 else trees_json
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not always use a list of trees here, even if it's a one-item list? This would simplify internal logic by avoiding the need to check data type on every reference.

Having multiple trees under the singular key tree is also a bit confusing. Would it be too much to put this in a new plural key trees that's more self-descriptive?

Auspice could continue to support a single tree object for backwards compatibility.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In terms of the final dataset JSON, I'd prefer to keep the single-tree export as it currently is to remain consistent with output many users have become familiar with (and many people have written scripts to expect, as it's the case in nearly all situations).

For the multiple trees export it may help to conceptually think of these as subtrees. The schema has supported an array of (sub)trees since #851 and while trees may be nicer in some regards I don't think it's worth changing the schema here.

Why not always use a list of trees here, even if it's a one-item list? This would simplify internal logic by avoiding the need to check data type on every reference.

In terms of internal implementation, we could use an array which would reduce the conditional logic in the code, and then essentially run line 1236 at the very end to export a single tree object where applicable. I felt this changing of types was more confusing to developers, but 🤷

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha, makes sense to keep the final JSON as-is.

For internal implementation, I find it confusing when a variable takes on two types especially in an untyped language. Thinking in terms of subtrees, a more concrete suggestion would be to rename internal variables to always assume a list of subtrees instead of tree which is confusingly one or multiple trees. Then at the very end export dynamically based on if it's a singular tree or multiple subtrees. Does that make sense?

I've not worked much on this part of the code, so take my comments with a grain of salt!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can appreciate that maintaining multiple types without any type-hints (which AFAIK is the best python can do) requires developer overhead and won't scale. I chose this pattern as it matches the pattern already in the codebase:

augur/augur/export_v2.py

Lines 113 to 117 in 03ed408

if isinstance(od.get("tree"), list):
for subtree in od['tree']:
order_nodes(subtree)
elif isinstance(od.get("tree"), dict):
order_nodes(od['tree'])
but that doesn't prevent us from improving.

a more concrete suggestion would be to rename internal variables to always assume a list of subtrees

I think this would be more confusing if we continue to use data_json['tree'] (unsure if that's what you're suggesting), but I've implemented this using the new variable trees. Could you check out c40b821?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, sorry I should've tried this out first. That doesn't seem any better since we'd still have to keep conditions elsewhere for data_json['tree'] such as

if isinstance(tree, list):
for subtree in tree:
recurse(subtree)
else:
recurse(tree)

I'm fine with dropping c40b821 and keeping how you had it before.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

without any type-hints (which AFAIK is the best python can do)

We could use something like the following TypedDict for data_json, but enforcement (#1246) and usefulness across text editors is another story.

Tree = Dict[str, Any]

class DataJson(TypedDict):
    tree: Tree or List[Tree]
    meta: ...

Checking for duplicated node names and missing node names is in line
with the schema. Previously some calls to `export v2` would be ok with
missing node names (e.g. see the updated tests in `minify-output.t`) but
any usage with metadata would result in an uncaught error.
Multiple trees ("subtrees") have been available in Auspice since late
2021¹ and part of the associated schema since early 2022². Despite this
there was no way to produce such datasets within Augur itself, and
despite the schema changes the associated `augur validate` command was
never updated to allow them.

This commit adds multi-tree inputs to `augur export v2` as well as
allowing them to validate with our associated validation commands.

¹ <nextstrain/auspice#1442>
² <#851>
Always use a list of trees, even if it's a one-item list. This
simplifies internal logic by avoiding the need to check data type on
every reference. Requested via code review¹

¹ <#1450 (comment)>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants