feat: allow reconstruction of amino acid sequences #1258

rneher · 2023-07-22T12:46:53Z

augur ancestral so far only dealt with nucleotide sequences, while translation were done codon by codon from the nucleotide sequences in later steps. If translations for tips are available, the ancestral amino acid sequences can be determined analogously via ancestral reconstruction. This is implemented in this commit.

For each node, the node data structure will contain a muts and a aa_muts field. The latter is a dict with gene/cds names as keys and <ancestral><pos><derived> mutations (same as in augur translate). In addition, an annotation and reference sequences are derived stored in the node-data-json.

Checklist

Add a message in CHANGES.md summarizing the changes in this PR that are end user focused. Keep headers and formatting consistent with the rest of the file.

rneher · 2023-07-22T14:14:29Z

Pathogen workflows are currently failing because the node data jsons this produces doesn't contain the reconstructed sequences. I'll re-enable that.

rneher · 2023-07-22T14:26:46Z

augur ancestral has a whole bunch of conditions and flags that are passed around and I am not sure anybody fully see through what they do and why there were put in place. There is is_vcf which is fairly obvious, but the vcf functionality is not used much and not well tested. Then there is the infer_ambiguous which used to determine how the mask is set. But I don't quite understand the rational.

rneher · 2023-07-22T14:27:53Z

tests now fail with

@@ -21,9 +21,9 @@
       "nuc": {
         "end": 10769,
         "start": 1,
-        "strand": "+"
+        "strand": "+",
+        "type": "source"
       }
-    },

which I consider a problem with the tests...

victorlin · 2023-07-24T21:43:43Z

augur/ancestral.py

+    anc_seqs['annotations'] = {'nuc': {'start': 1, 'end': len(anc_seqs['reference']['nuc']),
+                                       'strand': '+', 'type': 'source'}}


What's the purpose of type: source?

This change is the reason for the failing tests. The test can be easily changed to pass again, but I want to understand the reasoning first.

source is the name of the sequence with respect to which annotations in genbank files are made. If you look at this file for example you see a source field at the beginning of the annotation block. In GFF these are called landmarks. In augur translate we assign this also for the feature type source:

https://github.com/nextstrain/augur/blob/master/augur/translate.py#L395-L400

I added this here just in analogy to translate, I don't think auspice requires it.

While Auspice doesn't use this, the "type" key (with string value) is specified in our node-data JSON annotations schema and exported (auspice) JSON schema so tests shouldn't fail.

jameshadfield · 2023-07-24T22:23:32Z

augur/ancestral.py

+    if not is_vcf and args.genes:
+        from .utils import load_features
+        ## load features; only requested features if genes given
+        features = load_features(args.annotation, args.genes)


I'm not sure whether it's fair to lump it into this PR but our annotations parsing needs a lot of work.

Testing this PR (for HepB) with NC_003977.gb (straight from GenBank):

File "***/augur/augur/ancestral.py", line 272, in run feat = features[gene] KeyError: 'X'

Testing with the nextclade GFF for the same genome:

Couldn't find gene X in GFF or GenBank file Couldn't find gene pre-capsid in GFF or GenBank file Couldn't find gene capsid in GFF or GenBank file Couldn't find gene pol in GFF or GenBank file Couldn't find gene envL in GFF or GenBank file Couldn't find gene envM in GFF or GenBank file Couldn't find gene envS in GFF or GenBank file Read in 0 features from reference sequence file Processing gene: X Traceback (most recent call last): File "/Users/enigma/projects/nextstrain/augur/augur/__init__.py", line 66, in run return args.__command__.run(args) File "/Users/enigma/projects/nextstrain/augur/augur/ancestral.py", line 272, in run feat = features[gene] KeyError: 'X'

The following GFF file works. However as this is intended to be run in conjunction with Nextclade we should really sort the parsing out!

##gff-version 3 #!gff-spec-version 1.21 #!processor NCBI annotwriter ##sequence-region NC_003977.2 1 3182 ##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=10407 NC_003977.2 RefSeq region 1 3182 . + . ID=NC_003977.2:1..3182;Dbxref=taxon:10407;Is_circular=true;gbkey=Src;genome=genomic;mol_type=genomic DNA;strain=ayw NC_003977.2 RefSeq gene 1376 1840 . + 0 gene=X; NC_003977.2 RefSeq gene 1816 2454 . + 0 gene=pre-capsid; NC_003977.2 RefSeq gene 1903 2454 . + 0 gene=capsid; NC_003977.2 RefSeq gene 2309 4807 . + 0 gene=pol; NC_003977.2 RefSeq gene 2850 4019 . + 0 gene=envL; NC_003977.2 RefSeq gene 3174 4019 . + 0 gene=envM; NC_003977.2 RefSeq gene 157 837 . + 0 gene=envS;

yeah, this is a bit tricky. Our parsing is first limiting itself to source and gene, and then ignores "subfeatures". The GFF parser parses CDS that are children of a gene into feat.subfeatures. So yes, this is something we need to tackle.

What we want load_features to do is to return a source/landmark feature and then all the bits that need translating, probably best as type='CDS' rather than type='gene'.

I added a simpler parser for the nextclade json here:
#1263

huddlej

Thank you for implementing this, @rneher! Below are some initial thoughts from playing with the new interface. I have only tested the new functionality in an ad hoc way from the command line. I'll spend a little more time tomorrow playing with its integration into a workflow. I can mockup functional tests once we settle the interface.

Big picture

I notice that the interface here allows users to reconstruct nucleotide and amino acid sequences in the same command. I had originally been thinking of this new functionality as a drop-in replacement for the standalone scripts like translations_aamut.py so our workflows would continue to have an nt_muts.json and an aa_muts.json. The interface in this PR makes more sense, I think, although we’ll have to be careful in our documentation and tutorials to clarify when you need two rules with augur ancestral and translate (e.g., Zika) vs. when you need one rule with augur ancestral. When we only use one call to ancestral, should we standardize the output filename to just muts.json? I guess we’ll eventually have most pathogens using Nextalign, so maybe there won’t be much confusion in that future.

Interface

For the command line options, I wish the --translations argument accepted a list of specific files instead of a template string. Even it is a bit of extra typing for users, it is clearer to list the specific files and ties into workflow managers better as a dynamic list of inputs. I guess the list becomes a pain for SARS-CoV-2 and other pathogens with a lot of genes, but only when manually running commands from the terminal.

I also wish that the new interface including an argument for --output-translations akin to augur translate’s --alignment-output argument, so we could have reconstructed translations for internal nodes in FASTA format. We use those outputs in multiple places in the seasonal flu workflow, for example. The corresponding argument in translate accepts a template string like 'my_alignment_%GENE.fasta', but I slightly prefer the way the ncov workflow explicitly lists the output files with internal node translations even though the explicit_translation script creates those outputs implicitly. This approach allows you to explicitly reference output FASTA files from other rules. An example command from the seasonal flu repo directory for H3N2 HA would be:

augur ancestral \
  --tree tree.nwk \
  --alignment aligned.fasta \
  --annotation ../../../config/h3n2/ha/genemap.gff \
  --genes SigPep HA1 HA2 \
  --translations nextalign/masked.gene.SigPep.fasta \
                 nextalign/masked.gene.HA1.fasta \
                 nextalign/masked.gene.HA2.fasta \
  --output-node-data muts.json \
  --output-sequences nt_muts.fasta \ # Not currently used in flu workflow
  --output-translations aa_muts.SigPep.fasta \
                        aa_muts.HA1.fasta \
                        aa_muts.HA2.fasta

The new ability to annotate mutations on the branch leading to the inferred root of the tree using the --root-sequence argument is really cool! Does this argument refer to “root” as in the root of the tree? It seems more like the “reference” of the alignment which could also be the root of the tree but doesn’t have to be.

Minor notes

I’d like to add some functional tests for the new interface, once we agree on what that should be (based on discussion on the above).

We’ll need to update the docstring of the ancestral.py module to refer to amino acid sequences in addition to nucleotide sequences.

As noted above, arguments need grouping into different sections and checks for mutually required arguments.

As a very minor note, I wish that internal functions run_ancestral and collect_mutations_and_sequences didn’t need to pass around the state of the input being VCF or not (is_vcf). That the input is VCF feels like a technical detail for the command line interface that affects downstream steps in different ways that need their own flags. For example, I found the original interface for collect_mutations_and_sequences with the mask_ambiguous argument much clearer to read. The calling function(s) could determine whether being a VCF means masking ambiguous characters and that would decouple the internal functions from the details of the input file format. The run_ancestral function could accept an argument for mask_ambiguous that it passes on to the lower level function and it could also accept a boolean argument like use_root_as_reference that would fill the place of is_vcf in the ref=root_sequence if is_vcf else None expression. (Whew! The least important note required the most text to describe…)

rneher · 2023-07-26T07:31:24Z

Thanks @huddlej . A few follow up comments:

nucleotide and amino acid sequences in the same command

my idea here was to pick up the nextalign output where both alignments are available.

I wish the --translations argument accepted a list of specific files

your wish might be granted :) easy change. I had just copied the corresponding (output) parts in translate.py. But we then have to rely on --genes and --translations to be in the same order AND match whatever is in the annotation. In consistencies can be spotted by comparing the length of the annotation with the length of alignment.
UPDATE: changed to be list of files.

--output-translations ...

easy to add -- mostly a question about interface. I was thinking of removing the full ancestral sequences from the node data json.

root vs reference sequence

I agree that "reference" is more natural in this context, but I worried about it being ubiquitously used across workflows for various purposes, most of which don't have anything to do with "mutations". It has, on the other hand, exactly the same purpose as root-sequence.json side car. we could name it something like --external-root-sequence?

is_vcf

the previous version was conflating is_vcf with full_sequences for reasons unclear to me. We typically want to avoid storing full sequences when dealing with vcf data, but I didn't see why they should be synonymous.

huddlej · 2023-07-27T23:46:07Z

I cleaned up functional tests for augur ancestral (and fix them so they pass with the new output format for annotations) and then added one for the amino acid sequence reconstruction. In doing so for the Zika example data, I realized how painful it was to list 12 genes with 12 input files even without listing 12 output files for the reconstructed translations. Based on that experience, I'm more sold on the template string input/output approach that you had originally implemented, @rneher!

I also found a new (to me) error with TreeTime when I tried to run augur ancestral with just the 2K Zika gene's amino acid sequences which I've included below. My guess is that the issue is that all of the input amino acid sequences are identical. I'll create a separate TreeTime issue for this.

augur ancestral is using TreeTime version 0.11.0
Read in 2 features from reference sequence file
Processing gene: 2K
/Users/jhuddlesfredhutch.org/projects/nextstrain/treetime/treetime/gtr.py:554: RuntimeWarning: divide by zero encountered in divide
  W_ij = (nij+nij.T+2*pc_mat)/mu/(np.outer(pi,Ti) + np.outer(Ti,pi)
/Users/jhuddlesfredhutch.org/projects/nextstrain/treetime/treetime/gtr.py:554: RuntimeWarning: invalid value encountered in divide
  W_ij = (nij+nij.T+2*pc_mat)/mu/(np.outer(pi,Ti) + np.outer(Ti,pi)
/Users/jhuddlesfredhutch.org/projects/nextstrain/treetime/treetime/gtr.py:10: RuntimeWarning: invalid value encountered in scalar subtract
  return (np.einsum('i,ij,j', pi, W, pi) - np.sum(pi*W[:,gap_index])*pi[gap_index])/(1-pi[gap_index])
Traceback (most recent call last):
  File "/Users/jhuddlesfredhutch.org/projects/nextstrain/augur/augur/__init__.py", line 66, in run
    return args.__command__.run(args)
  File "/Users/jhuddlesfredhutch.org/projects/nextstrain/augur/augur/ancestral.py", line 274, in run
    aa_result = run_ancestral(T, fname, root_sequence=root_seq, is_vcf=is_vcf, fill_overhangs=not args.keep_overhangs,
  File "/Users/jhuddlesfredhutch.org/projects/nextstrain/augur/augur/ancestral.py", line 131, in run_ancestral
    tt = ancestral_sequence_inference(tree=T, aln=aln, ref=root_sequence if is_vcf else None, marginal=marginal,
  File "/Users/jhuddlesfredhutch.org/projects/nextstrain/augur/augur/ancestral.py", line 70, in ancestral_sequence_inference
    tt.infer_ancestral_sequences(infer_gtr=infer_gtr, marginal=bool_marginal,
  File "/Users/jhuddlesfredhutch.org/projects/nextstrain/treetime/treetime/treeanc.py", line 526, in infer_ancestral_sequences
    self.infer_gtr(marginal=marginal, **kwargs)
  File "/Users/jhuddlesfredhutch.org/projects/nextstrain/treetime/treetime/treeanc.py", line 1497, in infer_gtr
    self._gtr = GTR.infer(n_ij, T_ia.sum(axis=-1), root_state, fixed_pi=fixed_pi, pc=pc,
  File "/Users/jhuddlesfredhutch.org/projects/nextstrain/treetime/treetime/gtr.py", line 550, in infer
    while LA.norm(pi_old-pi) > dp and count < Nit:
  File "/Users/jhuddlesfredhutch.org/miniconda3/envs/augur/lib/python3.10/site-packages/scipy/linalg/_misc.py", line 146, in norm
    a = np.asarray_chkfinite(a)
  File "/Users/jhuddlesfredhutch.org/miniconda3/envs/augur/lib/python3.10/site-packages/numpy/lib/function_base.py", line 628, in asarray_chkfinite
    raise ValueError(
ValueError: array must not contain infs or NaNs

rneher · 2023-07-28T07:08:14Z

thanks! We can revert back to the input template -- I don't have strong feelings. The check that the length of the annotation is consistent with the file is still useful, I think.

the TreeTime error is indeed from the identical sequences. There is nothing to estimate the GTR model from. I thought this had been regularized by pseudo counts.

huddlej · 2023-07-31T16:41:07Z

Thanks for reverting that, @rneher. I will add the option to write out translation FASTAs and update the functional test.

corneliusroemer

Tested the new annotation and root-sequence features and they worked for me - save the documentation issue with single vs double % in the template definition

CHANGES.md

augur/ancestral.py

corneliusroemer · 2023-08-02T12:16:57Z

augur/ancestral.py

+    parser.add_argument('--genes', nargs='+', help="genes to translate (list or file containing list)")
+    parser.add_argument('--translations', type=str, help="translated alignments for each CDS/Gene. "
+                           "Currently only supported for fasta-input. Specify the file name via a "
+                           "template like 'my_alignment_%%GENE.fasta' where %%GENE will be replaced "


@rneher When I tried to use the template with double %% I got an error:

ERROR: SequenceData: loading alignment failed... builds/21L/translations/aligned.gene.%ORF1a.fasta

The correct template string might currently require only a single percent sign - not sure whether this is an error in the documentation or the implementation.

When I used a single % things worked fine.

percent signs need to be escaped with another percent sign. augur ancestral -h shows it only with a single percent sign as intended.

I see, I've never used Python pre f-strings which make use of % for formatting unnecessary.

Adding an f or r before the string would allow us to not need to escape. But this is a matter of taste and not blocking.

corneliusroemer

Super useful for making Nextclade datasets, am already using it through pip install -e . until we've got this released.

codecov · 2023-08-03T20:43:28Z

Codecov Report

Patch coverage: 86.17% and project coverage change: +0.15% 🎉

Comparison is base (c4d07c7) 69.69% compared to head (7cfbd9b) 69.85%.

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #1258      +/-   ##
==========================================
+ Coverage   69.69%   69.85%   +0.15%     
==========================================
  Files          67       67              
  Lines        7099     7155      +56     
  Branches     1726     1740      +14     
==========================================
+ Hits         4948     4998      +50     
- Misses       1846     1851       +5     
- Partials      305      306       +1

Files Changed	Coverage Δ
augur/ancestral.py	`75.14% <86.17%> (+7.00%)`	⬆️

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

victorlin · 2023-08-08T15:14:50Z

I split 9bde475...73e3cdd into #1275 to fix CI for other PRs. I force-pushed to rebase on the merged changes and drop those commits from here.

augur ancestral so far only dealt with nucleotide sequences, while translation were done codon by codon from the nucleotide sequences in later steps. If translations for tips are available, the ancestral amino acid sequences can be determined analogously via ancestral reconstruction. This is implemented in this commit. For each node, the node data structure will contain a `muts` and a `aa_muts` field. The latter is a dict with gene/cds names as keys and `<ancestral><pos><derived>` mutations (same as in augur translate). In addition, an annotation and reference sequences are derived stored in the node-data-json.

…notation and translation

Reorganize functional tests for augur ancestral to follow the standard pattern with "cram" and "data" subdirectories and separate cram files for individual functional tests. Paves the way for adding another functional test for amino acid sequence reconstruction.

Adds amino acid sequences for the corresponding nucleotide alignment in the ancestral data directory (created from augur translate but mimicking output from Nextalign) and adds a functional test for the new interface that allows reconstruction of amino acid sequences for internal nodes from the given tip sequences per gene.

This should fail.

Adds logic to save ancestral amino acid sequences to one FASTA file per gene and updates the functional test to reflect the expected number of internal nodes in these outputs.

Fixes a bug with VCF inputs for augur ancestral where the reference sequence is already a string, so it could not be converted to a string from a BioPython sequence as the existing code expected. This commit converts the BioPython Seq instance into a string after loading the root sequence, so we always work with a string. It also updates the docstring for `ancestral_sequence_inference` to include the "ref" argument.

Adds a functional test for the root sequence input of the new ancestral interface and fixes a bug revealed by this test where the reference sequence needs to be converted to a BioPython SeqIO.Seq instance prior to calling the feature extraction method on it. The feature extraction method accepts Seq and str inputs, but it will return an output of the same type. When we previously passed a string, the returned string didn't have the expected `translate` method.

Reorganizes arguments for ancestral subcommand into logical argument groups and adds validation and a functional test for new amino acid sequence inputs that must all be provided when any are.

Ignores the path to GFF/GenBank annotations in the JSON output for an augur ancestral functional test. This path will vary across platforms based on the working directory used.

Include link to PR, PR authors, and note about new translations output.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: allow reconstruction of amino acid sequences #1258

feat: allow reconstruction of amino acid sequences #1258

rneher commented Jul 22, 2023 •

edited

Loading

rneher commented Jul 22, 2023

rneher commented Jul 22, 2023

rneher commented Jul 22, 2023

victorlin Jul 24, 2023

rneher Jul 25, 2023

jameshadfield Jul 25, 2023

jameshadfield Jul 24, 2023 •

edited

Loading

jameshadfield Jul 24, 2023

rneher Jul 25, 2023

rneher Jul 26, 2023

huddlej left a comment

rneher commented Jul 26, 2023 •

edited

Loading

huddlej commented Jul 27, 2023

rneher commented Jul 28, 2023

huddlej commented Jul 31, 2023

corneliusroemer left a comment

corneliusroemer Aug 2, 2023

rneher Aug 3, 2023 •

edited

Loading

corneliusroemer Aug 3, 2023

corneliusroemer left a comment

codecov bot commented Aug 3, 2023 •

edited

Loading

victorlin commented Aug 8, 2023

huddlej commented Aug 11, 2023

		anc_seqs['annotations'] = {'nuc': {'start': 1, 'end': len(anc_seqs['reference']['nuc']),
		'strand': '+', 'type': 'source'}}

feat: allow reconstruction of amino acid sequences #1258

feat: allow reconstruction of amino acid sequences #1258

Conversation

rneher commented Jul 22, 2023 • edited Loading

Checklist

rneher commented Jul 22, 2023

rneher commented Jul 22, 2023

rneher commented Jul 22, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jameshadfield Jul 24, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

huddlej left a comment

Choose a reason for hiding this comment

Big picture

Interface

Minor notes

rneher commented Jul 26, 2023 • edited Loading

huddlej commented Jul 27, 2023

rneher commented Jul 28, 2023

huddlej commented Jul 31, 2023

corneliusroemer left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

rneher Aug 3, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

corneliusroemer left a comment

Choose a reason for hiding this comment

codecov bot commented Aug 3, 2023 • edited Loading

Codecov Report

victorlin commented Aug 8, 2023

huddlej commented Aug 11, 2023

rneher commented Jul 22, 2023 •

edited

Loading

jameshadfield Jul 24, 2023 •

edited

Loading

rneher commented Jul 26, 2023 •

edited

Loading

rneher Aug 3, 2023 •

edited

Loading

codecov bot commented Aug 3, 2023 •

edited

Loading