Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: allow reconstruction of amino acid sequences #1258

Merged
merged 23 commits into from
Aug 11, 2023
Merged

Conversation

rneher
Copy link
Member

@rneher rneher commented Jul 22, 2023

augur ancestral so far only dealt with nucleotide sequences, while translation were done codon by codon from the nucleotide sequences in later steps. If translations for tips are available, the ancestral amino acid sequences can be determined analogously via ancestral reconstruction. This is implemented in this commit.

For each node, the node data structure will contain a muts and a aa_muts field. The latter is a dict with gene/cds names as keys and <ancestral><pos><derived> mutations (same as in augur translate). In addition, an annotation and reference sequences are derived stored in the node-data-json.

Checklist

  • Add a message in CHANGES.md summarizing the changes in this PR that are end user focused. Keep headers and formatting consistent with the rest of the file.

@rneher
Copy link
Member Author

rneher commented Jul 22, 2023

Pathogen workflows are currently failing because the node data jsons this produces doesn't contain the reconstructed sequences. I'll re-enable that.

@rneher
Copy link
Member Author

rneher commented Jul 22, 2023

augur ancestral has a whole bunch of conditions and flags that are passed around and I am not sure anybody fully see through what they do and why there were put in place. There is is_vcf which is fairly obvious, but the vcf functionality is not used much and not well tested. Then there is the infer_ambiguous which used to determine how the mask is set. But I don't quite understand the rational.

@rneher
Copy link
Member Author

rneher commented Jul 22, 2023

tests now fail with

@@ -21,9 +21,9 @@
       "nuc": {
         "end": 10769,
         "start": 1,
-        "strand": "+"
+        "strand": "+",
+        "type": "source"
       }
-    },

which I consider a problem with the tests...

@rneher rneher requested a review from victorlin July 24, 2023 09:13
@huddlej huddlej linked an issue Jul 24, 2023 that may be closed by this pull request
Comment on lines +259 to +290
anc_seqs['annotations'] = {'nuc': {'start': 1, 'end': len(anc_seqs['reference']['nuc']),
'strand': '+', 'type': 'source'}}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the purpose of type: source?

This change is the reason for the failing tests. The test can be easily changed to pass again, but I want to understand the reasoning first.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

source is the name of the sequence with respect to which annotations in genbank files are made. If you look at this file for example you see a source field at the beginning of the annotation block. In GFF these are called landmarks. In augur translate we assign this also for the feature type source:

https://github.com/nextstrain/augur/blob/master/augur/translate.py#L395-L400

I added this here just in analogy to translate, I don't think auspice requires it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While Auspice doesn't use this, the "type" key (with string value) is specified in our node-data JSON annotations schema and exported (auspice) JSON schema so tests shouldn't fail.

if not is_vcf and args.genes:
from .utils import load_features
## load features; only requested features if genes given
features = load_features(args.annotation, args.genes)
Copy link
Member

@jameshadfield jameshadfield Jul 24, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure whether it's fair to lump it into this PR but our annotations parsing needs a lot of work.

Testing this PR (for HepB) with NC_003977.gb (straight from GenBank):

File "***/augur/augur/ancestral.py", line 272, in run
feat = features[gene]
KeyError: 'X'

Testing with the nextclade GFF for the same genome:

Couldn't find gene X in GFF or GenBank file
Couldn't find gene pre-capsid in GFF or GenBank file
Couldn't find gene capsid in GFF or GenBank file
Couldn't find gene pol in GFF or GenBank file
Couldn't find gene envL in GFF or GenBank file
Couldn't find gene envM in GFF or GenBank file
Couldn't find gene envS in GFF or GenBank file
Read in 0 features from reference sequence file
Processing gene: X
Traceback (most recent call last):
  File "/Users/enigma/projects/nextstrain/augur/augur/__init__.py", line 66, in run
    return args.__command__.run(args)
  File "/Users/enigma/projects/nextstrain/augur/augur/ancestral.py", line 272, in run
    feat = features[gene]
KeyError: 'X'

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The following GFF file works. However as this is intended to be run in conjunction with Nextclade we should really sort the parsing out!

##gff-version 3
#!gff-spec-version 1.21
#!processor NCBI annotwriter
##sequence-region NC_003977.2 1 3182
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=10407
NC_003977.2	RefSeq	region	1	3182	.	+	.	ID=NC_003977.2:1..3182;Dbxref=taxon:10407;Is_circular=true;gbkey=Src;genome=genomic;mol_type=genomic DNA;strain=ayw
NC_003977.2	RefSeq	gene	1376	1840	.	+	0	gene=X;
NC_003977.2	RefSeq	gene	1816	2454	.	+	0	gene=pre-capsid;
NC_003977.2	RefSeq	gene	1903	2454	.	+	0	gene=capsid;
NC_003977.2	RefSeq	gene	2309	4807	.	+	0	gene=pol;
NC_003977.2	RefSeq	gene	2850	4019	.	+	0	gene=envL;
NC_003977.2	RefSeq	gene	3174	4019	.	+	0	gene=envM;
NC_003977.2	RefSeq	gene	157	837	.	+	0	gene=envS;

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, this is a bit tricky. Our parsing is first limiting itself to source and gene, and then ignores "subfeatures". The GFF parser parses CDS that are children of a gene into feat.subfeatures. So yes, this is something we need to tackle.

What we want load_features to do is to return a source/landmark feature and then all the bits that need translating, probably best as type='CDS' rather than type='gene'.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a simpler parser for the nextclade json here:
#1263

Copy link
Contributor

@huddlej huddlej left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for implementing this, @rneher! Below are some initial thoughts from playing with the new interface. I have only tested the new functionality in an ad hoc way from the command line. I'll spend a little more time tomorrow playing with its integration into a workflow. I can mockup functional tests once we settle the interface.

Big picture

I notice that the interface here allows users to reconstruct nucleotide and amino acid sequences in the same command. I had originally been thinking of this new functionality as a drop-in replacement for the standalone scripts like translations_aamut.py so our workflows would continue to have an nt_muts.json and an aa_muts.json. The interface in this PR makes more sense, I think, although we’ll have to be careful in our documentation and tutorials to clarify when you need two rules with augur ancestral and translate (e.g., Zika) vs. when you need one rule with augur ancestral. When we only use one call to ancestral, should we standardize the output filename to just muts.json? I guess we’ll eventually have most pathogens using Nextalign, so maybe there won’t be much confusion in that future.

Interface

For the command line options, I wish the --translations argument accepted a list of specific files instead of a template string. Even it is a bit of extra typing for users, it is clearer to list the specific files and ties into workflow managers better as a dynamic list of inputs. I guess the list becomes a pain for SARS-CoV-2 and other pathogens with a lot of genes, but only when manually running commands from the terminal.

I also wish that the new interface including an argument for --output-translations akin to augur translate’s --alignment-output argument, so we could have reconstructed translations for internal nodes in FASTA format. We use those outputs in multiple places in the seasonal flu workflow, for example. The corresponding argument in translate accepts a template string like 'my_alignment_%GENE.fasta', but I slightly prefer the way the ncov workflow explicitly lists the output files with internal node translations even though the explicit_translation script creates those outputs implicitly. This approach allows you to explicitly reference output FASTA files from other rules. An example command from the seasonal flu repo directory for H3N2 HA would be:

augur ancestral \
  --tree tree.nwk \
  --alignment aligned.fasta \
  --annotation ../../../config/h3n2/ha/genemap.gff \
  --genes SigPep HA1 HA2 \
  --translations nextalign/masked.gene.SigPep.fasta \
                 nextalign/masked.gene.HA1.fasta \
                 nextalign/masked.gene.HA2.fasta \
  --output-node-data muts.json \
  --output-sequences nt_muts.fasta \ # Not currently used in flu workflow
  --output-translations aa_muts.SigPep.fasta \
                        aa_muts.HA1.fasta \
                        aa_muts.HA2.fasta

The new ability to annotate mutations on the branch leading to the inferred root of the tree using the --root-sequence argument is really cool! Does this argument refer to “root” as in the root of the tree? It seems more like the “reference” of the alignment which could also be the root of the tree but doesn’t have to be.

Minor notes

I’d like to add some functional tests for the new interface, once we agree on what that should be (based on discussion on the above).

We’ll need to update the docstring of the ancestral.py module to refer to amino acid sequences in addition to nucleotide sequences.

As noted above, arguments need grouping into different sections and checks for mutually required arguments.

As a very minor note, I wish that internal functions run_ancestral and collect_mutations_and_sequences didn’t need to pass around the state of the input being VCF or not (is_vcf). That the input is VCF feels like a technical detail for the command line interface that affects downstream steps in different ways that need their own flags. For example, I found the original interface for collect_mutations_and_sequences with the mask_ambiguous argument much clearer to read. The calling function(s) could determine whether being a VCF means masking ambiguous characters and that would decouple the internal functions from the details of the input file format. The run_ancestral function could accept an argument for mask_ambiguous that it passes on to the lower level function and it could also accept a boolean argument like use_root_as_reference that would fill the place of is_vcf in the ref=root_sequence if is_vcf else None expression. (Whew! The least important note required the most text to describe…)

@rneher
Copy link
Member Author

rneher commented Jul 26, 2023

Thanks @huddlej . A few follow up comments:

nucleotide and amino acid sequences in the same command

my idea here was to pick up the nextalign output where both alignments are available.

I wish the --translations argument accepted a list of specific files

your wish might be granted :) easy change. I had just copied the corresponding (output) parts in translate.py. But we then have to rely on --genes and --translations to be in the same order AND match whatever is in the annotation. In consistencies can be spotted by comparing the length of the annotation with the length of alignment.
UPDATE: changed to be list of files.

--output-translations ...

easy to add -- mostly a question about interface. I was thinking of removing the full ancestral sequences from the node data json.

root vs reference sequence

I agree that "reference" is more natural in this context, but I worried about it being ubiquitously used across workflows for various purposes, most of which don't have anything to do with "mutations". It has, on the other hand, exactly the same purpose as root-sequence.json side car. we could name it something like --external-root-sequence?

is_vcf

the previous version was conflating is_vcf with full_sequences for reasons unclear to me. We typically want to avoid storing full sequences when dealing with vcf data, but I didn't see why they should be synonymous.

@huddlej
Copy link
Contributor

huddlej commented Jul 27, 2023

I cleaned up functional tests for augur ancestral (and fix them so they pass with the new output format for annotations) and then added one for the amino acid sequence reconstruction. In doing so for the Zika example data, I realized how painful it was to list 12 genes with 12 input files even without listing 12 output files for the reconstructed translations. Based on that experience, I'm more sold on the template string input/output approach that you had originally implemented, @rneher!

I also found a new (to me) error with TreeTime when I tried to run augur ancestral with just the 2K Zika gene's amino acid sequences which I've included below. My guess is that the issue is that all of the input amino acid sequences are identical. I'll create a separate TreeTime issue for this.

augur ancestral is using TreeTime version 0.11.0
Read in 2 features from reference sequence file
Processing gene: 2K
/Users/jhuddlesfredhutch.org/projects/nextstrain/treetime/treetime/gtr.py:554: RuntimeWarning: divide by zero encountered in divide
  W_ij = (nij+nij.T+2*pc_mat)/mu/(np.outer(pi,Ti) + np.outer(Ti,pi)
/Users/jhuddlesfredhutch.org/projects/nextstrain/treetime/treetime/gtr.py:554: RuntimeWarning: invalid value encountered in divide
  W_ij = (nij+nij.T+2*pc_mat)/mu/(np.outer(pi,Ti) + np.outer(Ti,pi)
/Users/jhuddlesfredhutch.org/projects/nextstrain/treetime/treetime/gtr.py:10: RuntimeWarning: invalid value encountered in scalar subtract
  return (np.einsum('i,ij,j', pi, W, pi) - np.sum(pi*W[:,gap_index])*pi[gap_index])/(1-pi[gap_index])
Traceback (most recent call last):
  File "/Users/jhuddlesfredhutch.org/projects/nextstrain/augur/augur/__init__.py", line 66, in run
    return args.__command__.run(args)
  File "/Users/jhuddlesfredhutch.org/projects/nextstrain/augur/augur/ancestral.py", line 274, in run
    aa_result = run_ancestral(T, fname, root_sequence=root_seq, is_vcf=is_vcf, fill_overhangs=not args.keep_overhangs,
  File "/Users/jhuddlesfredhutch.org/projects/nextstrain/augur/augur/ancestral.py", line 131, in run_ancestral
    tt = ancestral_sequence_inference(tree=T, aln=aln, ref=root_sequence if is_vcf else None, marginal=marginal,
  File "/Users/jhuddlesfredhutch.org/projects/nextstrain/augur/augur/ancestral.py", line 70, in ancestral_sequence_inference
    tt.infer_ancestral_sequences(infer_gtr=infer_gtr, marginal=bool_marginal,
  File "/Users/jhuddlesfredhutch.org/projects/nextstrain/treetime/treetime/treeanc.py", line 526, in infer_ancestral_sequences
    self.infer_gtr(marginal=marginal, **kwargs)
  File "/Users/jhuddlesfredhutch.org/projects/nextstrain/treetime/treetime/treeanc.py", line 1497, in infer_gtr
    self._gtr = GTR.infer(n_ij, T_ia.sum(axis=-1), root_state, fixed_pi=fixed_pi, pc=pc,
  File "/Users/jhuddlesfredhutch.org/projects/nextstrain/treetime/treetime/gtr.py", line 550, in infer
    while LA.norm(pi_old-pi) > dp and count < Nit:
  File "/Users/jhuddlesfredhutch.org/miniconda3/envs/augur/lib/python3.10/site-packages/scipy/linalg/_misc.py", line 146, in norm
    a = np.asarray_chkfinite(a)
  File "/Users/jhuddlesfredhutch.org/miniconda3/envs/augur/lib/python3.10/site-packages/numpy/lib/function_base.py", line 628, in asarray_chkfinite
    raise ValueError(
ValueError: array must not contain infs or NaNs

@rneher
Copy link
Member Author

rneher commented Jul 28, 2023

thanks! We can revert back to the input template -- I don't have strong feelings. The check that the length of the annotation is consistent with the file is still useful, I think.

the TreeTime error is indeed from the identical sequences. There is nothing to estimate the GTR model from. I thought this had been regularized by pseudo counts.

@huddlej
Copy link
Contributor

huddlej commented Jul 31, 2023

Thanks for reverting that, @rneher. I will add the option to write out translation FASTAs and update the functional test.

Copy link
Member

@corneliusroemer corneliusroemer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested the new annotation and root-sequence features and they worked for me - save the documentation issue with single vs double % in the template definition

CHANGES.md Outdated Show resolved Hide resolved
augur/ancestral.py Outdated Show resolved Hide resolved
parser.add_argument('--genes', nargs='+', help="genes to translate (list or file containing list)")
parser.add_argument('--translations', type=str, help="translated alignments for each CDS/Gene. "
"Currently only supported for fasta-input. Specify the file name via a "
"template like 'my_alignment_%%GENE.fasta' where %%GENE will be replaced "
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rneher When I tried to use the template with double %% I got an error:

ERROR: SequenceData: loading alignment failed... builds/21L/translations/aligned.gene.%ORF1a.fasta

The correct template string might currently require only a single percent sign - not sure whether this is an error in the documentation or the implementation.

When I used a single % things worked fine.

Copy link
Member Author

@rneher rneher Aug 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

percent signs need to be escaped with another percent sign. augur ancestral -h shows it only with a single percent sign as intended.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, I've never used Python pre f-strings which make use of % for formatting unnecessary.

Adding an f or r before the string would allow us to not need to escape. But this is a matter of taste and not blocking.

Copy link
Member

@corneliusroemer corneliusroemer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Super useful for making Nextclade datasets, am already using it through pip install -e . until we've got this released.

@codecov
Copy link

codecov bot commented Aug 3, 2023

Codecov Report

Patch coverage: 86.17% and project coverage change: +0.15% 🎉

Comparison is base (c4d07c7) 69.69% compared to head (7cfbd9b) 69.85%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1258      +/-   ##
==========================================
+ Coverage   69.69%   69.85%   +0.15%     
==========================================
  Files          67       67              
  Lines        7099     7155      +56     
  Branches     1726     1740      +14     
==========================================
+ Hits         4948     4998      +50     
- Misses       1846     1851       +5     
- Partials      305      306       +1     
Files Changed Coverage Δ
augur/ancestral.py 75.14% <86.17%> (+7.00%) ⬆️

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@victorlin
Copy link
Member

I split 9bde475...73e3cdd into #1275 to fix CI for other PRs. I force-pushed to rebase on the merged changes and drop those commits from here.

rneher and others added 9 commits August 11, 2023 11:30
augur ancestral so far only dealt with nucleotide sequences, while
translation were done codon by codon from the nucleotide sequences in
later steps. If translations for tips are available, the ancestral
amino acid sequences can be determined analogously via ancestral
reconstruction. This is implemented in this commit.

For each node, the node data structure will contain a `muts` and a
`aa_muts` field. The latter is a dict with gene/cds names as keys and
`<ancestral><pos><derived>` mutations (same as in augur translate).
In addition, an annotation and reference sequences are derived stored
in the node-data-json.
huddlej and others added 14 commits August 11, 2023 11:31
Reorganize functional tests for augur ancestral to follow the standard
pattern with "cram" and "data" subdirectories and separate cram files
for individual functional tests. Paves the way for adding another
functional test for amino acid sequence reconstruction.
Adds amino acid sequences for the corresponding nucleotide alignment in
the ancestral data directory (created from augur translate but mimicking
output from Nextalign) and adds a functional test for the new interface
that allows reconstruction of amino acid sequences for internal nodes
from the given tip sequences per gene.
Adds logic to save ancestral amino acid sequences to one FASTA file per
gene and updates the functional test to reflect the expected number of
internal nodes in these outputs.
Fixes a bug with VCF inputs for augur ancestral where the reference
sequence is already a string, so it could not be converted to a string
from a BioPython sequence as the existing code expected. This commit
converts the BioPython Seq instance into a string after loading the root
sequence, so we always work with a string. It also updates the docstring
for `ancestral_sequence_inference` to include the "ref" argument.
Adds a functional test for the root sequence input of the new ancestral
interface and fixes a bug revealed by this test where the reference
sequence needs to be converted to a BioPython SeqIO.Seq instance prior
to calling the feature extraction method on it. The feature extraction
method accepts Seq and str inputs, but it will return an output of the
same type. When we previously passed a string, the returned string
didn't have the expected `translate` method.
Reorganizes arguments for ancestral subcommand into logical argument
groups and adds validation and a functional test for new amino acid
sequence inputs that must all be provided when any are.
Ignores the path to GFF/GenBank annotations in the JSON output for an
augur ancestral functional test. This path will vary across platforms
based on the working directory used.
Include link to PR, PR authors, and note about new translations output.
@huddlej
Copy link
Contributor

huddlej commented Aug 11, 2023

Force-pushed to fix the merge conflict. This should be good to merge.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

Support ancestral sequence reconstruction for amino acid sequences
5 participants