Support ancestral sequence reconstruction for amino acid sequences #1054

corneliusroemer · 2022-09-20T20:36:44Z

Context

Early canonical Nextstrain workflows expected translations to be produced by augur translate after first running augur ancestral (e.g., the Zika workflow).

Nextalign can produce translations of aligned nucleotide sequences to amino acid sequences with one file per gene via --output-translations. Since these translations can be higher quality than those produced by augur translate and since augur ancestral only accepts nucleotide alignments as input, we have started to perform ancestral sequence reconstruction of amino acid sequences from Nextalign with custom scripts in ncov and seasonal flu workflows.

We considered two approaches to moving this ad hoc functionality into existing Augur subcommands including adding support for amino acid sequence inputs to either augur ancestral or augur translate.

After some synchronous discussion about these options, we decided that the augur ancestral approach would be most logically consistent with the existing interfaces, especially since the action we want to perform is ancestral sequence reconstruction and not translation.

Considerations

Since most workflows expect both nt_muts.json and aa_muts.json files, these workflows may break if we provide more than one amino acid mutations JSON, for example. Specifically, augur export may fail to properly merge annotations metadata present in separate JSONs into a single data structure.
Not all argument for the current augur ancestral interface will make sense for amino acid sequence inputs. We should consider grouping related and mutually exclusive arguments in the argparse interface.
Amino acid sequences will exist as one file per gene. We need to decide whether the ancestral interface should accept multiple sequence files at once or only one at a time.
- Current pathogen-specific scripts for this functionality accept multiple inputs and produce a single node data JSON. This approach has a couple of benefits including avoiding the issue mentioned above about merging multiple aa mut JSON files during export and also avoiding workflow complexity of one-rule-per-gene.
Pathogen-specific scripts for ancestral amino acid sequence reconstruction create implicit FASTA outputs for each FASTA input. We should decide how to explicitly name these outputs, so it is clear to users that they exist. The existing augur ancestral interface provides a --output-sequences argument that could be used for this purpose, perhaps with the same template string approach used by Nextalign to produce one file per gene (--output-translations "aa_alignment_for_{gene}.fasta").

The text was updated successfully, but these errors were encountered:

huddlej · 2023-07-20T20:59:41Z

Current interface for augur ancestral:

  --tree TREE, -t TREE  prebuilt Newick with named internal nodes (default: None)
  --alignment ALIGNMENT, -a ALIGNMENT
                        alignment in fasta or VCF format (default: None)
  --output-node-data OUTPUT_NODE_DATA
                        name of JSON file to save mutations and ancestral sequences to (default: None)
  --output-sequences OUTPUT_SEQUENCES
                        name of FASTA file to save ancestral sequences to (FASTA alignments only) (default: None)
  --inference {joint,marginal}
                        calculate joint or marginal maximum likelihood ancestral sequence states (default: joint)
  --vcf-reference VCF_REFERENCE
                        fasta file of the sequence the VCF was mapped to (only used if a VCF is provided as the alignment) (default: None)
  --output-vcf OUTPUT_VCF
                        name of output VCF file which will include ancestral seqs (default: None)
  --keep-ambiguous      do not infer nucleotides at ambiguous (N) sites on tip sequences (leave as N). (default: False)
  --infer-ambiguous     infer nucleotides at ambiguous (N,W,R,..) sites on tip sequences and replace with most likely state. (default: True)
  --keep-overhangs      do not infer nucleotides for gaps (-) on either side of the alignment (default: False)

Proposed interface for augur ancestral with amino acid sequence support:

inputs:
  --tree TREE, -t TREE  prebuilt Newick (default: None)
  --alignment ALIGNMENT [ALIGNMENT], -a ALIGNMENT [ALIGNMENT]
                        one or more alignments in FASTA or VCF format (default: None)

  --inference {joint,marginal}
                        calculate joint or marginal maximum likelihood ancestral sequence states (default: joint)

nucleotide only:
  --vcf-reference VCF_REFERENCE
                        fasta file of the sequence the VCF was mapped to (only used if a VCF is provided as the alignment) (default: None)
  --keep-ambiguous      do not infer nucleotides at ambiguous (N) sites on tip sequences (leave as N). (default: False)
  --infer-ambiguous     infer nucleotides at ambiguous (N,W,R,..) sites on tip sequences and replace with most likely state. (default: True)
  --keep-overhangs      do not infer nucleotides for gaps (-) on either side of the alignment (default: False)

amino acid only:
  --genes GENES   list of gene names corresponding to each input amino acid alignment
  --gene-map GENE_MAP  GFF gene map with nucleotide coordinates for requested genes that need to be exported in the JSON output as annotations
  --reference REFERENCE FASTA for the nucleotide sequence of the reference used for alignment whose length is needed for annotations in JSON output

outputs:
  --output-node-data OUTPUT_NODE_DATA
                        name of JSON file to save mutations and ancestral sequences to (default: None)
  --output-sequences OUTPUT_SEQUENCES
                        name of FASTA file to save ancestral sequences to (FASTA alignments only).
                        For amino acid sequences, provide a template string to create one output file per gene with a `{gene}` placeholder like "aa_sequences_for_{gene}.fasta". (default: None)
  --output-vcf OUTPUT_VCF
                        (nucleotide only) name of output VCF file which will include ancestral seqs (default: None)

Note that ad hoc scripts require the reference sequence as input both to export the reference name as an annotation and to export the nucleotide sequence length of the reference. For the first case, we could instead rely on the id column of the GFF input which should use the same value as the FASTA file. For the second case, we could use the sequence-region pragma of the GFF input to get the coordinate range of the reference (e.g., record.annotations["sequence-region"]). With these approaches, we could drop the required --reference input for amino acid sequences above.

corneliusroemer added the enhancement New feature or request label Sep 20, 2022

huddlej changed the title ~~ENH: Allow use of nextalign in augur translate~~ Support ancestral sequence reconstruction for amino acid sequences Jul 20, 2023

huddlej linked a pull request Jul 24, 2023 that will close this issue

feat: allow reconstruction of amino acid sequences #1258

Merged

1 task

huddlej closed this as completed in #1258 Aug 11, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support ancestral sequence reconstruction for amino acid sequences #1054

Support ancestral sequence reconstruction for amino acid sequences #1054

corneliusroemer commented Sep 20, 2022 •

edited by huddlej

Loading

huddlej commented Jul 20, 2023 •

edited

Loading

Support ancestral sequence reconstruction for amino acid sequences #1054

Support ancestral sequence reconstruction for amino acid sequences #1054

Comments

corneliusroemer commented Sep 20, 2022 • edited by huddlej Loading

Context

Considerations

huddlej commented Jul 20, 2023 • edited Loading

corneliusroemer commented Sep 20, 2022 •

edited by huddlej

Loading

huddlej commented Jul 20, 2023 •

edited

Loading