Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support ancestral sequence reconstruction for amino acid sequences #1054

Closed
corneliusroemer opened this issue Sep 20, 2022 · 1 comment · Fixed by #1258
Closed

Support ancestral sequence reconstruction for amino acid sequences #1054

corneliusroemer opened this issue Sep 20, 2022 · 1 comment · Fixed by #1258
Labels
enhancement New feature or request

Comments

@corneliusroemer
Copy link
Member

corneliusroemer commented Sep 20, 2022

Context

Early canonical Nextstrain workflows expected translations to be produced by augur translate after first running augur ancestral (e.g., the Zika workflow).

Nextalign can produce translations of aligned nucleotide sequences to amino acid sequences with one file per gene via --output-translations. Since these translations can be higher quality than those produced by augur translate and since augur ancestral only accepts nucleotide alignments as input, we have started to perform ancestral sequence reconstruction of amino acid sequences from Nextalign with custom scripts in ncov and seasonal flu workflows.

We considered two approaches to moving this ad hoc functionality into existing Augur subcommands including adding support for amino acid sequence inputs to either augur ancestral or augur translate.

After some synchronous discussion about these options, we decided that the augur ancestral approach would be most logically consistent with the existing interfaces, especially since the action we want to perform is ancestral sequence reconstruction and not translation.

Considerations

  • Since most workflows expect both nt_muts.json and aa_muts.json files, these workflows may break if we provide more than one amino acid mutations JSON, for example. Specifically, augur export may fail to properly merge annotations metadata present in separate JSONs into a single data structure.
  • Not all argument for the current augur ancestral interface will make sense for amino acid sequence inputs. We should consider grouping related and mutually exclusive arguments in the argparse interface.
  • Amino acid sequences will exist as one file per gene. We need to decide whether the ancestral interface should accept multiple sequence files at once or only one at a time.
    • Current pathogen-specific scripts for this functionality accept multiple inputs and produce a single node data JSON. This approach has a couple of benefits including avoiding the issue mentioned above about merging multiple aa mut JSON files during export and also avoiding workflow complexity of one-rule-per-gene.
  • Pathogen-specific scripts for ancestral amino acid sequence reconstruction create implicit FASTA outputs for each FASTA input. We should decide how to explicitly name these outputs, so it is clear to users that they exist. The existing augur ancestral interface provides a --output-sequences argument that could be used for this purpose, perhaps with the same template string approach used by Nextalign to produce one file per gene (--output-translations "aa_alignment_for_{gene}.fasta").
@corneliusroemer corneliusroemer added the enhancement New feature or request label Sep 20, 2022
@huddlej huddlej changed the title ENH: Allow use of nextalign in augur translate Support ancestral sequence reconstruction for amino acid sequences Jul 20, 2023
@huddlej
Copy link
Contributor

huddlej commented Jul 20, 2023

Current interface for augur ancestral:

  --tree TREE, -t TREE  prebuilt Newick with named internal nodes (default: None)
  --alignment ALIGNMENT, -a ALIGNMENT
                        alignment in fasta or VCF format (default: None)
  --output-node-data OUTPUT_NODE_DATA
                        name of JSON file to save mutations and ancestral sequences to (default: None)
  --output-sequences OUTPUT_SEQUENCES
                        name of FASTA file to save ancestral sequences to (FASTA alignments only) (default: None)
  --inference {joint,marginal}
                        calculate joint or marginal maximum likelihood ancestral sequence states (default: joint)
  --vcf-reference VCF_REFERENCE
                        fasta file of the sequence the VCF was mapped to (only used if a VCF is provided as the alignment) (default: None)
  --output-vcf OUTPUT_VCF
                        name of output VCF file which will include ancestral seqs (default: None)
  --keep-ambiguous      do not infer nucleotides at ambiguous (N) sites on tip sequences (leave as N). (default: False)
  --infer-ambiguous     infer nucleotides at ambiguous (N,W,R,..) sites on tip sequences and replace with most likely state. (default: True)
  --keep-overhangs      do not infer nucleotides for gaps (-) on either side of the alignment (default: False)

Proposed interface for augur ancestral with amino acid sequence support:

inputs:
  --tree TREE, -t TREE  prebuilt Newick (default: None)
  --alignment ALIGNMENT [ALIGNMENT], -a ALIGNMENT [ALIGNMENT]
                        one or more alignments in FASTA or VCF format (default: None)

  --inference {joint,marginal}
                        calculate joint or marginal maximum likelihood ancestral sequence states (default: joint)

nucleotide only:
  --vcf-reference VCF_REFERENCE
                        fasta file of the sequence the VCF was mapped to (only used if a VCF is provided as the alignment) (default: None)
  --keep-ambiguous      do not infer nucleotides at ambiguous (N) sites on tip sequences (leave as N). (default: False)
  --infer-ambiguous     infer nucleotides at ambiguous (N,W,R,..) sites on tip sequences and replace with most likely state. (default: True)
  --keep-overhangs      do not infer nucleotides for gaps (-) on either side of the alignment (default: False)

amino acid only:
  --genes GENES   list of gene names corresponding to each input amino acid alignment
  --gene-map GENE_MAP  GFF gene map with nucleotide coordinates for requested genes that need to be exported in the JSON output as annotations
  --reference REFERENCE FASTA for the nucleotide sequence of the reference used for alignment whose length is needed for annotations in JSON output

outputs:
  --output-node-data OUTPUT_NODE_DATA
                        name of JSON file to save mutations and ancestral sequences to (default: None)
  --output-sequences OUTPUT_SEQUENCES
                        name of FASTA file to save ancestral sequences to (FASTA alignments only).
                        For amino acid sequences, provide a template string to create one output file per gene with a `{gene}` placeholder like "aa_sequences_for_{gene}.fasta". (default: None)
  --output-vcf OUTPUT_VCF
                        (nucleotide only) name of output VCF file which will include ancestral seqs (default: None)

Note that ad hoc scripts require the reference sequence as input both to export the reference name as an annotation and to export the nucleotide sequence length of the reference. For the first case, we could instead rely on the id column of the GFF input which should use the same value as the FASTA file. For the second case, we could use the sequence-region pragma of the GFF input to get the coordinate range of the reference (e.g., record.annotations["sequence-region"]). With these approaches, we could drop the required --reference input for amino acid sequences above.

@huddlej huddlej linked a pull request Jul 24, 2023 that will close this issue
1 task
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
No open projects
Development

Successfully merging a pull request may close this issue.

2 participants