Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: complex CDSs within augur ancestral #1333

Draft
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

corneliusroemer
Copy link
Member

Auspice now supports multi-part CDS annotations
This PR makes augur translate output the correct annotation for complex CDSs

Complex CDSs are detected through type(feat.location) being SeqFeature.CompountLocation

Currently, this only works with genbank files, as our load_features utility function does not
deal with compound CDS in GFF files correctly

I tested this with MERSs complex ORF1ab

Partially resolves #1280

Description of proposed changes

Related issue(s)

Checklist

  • Checks pass
  • If making user-facing changes, add a message in CHANGES.md summarizing the changes in this PR

Auspice now supports multi-part CDS annotations
This PR makes `augur translate` output the correct annotation for complex CDSs

Complex CDSs are detected through `type(feat.location)` being `SeqFeature.CompountLocation`

Currently, this only works with genbank files, as our load_features utility function does not
deal with compound CDS in GFF files correctly

I tested this with MERSs complex ORF1ab

Partially resolves #1280
Copy link

codecov bot commented Nov 4, 2023

Codecov Report

Attention: 5 lines in your changes are missing coverage. Please review.

Files Coverage Δ
augur/ancestral.py 73.88% <28.57%> (-1.84%) ⬇️

📢 Thoughts on this report? Let us know!.

@jameshadfield
Copy link
Member

This PR makes augur translate output the correct annotation for complex CDSs

I think you mean augur ancestral? Changes here are to ancestral.py and as far as I can tell translate doesn't import from ancestral.

I'd be happy to test this out - would you mind making available a test dataset?

jameshadfield added a commit that referenced this pull request Mar 15, 2024
Given a SeqFeqture with a CompoundLocation we now correctly write out
the CDS/gene using segmented coordinates. Auspice can now handle such
coordinates (see <nextstrain/auspice#1684> and
<#1281> for the corresponding
schema updates).

Note that the translations (via augur translate) of complex CDSs are not
modified in this commit.

Supersedes #1333
jameshadfield added a commit that referenced this pull request Mar 15, 2024
Given a SeqFeqture with a CompoundLocation we now correctly write out
the CDS/gene using segmented coordinates. Auspice can now handle such
coordinates (see <nextstrain/auspice#1684> and
<#1281> for the corresponding
schema updates).

Note that the translations (via augur translate) of complex CDSs did not
need modifying as they already used BioPython's SeqFeature.extract
method.

Supersedes #1333
jameshadfield added a commit that referenced this pull request Mar 16, 2024
Given a SeqFeqture with a CompoundLocation we now correctly write out
the CDS/gene using segmented coordinates. Auspice can now handle such
coordinates (see <nextstrain/auspice#1684> and
<#1281> for the corresponding
schema updates).

Note that the translations (via augur translate) of complex CDSs did not
need modifying as they already used BioPython's SeqFeature.extract
method.

Supersedes #1333
@jameshadfield
Copy link
Member

I think this is now done via #1438 but perhaps there are cases not covered. We (I?) should write docs.

@huddlej
Copy link
Contributor

huddlej commented Sep 26, 2024

@jameshadfield #1438 did not fix the issue for my use case trying to implement an "epitope sites" gene annotation. Since epitope sites are not contiguous, I needed to define them like this (just showing the first two Koel et al. sites for H3N2 HA as an example here):

##gff-version 3
##sequence-region CY163680.1 1 1737
CY163680.1	feature	gene	18	65	.	+	.	gene_name="SigPep"
CY163680.1	feature	gene	66	1052	.	+	.	gene_name="HA1"
CY163680.1	feature	gene	501	533	.	+	.	gene=Koel;ID=gene-Koel
CY163680.1	feature	CDS	501	503	.	+	.	gene=Koel;ID=cds-Koel;Parent=gene-Koel
CY163680.1	feature	CDS	531	533	.	+	.	gene=Koel;ID=cds-Koel;Parent=gene-Koel
CY163680.1	feature	gene	1053	1715	.	+	.	gene_name="HA2"

This file is based on the example SARS-CoV-2 gene map from the Nextclade docs. Nextclade properly parses this GFF file and exports the correct translations for these two sites, but augur ancestral fails to load the GFF, because it tests whether the length of the amino acid sequences from TreeTime is 3 times the length of the full gene feature.

@corneliusroemer's implementation here seems like the right direction, but it also doesn't work for me with BioPython 1.84 an bcbio-gff 0.7.1 because the compound location gets parsed as a SimpleLocation with multiple subfeatures and the _read_gff function doesn't include CDS as a valid type. To bypass the length validation error locally, I had to make the following changes:

diff --git a/augur/ancestral.py b/augur/ancestral.py
index 932d8607..23e2b2b4 100644
--- a/augur/ancestral.py
+++ b/augur/ancestral.py
@@ -449,7 +449,13 @@ def run(args):

             aa_result = run_ancestral(T, fname, reference_sequence=reference_sequence, is_vcf=is_vcf, fill_overhangs=not args.keep_overhangs,
                                         marginal=args.inference, infer_ambiguous=infer_ambiguous, alphabet='aa')
-            if aa_result['tt'].data.full_length*3 != len(feat):
+
+            if len(feat.sub_features) > 0:
+                feature_length = sum(len(sub_feat) for sub_feat in feat.sub_features)
+            else:
+                feature_length = len(feat)
+
+            if (aa_result['tt'].data.full_length * 3) != feature_length:
                 raise AugurError(f"length of translated alignment for {gene} does not match length of reference feature."
                        " Please make sure that the annotation matches the translations.")

diff --git a/augur/utils.py b/augur/utils.py
index d794d256..827f23d6 100644
--- a/augur/utils.py
+++ b/augur/utils.py
@@ -289,7 +289,7 @@ def _read_gff(reference, feature_names):
         If a gene is found with the name 'nuc'
     """
     from BCBio import GFF
-    valid_types = ['gene', 'source', 'region']
+    valid_types = ['gene', 'source', 'region', 'CDS']
     features = {}

     with open_file(reference) as in_handle:

These changes alone do not fix the issue, though, since the downstream ancestral logic needs to know how to export multiple subfeatures as annotations instead of the full gene.

Before working on this more, though, I wonder if I'm thinking about the GFF implementation incorrectly? Would you approach this problem in a different way, @jameshadfield?

@jameshadfield jameshadfield changed the title feat: output genome annotations for complex CDS feat: complex CDSs within augur ancestral Sep 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Genome annotations for segmented CDSs etc
3 participants