Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DRAFT: add 'gene' feature capture when reading genbank files #1435

Draft
wants to merge 2 commits into
base: master
Choose a base branch
from
Draft
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
24 changes: 22 additions & 2 deletions augur/utils.py
Original file line number Diff line number Diff line change
Expand Up @@ -385,17 +385,18 @@ def _read_nuc_annotation_from_genbank(record, reference):

def _read_genbank(reference, feature_names):
"""
Read a GenBank file. We only read GenBank feature keys 'CDS' or 'source'.
Read a GenBank file. We only read GenBank feature keys 'source', 'CDS', or 'gene'.
We create a "feature name" via:
- for 'source' features use 'nuc'
- for 'CDS' features use the locus_tag or the gene. If neither, then silently ignore.
- for 'gene' feature use the locus_tag or the gene. If neither, then silently ignore.

Parameters
----------
reference : string
File path to GenBank reference
feature_names : None or set or list
Restrict the CDSs we read to those in the set/list
Restrict the CDSs or genes we read to those in the set/list

Returns
-------
Expand All @@ -408,6 +409,7 @@ def _read_genbank(reference, feature_names):
AugurError
If 'nuc' annotation not parsed
If a CDS feature is given the name 'nuc'
If a 'gene' feature is given the name 'nuc'
"""
from Bio import SeqIO
gb = SeqIO.read(reference, 'genbank')
Expand All @@ -432,6 +434,24 @@ def _read_genbank(reference, feature_names):
if fname and (feature_names is None or fname in feature_names):
features[fname] = feat

if feat.type=='gene':
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To reduce the chances of this introducing unexpected changes to current usage, I'd suggest an approach where we only use 'gene' features if the corresponding 'CDS' feature doesn't exist. There's a few different ways to implement this, but the simplest may be to use an extra loop through gb.features.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Continuing this discussion based on nextstrain/rsv#55 (comment):

Theoretically CDSs are what we want, but we have GenBank files which use "gene" instead. I can't think of a situation where we want to extract both from the same genbank reference¹. One option is to look for 'gene' only if no CDSs are found. Another is to expose this as a command line argument to augur.

¹ I'm sure there are exceptions to this, in which case I think the onus is on the user to edit the GenBank file upstream of Augur.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this something where we need to poll users to vote for "CDS" or "gene"? Or it sounds like we're pretty sure it's "CDS"?

Another is to expose this as a command line argument to augur.

Ah, I could get behind having a command line argument

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're sure we want to use CDS, this is what we're moving towards as only CDSes allow complex annotations like ribosomal slippage etc. Nextclade v3 uses CDS moving forward, we've made similar changes for auspice

fname = None
if "locus_tag" in feat.qualifiers:
fname = feat.qualifiers["locus_tag"][0]
elif "gene" in feat.qualifiers:
fname = feat.qualifiers["gene"][0]
else:
features_skipped+=1

if fname == 'nuc':
raise AugurError(f"Reference {reference!r} contains a 'gene' feature with the name 'nuc'. This is not allowed.")

if fname and (feature_names is None or fname in feature_names):
if fname not in features:
features[fname] = feat
else:
print(f"WARNING: gene: '{fname}' already exists as a CDS: '{fname}'. Skipping this feature.")

if features_skipped:
print(f"WARNING: {features_skipped} CDS features skipped as they didn't have a locus_tag or gene qualifier.")

Expand Down
Loading