DRAFT: add 'gene' feature capture when reading genbank files #1435

j23414 · 2024-03-08T22:38:17Z

Description of proposed changes

In response to nextstrain/rsv#55 (comment), adds a 'gene' feature capture when reading GenBank files.

Please feel free to commit changes to this branch; this initial commit is a draft.

We should add a test of some sort to make sure this works as expected, perhaps in augur/tests/io?

The potential benefit of this functionality is to simplify the newreference.py script for Nextstrain gene tree builds.

Related issue(s)

Checklist

Add test for the new functionality of capturing 'gene' features
Checks pass
If making user-facing changes, add a message in CHANGES.md summarizing the changes in this PR

codecov · 2024-03-08T22:48:04Z

Codecov Report

Attention: Patch coverage is 0%, with 13 lines in your changes missing coverage. Please review.

Project coverage is 68.70%. Comparing base (7197b9f) to head (121e304).
Report is 17 commits behind head on master.

Files	Patch %	Lines
augur/utils.py	0.00%	12 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #1435      +/-   ##
==========================================
+ Coverage   68.66%   68.70%   +0.03%     
==========================================
  Files          69       69              
  Lines        7554     7608      +54     
  Branches     1851     1866      +15     
==========================================
+ Hits         5187     5227      +40     
- Misses       2089     2098       +9     
- Partials      278      283       +5


jameshadfield · 2024-03-10T20:40:19Z

augur/utils.py

@@ -432,6 +434,21 @@ def _read_genbank(reference, feature_names):
            if fname and (feature_names is None or fname in feature_names):
                features[fname] = feat

+        if feat.type=='gene':


To reduce the chances of this introducing unexpected changes to current usage, I'd suggest an approach where we only use a 'gene' feature if the corresponding 'CDS' feature doesn't exist. There are a few different ways to implement this, but the simplest may be to use an extra loop through gb.features.
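The suggestion above could be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: `Feature`, `feature_name`, and `collect_features` are hypothetical names, `Feature` is a stand-in for Biopython's `SeqFeature` so the example runs without Biopython, and the rule for naming a feature (first `gene` qualifier, else first `locus_tag`) is assumed rather than taken from `_read_genbank`.

```python
from dataclasses import dataclass, field


@dataclass
class Feature:
    """Hypothetical stand-in for a Biopython SeqFeature (only the attributes used here)."""
    type: str
    qualifiers: dict = field(default_factory=dict)


def feature_name(feat):
    """Return the first 'gene' or 'locus_tag' qualifier value, or None (assumed naming rule)."""
    for key in ("gene", "locus_tag"):
        values = feat.qualifiers.get(key)
        if values:
            return values[0]
    return None


def collect_features(gb_features, feature_names=None):
    """Collect CDS features first, then use 'gene' features only to fill gaps."""
    features = {}

    # First pass: CDS features, matching the existing behaviour.
    for feat in gb_features:
        if feat.type != "CDS":
            continue
        name = feature_name(feat)
        if name and (feature_names is None or name in feature_names):
            features[name] = feat

    # Extra pass: a 'gene' feature is used only if no CDS claimed that name.
    for feat in gb_features:
        if feat.type != "gene":
            continue
        name = feature_name(feat)
        if name and name not in features and (
            feature_names is None or name in feature_names
        ):
            features[name] = feat

    return features
```

Because the 'gene' pass never overwrites an existing entry, a reference that already annotates every gene as a CDS would give the same result as before, which is the property the review comment is asking for.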

Continuing this discussion based on nextstrain/rsv#55 (comment):

Theoretically CDSs are what we want, but we have GenBank files which use "gene" instead. I can't think of a situation where we want to extract both from the same genbank reference¹. One option is to look for 'gene' only if no CDSs are found. Another is to expose this as a command line argument to augur.

¹ I'm sure there are exceptions to this, in which case I think the onus is on the user to edit the GenBank file upstream of Augur.
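For comparison, the command-line option mentioned above might look something like the sketch below. The flag name `--genbank-feature-type` and the helpers `build_parser` and `select_features` are invented for illustration; augur does not necessarily expose any such option, and features are reduced to plain `(type, name)` tuples to keep the example self-contained.

```python
import argparse


def build_parser():
    """Build an illustrative parser; this is not augur's real CLI."""
    parser = argparse.ArgumentParser(description="Sketch of a feature-type option.")
    parser.add_argument(
        "--genbank-feature-type",  # hypothetical flag name
        choices=["CDS", "gene"],
        default="CDS",  # CDS stays the default so current usage is unchanged
        help="GenBank feature type to extract (default: %(default)s)",
    )
    return parser


def select_features(gb_features, feature_type):
    """Return the names of (type, name) pairs whose type matches feature_type."""
    return [name for ftype, name in gb_features if ftype == feature_type]
```

With this design the user chooses explicitly, so a file that contains both 'gene' and 'CDS' annotations never yields a mixture of the two.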

Is this something where we need to poll users to vote for "CDS" or "gene"? Or are we already fairly sure it's "CDS"?

"Another is to expose this as a command line argument to augur."

Ah, I could get behind having a command-line argument.

We're sure we want to use CDS. This is what we're moving towards, as only CDSs allow complex annotations such as ribosomal slippage. Nextclade v3 uses CDS going forward, and we've made similar changes for Auspice.

wip: add 'gene' feature capture when reading genbank files

c357955

jameshadfield mentioned this pull request Mar 10, 2024

Allows for CDS (as well as gene) features to generate a new gene reference nextstrain/rsv#55

Merged

1 task

jameshadfield reviewed Mar 10, 2024


wip: prefer CDS over gene

121e304

j23414 mentioned this pull request Apr 30, 2024

Add E gene trees nextstrain/dengue#18

Closed

1 task


jameshadfield Mar 10, 2024

jameshadfield Mar 21, 2024

j23414 Mar 22, 2024

corneliusroemer Apr 2, 2024
