Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DRAFT: add 'gene' feature capture when reading genbank files #1435

Draft
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

j23414
Copy link
Contributor

@j23414 j23414 commented Mar 8, 2024

Description of proposed changes

In response to nextstrain/rsv#55 (comment), adds a 'gene' feature capture when reading GenBank files.

Please feel free to commit changes to this branch, this initial commit was a draft.

We should add a test of some sort to make sure this works as expected, perhaps in augur/tests/io?

The potential benefit of this functionality is to simplify the newreference.py script for Nextstrain gene tree builds.

Related issue(s)

Checklist

  • Add test for the new functionality of capturing 'gene' features
  • Checks pass
  • If making user-facing changes, add a message in CHANGES.md summarizing the changes in this PR

Copy link

codecov bot commented Mar 8, 2024

Codecov Report

Attention: Patch coverage is 0% with 13 lines in your changes are missing coverage. Please review.

Project coverage is 68.70%. Comparing base (7197b9f) to head (121e304).
Report is 17 commits behind head on master.

Files Patch % Lines
augur/utils.py 0.00% 12 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1435      +/-   ##
==========================================
+ Coverage   68.66%   68.70%   +0.03%     
==========================================
  Files          69       69              
  Lines        7554     7608      +54     
  Branches     1851     1866      +15     
==========================================
+ Hits         5187     5227      +40     
- Misses       2089     2098       +9     
- Partials      278      283       +5     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@@ -432,6 +434,21 @@ def _read_genbank(reference, feature_names):
if fname and (feature_names is None or fname in feature_names):
features[fname] = feat

if feat.type=='gene':
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To reduce the chances of this introducing unexpected changes to current usage, I'd suggest an approach where we only use 'gene' features if the corresponding 'CDS' feature doesn't exist. There's a few different ways to implement this, but the simplest may be to use an extra loop through gb.features.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Continuing this discussion based on nextstrain/rsv#55 (comment):

Theoretically CDSs are what we want, but we have GenBank files which use "gene" instead. I can't think of a situation where we want to extract both from the same genbank reference¹. One option is to look for 'gene' only if no CDSs are found. Another is to expose this as a command line argument to augur.

¹ I'm sure there are exceptions to this, in which case I think the onus is on the user to edit the GenBank file upstream of Augur.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this something where we need to poll users to vote for "CDS" or "gene"? Or it sounds like we're pretty sure it's "CDS"?

Another is to expose this as a command line argument to augur.

Ah, I could get behind having a command line argument

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're sure we want to use CDS, this is what we're moving towards as only CDSes allow complex annotations like ribosomal slippage etc. Nextclade v3 uses CDS moving forward, we've made similar changes for auspice

@j23414 j23414 mentioned this pull request Apr 30, 2024
1 task
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants