-
Notifications
You must be signed in to change notification settings - Fork 129
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Port translate-genbank-location
to augur curate parse-genbank-location
[#1485]
#1508
Conversation
* Convert script over to expected sub-command style * Add `--location-field` argument to preserve back-compat with older data files * Add type hints throughout * Add tests
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #1508 +/- ##
==========================================
+ Coverage 69.40% 69.53% +0.12%
==========================================
Files 72 73 +1
Lines 7755 7788 +33
Branches 1900 1905 +5
==========================================
+ Hits 5382 5415 +33
Misses 2087 2087
Partials 286 286 ☔ View full report in Codecov by Sentry. |
The one failing test is expected. |
975262e
to
ee3be71
Compare
) -> Generator[dict, None, None]: | ||
for record in records: | ||
database = record.get("database", "") | ||
if database in {"GenBank", "RefSeq"}: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
non-blocking, for discussion only
A part of me just wants to remove the database
check and let people use the command at their own discretion...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean, what's the downside to that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was originally written for mpox, where at one point we had a mix of data sources going through the same curation pipeline. This database check allowed the script to skip records that were not GenBank records.
So I guess the downside is people would not be able to use the same curation pipeline for multiple data sources...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean, it's going to no-op if the record doesn't have the geo_loc_name
field (or whatever --location-field
is set to); did the non-GenBank records have that field? Seems to me like duck-typing on the field you're operating on is a better check than keying off a second field, but I am but an egg here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's going to no-op if the record doesn't have the geo_loc_name field (or whatever --location-field is set to); did the non-GenBank records have that field?
Ah, I don't think it did, but we will be enforcing that all records have the same field with #1510. Then I'd think the non-GenBank records should just have an empty geo_loc_name
field so yeah no-op with some warning messages.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tracking in #1511
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep, just grabbed that. probably won't get to it until tomorrow tho.
Per discussion on #1508 and #1511, the field standardization across records (cf #1510) makes the need to verify a `database` field less important — essentially, if there's a `geo_loc_name` field (or a field with the name given in the `--location-field` argument), parse it. Otherwise, warn that it's not found.
Description of proposed changes
This ports
translate-genbank-location
from nextstrain/ingest to theaugur curate
sub-commandparse-genbank-location
.It adds a command-line flag for providing the field name the location data is stored in; this flag defaults to
geo_loc_name
to support NCBI's recent change. I provided this to enable the command to also parse older data with thecountry
field name.Also adds type-hints and tests.
Related issue(s)
augur curate
ingest#43Checklist