Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Port translate-genbank-location to augur curate parse-genbank-location [#1485] #1508

Merged
merged 3 commits into from
Jul 1, 2024

Conversation

genehack
Copy link
Contributor

@genehack genehack commented Jul 1, 2024

Description of proposed changes

This ports translate-genbank-location from nextstrain/ingest to the augur curate sub-command parse-genbank-location.

It adds a command-line flag for providing the field name the location data is stored in; this flag defaults to geo_loc_name to support NCBI's recent change. I provided this to enable the command to also parse older data with the country field name.

Also adds type-hints and tests.

Related issue(s)

Checklist

  • Checks pass
  • If making user-facing changes, add a message in CHANGES.md summarizing the changes in this PR

* Convert script over to expected sub-command style
* Add `--location-field` argument to preserve back-compat with older
  data files
* Add type hints throughout
* Add tests
@genehack genehack requested a review from a team July 1, 2024 19:23
Copy link

codecov bot commented Jul 1, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 69.53%. Comparing base (f6ee377) to head (ee3be71).
Report is 238 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1508      +/-   ##
==========================================
+ Coverage   69.40%   69.53%   +0.12%     
==========================================
  Files          72       73       +1     
  Lines        7755     7788      +33     
  Branches     1900     1905       +5     
==========================================
+ Hits         5382     5415      +33     
  Misses       2087     2087              
  Partials      286      286              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@genehack
Copy link
Contributor Author

genehack commented Jul 1, 2024

The one failing test is expected.

CHANGES.md Outdated Show resolved Hide resolved
) -> Generator[dict, None, None]:
for record in records:
database = record.get("database", "")
if database in {"GenBank", "RefSeq"}:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

non-blocking, for discussion only

A part of me just wants to remove the database check and let people use the command at their own discretion...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean, what's the downside to that?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was originally written for mpox, where at one point we had a mix of data sources going through the same curation pipeline. This database check allowed the script to skip records that were not GenBank records.

So I guess the downside is people would not be able to use the same curation pipeline for multiple data sources...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean, it's going to no-op if the record doesn't have the geo_loc_name field (or whatever --location-field is set to); did the non-GenBank records have that field? Seems to me like duck-typing on the field you're operating on is a better check than keying off a second field, but I am but an egg here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's going to no-op if the record doesn't have the geo_loc_name field (or whatever --location-field is set to); did the non-GenBank records have that field?

Ah, I don't think it did, but we will be enforcing that all records have the same field with #1510. Then I'd think the non-GenBank records should just have an empty geo_loc_name field so yeah no-op with some warning messages.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tracking in #1511

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, just grabbed that. probably won't get to it until tomorrow tho.

@genehack genehack merged commit a9d1b3a into master Jul 1, 2024
19 of 20 checks passed
@genehack genehack deleted the parse-gb-loc-1485 branch July 1, 2024 21:46
genehack added a commit that referenced this pull request Jul 2, 2024
Per discussion on #1508 and #1511, the field standardization across
records (cf #1510) makes the need to verify a `database` field less
important — essentially, if there's a `geo_loc_name` field (or a field
with the name given in the `--location-field` argument), parse it.
Otherwise, warn that it's not found.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants