Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Port translate-genbank-location to augur curate parse-genbank-location [#1485] #1508

Merged
merged 3 commits into from
Jul 1, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion CHANGES.md
Original file line number Diff line number Diff line change
Expand Up @@ -8,7 +8,7 @@
* Added a default color for the "Asia" region that will be used in `augur export` is no custom colors are provided. [#1490][] (@joverlee521)
* Added a new sub-command `augur curate apply-record-annotations` to apply user curated annotations to existing fields in a metadata file. Previously, this was available as a `merge-user-metadata` in the nextstrain/ingest repo. [#1495][] (@joverlee521)
* Added a new sub-command `augur curate abbreviate-authors` to abbreviate lists of authors to "<first author> et al." Previously, this was avaliable as the `transform-authors` script within the nextstrain/ingest repo. [#1483][] (@genehack)

* Added a new sub-command `augur curate parse-genbank-location` to parse the `geo_loc_name` field from GenBank reconds. Previously, this was available as the `translate-genbank-location` script within the nextstrain/ingest repo. [#1485][] (@genehack)
### Bug Fixes

* filter: Improve speed of checking duplicates in metadata, especially for large files. [#1466][] (@victorlin)
Expand Down
3 changes: 2 additions & 1 deletion augur/curate/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -12,7 +12,7 @@
from augur.io.metadata import DEFAULT_DELIMITERS, InvalidDelimiter, read_table_to_dict, read_metadata_with_sequences, write_records_to_tsv
from augur.io.sequences import write_records_to_fasta
from augur.types import DataErrorMethod
from . import format_dates, normalize_strings, passthru, titlecase, apply_geolocation_rules, apply_record_annotations, abbreviate_authors
from . import format_dates, normalize_strings, passthru, titlecase, apply_geolocation_rules, apply_record_annotations, abbreviate_authors, parse_genbank_location


SUBCOMMAND_ATTRIBUTE = '_curate_subcommand'
Expand All @@ -24,6 +24,7 @@
apply_geolocation_rules,
apply_record_annotations,
abbreviate_authors,
parse_genbank_location,
]


Expand Down
86 changes: 86 additions & 0 deletions augur/curate/parse_genbank_location.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,86 @@
"""
Parses GenBank's 'location' field of the NDJSON record to 3 separate
fields: 'country', 'division', and 'location'.

Checks that a record is from GenBank by verifying that the 'database'
field has a value of "GenBank" or "RefSeq".
"""

import argparse
from typing import Generator, List
from augur.io.print import print_err
from augur.utils import first_line


def parse_location(
record: dict,
location_field_name: str,
) -> dict:
# Expected pattern for the location field is
# "<country_value>[:<region>][, <locality>]"
#
# See GenBank docs for their "country" field:
# https://www.ncbi.nlm.nih.gov/genbank/collab/country/
location_field = record.get(location_field_name, "")
if not location_field:
print_err(
f"`parse-genbank-location` requires a `{location_field_name}` field; this record does not have one.",
)
# bail early because we're not gonna make any changes
return record

geographic_data = location_field.split(":")

country = geographic_data[0]
division = ""
location = ""

if len(geographic_data) == 2:
division, _, location = geographic_data[1].partition(",")

record["country"] = country.strip()
record["division"] = division.strip()
record["location"] = location.strip()

return record


def register_parser(
parent_subparsers: argparse._SubParsersAction,
) -> argparse._SubParsersAction:
parser = parent_subparsers.add_parser(
"parse-genbank-location",
parents=[parent_subparsers.shared_parser], # type: ignore
help=first_line(__doc__),
)

parser.add_argument(
"--location-field",
default="geo_loc_name",
help="The field containing the location, "
+ "in the format `<geo_loc_name>[:<region>][, <locality>]`",
)

return parser


def run(
args: argparse.Namespace,
records: List[dict],
) -> Generator[dict, None, None]:
for record in records:
database = record.get("database", "")
if database in {"GenBank", "RefSeq"}:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

non-blocking, for discussion only

A part of me just wants to remove the database check and let people use the command at their own discretion...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean, what's the downside to that?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was originally written for mpox, where at one point we had a mix of data sources going through the same curation pipeline. This database check allowed the script to skip records that were not GenBank records.

So I guess the downside is people would not be able to use the same curation pipeline for multiple data sources...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean, it's going to no-op if the record doesn't have the geo_loc_name field (or whatever --location-field is set to); did the non-GenBank records have that field? Seems to me like duck-typing on the field you're operating on is a better check than keying off a second field, but I am but an egg here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's going to no-op if the record doesn't have the geo_loc_name field (or whatever --location-field is set to); did the non-GenBank records have that field?

Ah, I don't think it did, but we will be enforcing that all records have the same field with #1510. Then I'd think the non-GenBank records should just have an empty geo_loc_name field so yeah no-op with some warning messages.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tracking in #1511

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, just grabbed that. probably won't get to it until tomorrow tho.

parse_location(
record,
args.location_field,
)
else:
if database:
error_msg = f"""Database value of {database} not supported for `transform-genbank-location`; must be "GenBank" or "RefSeq"."""
else:
error_msg = "Record must contain `database` field to use `transform-genbank-location.`"

print_err(error_msg)

yield record
Original file line number Diff line number Diff line change
@@ -0,0 +1,33 @@
Setup

$ export AUGUR="${AUGUR:-$TESTDIR/../../../../../bin/augur}"

Running the command with no arguments produces the expected output

$ echo '{"geo_loc_name":"Canada:Vancouver", "database":"GenBank"}' \
> | ${AUGUR} curate parse-genbank-location
{"geo_loc_name": "Canada:Vancouver", "database": "GenBank", "country": "Canada", "division": "Vancouver", "location": ""}

`--location-field` can be used to specify a different field name

$ echo '{"location":"Canada:Vancouver", "database":"GenBank"}' \
> | ${AUGUR} curate parse-genbank-location --location-field location
{"location": "", "database": "GenBank", "country": "Canada", "division": "Vancouver"}

`RefSeq` works as a database value

$ echo '{"geo_loc_name":"Canada:Vancouver", "database":"RefSeq"}' \
> | ${AUGUR} curate parse-genbank-location
{"geo_loc_name": "Canada:Vancouver", "database": "RefSeq", "country": "Canada", "division": "Vancouver", "location": ""}

Record with only a `country` value results in expected output

$ echo '{"geo_loc_name":"Canada", "database":"GenBank"}' \
> | ${AUGUR} curate parse-genbank-location
{"geo_loc_name": "Canada", "database": "GenBank", "country": "Canada", "division": "", "location": ""}

Record with `country`, `region`, and `location` values results in expected output

$ echo '{"geo_loc_name":"France:Cote d'\''Azur, Antibes", "database":"GenBank"}' \
> | ${AUGUR} curate parse-genbank-location
{"geo_loc_name": "France:Cote d'Azur, Antibes", "database": "GenBank", "country": "France", "division": "Cote d'Azur", "location": "Antibes"}
31 changes: 31 additions & 0 deletions tests/functional/curate/cram/parse-genbank-location/errors.t
Original file line number Diff line number Diff line change
@@ -0,0 +1,31 @@
Setup

$ export AUGUR="${AUGUR:-$TESTDIR/../../../../../bin/augur}"

Records without a `database` field result in the expected warning

$ echo '{"geo_loc_name":"Canada:Vancouver"}' \
> | ${AUGUR} curate parse-genbank-location
Record must contain `database` field to use `transform-genbank-location.`
{"geo_loc_name": "Canada:Vancouver"}

Records with a `database` field with an unsupported value result in the expected warning

$ echo '{"geo_loc_name":"Canada:Vancouver", "database":"database"}' \
> | ${AUGUR} curate parse-genbank-location
Database value of database not supported for `transform-genbank-location`; must be "GenBank" or "RefSeq".
{"geo_loc_name": "Canada:Vancouver", "database": "database"}

Records without a `location` field result in the expected warning

$ echo '{"database":"GenBank"}' \
> | ${AUGUR} curate parse-genbank-location
`parse-genbank-location` requires a `geo_loc_name` field; this record does not have one.
{"database": "GenBank"}

Records without a `location` field and a custom `--location-field` result in the expected warning

$ echo '{"database":"GenBank"}' \
> | ${AUGUR} curate parse-genbank-location --location-field location
`parse-genbank-location` requires a `location` field; this record does not have one.
{"database": "GenBank"}
Loading