Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Curate titlecase #1197

Merged
merged 5 commits into from
Jul 28, 2023
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
7 changes: 7 additions & 0 deletions CHANGES.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,6 +2,13 @@

## __NEXT__

### Features

* Adds a new sub-command augur curate titlecase. The titlecase command is intended to apply titlecase to string fields in a metadata record (e.g. BRAINE-LE-COMTE, FRANCE -> Braine-le-Comte, France). Previously, this was available in the transform-string-fields script within the monkeypox repo.
[#1197][] (@j23414 and @joverlee521)

[#1197]: https://github.com/nextstrain/augur/pull/1197

### Bug fixes

* export v2: Previously, when `strain` was not used as the metadata ID column, node attributes might have gone missing from the final Auspice JSON. This has been fixed. [#1260][], [#1262][] (@victorlin, @joverlee521)
Expand Down
3 changes: 2 additions & 1 deletion augur/curate/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -12,14 +12,15 @@
from augur.io.metadata import DEFAULT_DELIMITERS, InvalidDelimiter, read_table_to_dict, read_metadata_with_sequences, write_records_to_tsv
from augur.io.sequences import write_records_to_fasta
from augur.types import DataErrorMethod
from . import format_dates, normalize_strings, passthru
from . import format_dates, normalize_strings, passthru, titlecase


SUBCOMMAND_ATTRIBUTE = '_curate_subcommand'
SUBCOMMANDS = [
passthru,
normalize_strings,
format_dates,
titlecase,
]


Expand Down
129 changes: 129 additions & 0 deletions augur/curate/titlecase.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,129 @@
"""
Applies titlecase to string fields in a metadata record
"""
import re
from typing import Optional, Set, Union

from augur.errors import AugurError
from augur.io.print import print_err
from augur.types import DataErrorMethod

def register_parser(parent_subparsers):
parser = parent_subparsers.add_parser("titlecase",
parents = [parent_subparsers.shared_parser],
help = __doc__)

required = parser.add_argument_group(title="REQUIRED")
required.add_argument("--titlecase-fields", nargs="*",
help="List of fields to convert to titlecase.", required=True)

optional = parser.add_argument_group(title="OPTIONAL")
optional.add_argument("--articles", nargs="*",
help="List of articles that should not be converted to titlecase.")
optional.add_argument("--abbreviations", nargs="*",
help="List of abbreviations that should not be converted to titlecase, keeps uppercase.")

optional.add_argument("--failure-reporting",
type=DataErrorMethod.argtype,
choices=[ method for method in DataErrorMethod ],
default=DataErrorMethod.ERROR_FIRST,
help="How should failed titlecase formatting be reported.")
return parser


def titlecase(text: Union[str, None], articles: Set[str] = set(), abbreviations: Set[str] = set()) -> Optional[str]:
"""
Originally from nextstrain/ncov-ingest

Returns a title cased location name from the given location name
*tokens*. Ensures that no tokens contained in the *whitelist_tokens* are
converted to title case.

>>> articles = {'a', 'and', 'of', 'the', 'le'}
>>> abbreviations = {'USA', 'DC'}

>>> titlecase("the night OF THE LIVING DEAD", articles)
'The Night of the Living Dead'

>>> titlecase("BRAINE-LE-COMTE, FRANCE", articles)
'Braine-le-Comte, France'

>>> titlecase("auvergne-RHÔNE-alpes", articles)
'Auvergne-Rhône-Alpes'

>>> titlecase("washington DC, usa", articles, abbreviations)
'Washington DC, USA'
"""
if not isinstance(text, str):
return None

words = enumerate(re.split(r'\b', text))

def changecase(index, word):
casefold = word.casefold()
upper = word.upper()

if upper in abbreviations:
return upper
elif casefold in articles and index != 1:
return word.lower()
else:
return word.title()

return ''.join(changecase(i, w) for i, w in words)


def run(args, records):
failures = []
failure_reporting = args.failure_reporting

articles = set()
if args.articles:
articles = set(args.articles)

abbreviations = set()
if args.abbreviations:
abbreviations = set(args.abbreviations)

for index, record in enumerate(records):
record = record.copy()
j23414 marked this conversation as resolved.
Show resolved Hide resolved
record_id = index

for field in args.titlecase_fields:
# Ignore non-existent fields but could change this to warnings if desired
if field not in record:
continue
elif record[field] is None:
continue

titlecased_string = titlecase(record[field], articles, abbreviations)

failure_message = f"Failed to titlecase {field!r}:{record.get(field)!r} in record {record_id!r} because the value is a {type(record.get(field)).__name__!r} and is not a string."
if titlecased_string is None:
if failure_reporting is DataErrorMethod.ERROR_FIRST:
raise AugurError(failure_message)
if failure_reporting is DataErrorMethod.ERROR_ALL:
print_err(f"ERROR: {failure_message}")
if failure_reporting is DataErrorMethod.WARN:
print_err(f"WARNING: {failure_message}")

# Keep track of failures for final summary
failures.append((record_id, field, record.get(field)))
else:
record[field] = titlecased_string

yield record

if failure_reporting is not DataErrorMethod.SILENT and failures:
failure_message = (
"Unable to change to titlecase for the following (record, field, field value):\n" + \
'\n'.join(map(repr, failures))
)
if failure_reporting is DataErrorMethod.ERROR_ALL:
raise AugurError(failure_message)

elif failure_reporting is DataErrorMethod.WARN:
print_err(f"WARNING: {failure_message}")

else:
raise ValueError(f"Encountered unhandled failure reporting method: {failure_reporting!r}")
1 change: 1 addition & 0 deletions docs/usage/cli/curate/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -17,6 +17,7 @@ We will continue to add more subcommands as we identify other common data curati
:maxdepth: 1

normalize-strings
titlecase
format-dates
passthru

9 changes: 9 additions & 0 deletions docs/usage/cli/curate/titlecase.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,9 @@
=================
titlecase
=================

.. argparse::
:module: augur
:func: make_parser
:prog: augur
:path: curate titlecase
99 changes: 99 additions & 0 deletions tests/functional/curate/cram/titlecase.t
joverlee521 marked this conversation as resolved.
Show resolved Hide resolved
Original file line number Diff line number Diff line change
@@ -0,0 +1,99 @@
Setup

$ pushd "$TESTDIR" > /dev/null
$ export AUGUR="${AUGUR:-../../../../bin/augur}"


Test output with articles and a mixture of lower and uppercase letters.

$ echo '{"title": "the night OF THE LIVING DEAD"}' \
> | ${AUGUR} curate titlecase --titlecase-fields "title" --articles "a" "and" "of" "the" "le"
{"title": "The Night of the Living Dead"}

Test output with hyphenated location.

$ echo '{"location": "BRAINE-LE-COMTE, FRANCE"}' \
> | ${AUGUR} curate titlecase --titlecase-fields "location" --articles "a" "and" "of" "the" "le"
{"location": "Braine-le-Comte, France"}

Test output with unicode characters

$ echo '{"location": "Auvergne-Rhône-Alpes" }' \
> | ${AUGUR} curate titlecase --titlecase-fields "location"
{"location": "Auvergne-Rh\u00f4ne-Alpes"}

Test output with abbreviations

$ echo '{"city": "Washington DC, USA"}' \
> | ${AUGUR} curate titlecase --titlecase-fields "city" --abbreviations "USA" "DC"
{"city": "Washington DC, USA"}

Test output with numbers

$ echo '{"title": "2021 SARS-CoV"}' \
> | ${AUGUR} curate titlecase --titlecase-fields "title" --abbreviations "SARS"
{"title": "2021 SARS-Cov"}

joverlee521 marked this conversation as resolved.
Show resolved Hide resolved
Test output with only numbers

$ echo '{"int": "2021", "float": "2021.10", "address": "2021.20.30" }' \
> | ${AUGUR} curate titlecase --titlecase-fields "int" "float" "address"
{"int": "2021", "float": "2021.10", "address": "2021.20.30"}

Test case that passes on empty or null values

$ echo '{"empty": "", "null_entry":null }' \
> | ${AUGUR} curate titlecase --titlecase-fields "empty" "null_entry"
{"empty": "", "null_entry": null}


Test case that fails on a non-string int

$ echo '{"bare_int": 2021}' \
> | ${AUGUR} curate titlecase --titlecase-fields "bare_int"
ERROR: Failed to titlecase 'bare_int':2021 in record 0 because the value is a 'int' and is not a string.
[2]

Test case that fails on complex types (e.g. arrays)

$ echo '{"an_array": ["hello", "world"]}' \
> | ${AUGUR} curate titlecase --titlecase-fields "an_array"
ERROR: Failed to titlecase 'an_array':['hello', 'world'] in record 0 because the value is a 'list' and is not a string.
[2]

Test cases when fields do not exist, decide if this should error out and may affect ingest pipelines

$ echo '{"region":"europe", "country":"france" }' \
> | ${AUGUR} curate titlecase --titlecase-fields "region" "country" "division" "location" "not exist"
{"region": "Europe", "country": "France"}

Test output with non-string value input with `ERROR_ALL` failure reporting.
This reports a collection of all titlecase failures which is especially beneficial for automated pipelines.

$ echo '{"bare_int": 2021, "bare_float": 1.2}' \
> | ${AUGUR} curate titlecase --titlecase-fields "bare_int" "bare_float" \
> --failure-reporting "error_all" 1> /dev/null
ERROR: Failed to titlecase 'bare_int':2021 in record 0 because the value is a 'int' and is not a string.
ERROR: Failed to titlecase 'bare_float':1.2 in record 0 because the value is a 'float' and is not a string.
ERROR: Unable to change to titlecase for the following (record, field, field value):
(0, 'bare_int', 2021)
(0, 'bare_float', 1.2)
[2]

Test warning on failures such as when encountering a non-string value.

$ echo '{"bare_int": 2021}' \
> | ${AUGUR} curate titlecase --titlecase-fields "bare_int" \
> --failure-reporting "warn"
WARNING: Failed to titlecase 'bare_int':2021 in record 0 because the value is a 'int' and is not a string.
WARNING: Unable to change to titlecase for the following (record, field, field value):
(0, 'bare_int', 2021)
{"bare_int": 2021}

Test silencing on failures such as when encountering a non-string value

$ echo '{"bare_int": 2021}' \
> | ${AUGUR} curate titlecase --titlecase-fields "bare_int" \
> --failure-reporting "silent"
{"bare_int": 2021}

Loading