Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Break utils.ambiuous_date_to_date_range to new class #532

Merged

Conversation

elebow
Copy link
Contributor

@elebow elebow commented Apr 8, 2020

Description of proposed changes

Rewrite utils.ambiguous_date_to_date_range as a new class.

This slightly expands the function's capabilities: any X character anywhere in the string will be disambiguated, rather than just at the least significant positions. It should be fully backwards-compatible.

Related issue(s)

none

Testing

Unit tests included.

@elebow elebow force-pushed the break-date-disambiguaution-out-to-new-class branch from 74599e4 to 72086e7 Compare April 9, 2020 03:46
@codecov
Copy link

codecov bot commented Apr 9, 2020

Codecov Report

Merging #532 into master will increase coverage by 0.46%.
The diff coverage is 94.11%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #532      +/-   ##
==========================================
+ Coverage   20.91%   21.38%   +0.46%     
==========================================
  Files          31       32       +1     
  Lines        5136     5159      +23     
  Branches     1302     1303       +1     
==========================================
+ Hits         1074     1103      +29     
+ Misses       4010     4004       -6     
  Partials       52       52              
Impacted Files Coverage Δ
augur/util_support/date_disambiguator.py 93.75% <93.75%> (ø)
augur/utils.py 28.08% <100.00%> (-1.94%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f50e8f0...4f6f238. Read the comment docs.

@tsibley
Copy link
Member

tsibley commented Apr 9, 2020

👍 for rewriting this!

This slightly expands the function's capabilities: any X character anywhere in the string will be disambiguated, rather than just at the least significant positions. It should be fully backwards-compatible.

However, I'm not sure that this behaviour expansion is desirable. 20X0 or 2020-XX-05 both seem to me more likely to be bogus data, not accurately vague time period data. I think folks doing analyses with Augur would be less surprised if 2020-XX-05 (for example) was rejected instead of taken at face value. That is, it seems more likely to be a data entry/curation error (perhaps intended to be 2020-05-XX) than a real data point considering the meaning of "The 5th of some month in 2020." vs. "May 2020". Consider also that ISO 8601 allows YYYY and YYYY-MM but not YYYY--DD. :-)

@emmahodcroft
Copy link
Member

I agree with @tsibley on this!

@elebow
Copy link
Contributor Author

elebow commented Apr 9, 2020

Makes sense. I'll continue the legacy behavior here and open a followup issue to discuss possible enhancement.

@elebow
Copy link
Contributor Author

elebow commented Apr 10, 2020

Actually, I misread the legacy implementation when I wrote that note about expanding the functionality—Augur already behaves this way (except for a detail regarding the year).

# on current master branch

from augur.utils import ambiguous_date_to_date_range

>>> ambiguous_date_to_date_range("2000-01-01", "%Y-%m-%d")
(datetime.date(2000, 1, 1), datetime.date(2000, 1, 1))

>>> ambiguous_date_to_date_range("2000-01-XX", "%Y-%m-%d")
(datetime.date(2000, 1, 1), datetime.date(2000, 1, 31))

>>> ambiguous_date_to_date_range("2000-XX-05", "%Y-%m-%d")
(datetime.date(2000, 1, 5), datetime.date(2000, 12, 5))


# when year is uncertain, it falls back to the final (optional) argument:

>>> ambiguous_date_to_date_range("20XX-03-05", "%Y-%m-%d")
(None, None)

>>> ambiguous_date_to_date_range("20XX-03-05", "%Y-%m-%d", min_max_year=(1992, 2003))
(datetime.date(1992, 3, 5), datetime.date(2003, 3, 5))

I suppose there are three paths forward:

  1. Keep the PR as-is. Disambiguate any X character, regardless of place.

  2. Imitate the legacy behavior exactly. Disambiguate any XX string regardless of place, or use the third argument if the year component contains any uncertainty.

  3. Do something new and conservative. Perhaps require any uncertain digits to be followed by only uncertain digits in less significant places:

    • 2000-01-XX – valid
    • 2000-XX-XX – valid
    • 2000-XX-01 – not valid—constrained digit in a less significant place than an ambiguous digit (assuming %Y-%m-%d)
    • 2000-01-0X – could be valid

@tsibley
Copy link
Member

tsibley commented May 5, 2020

@elebow Oh, weird. Nice digging! I would prefer (3) and would consider it a kind of bug fix, but (2) might be more called for so that we don't break compat.

@rneher
Copy link
Member

rneher commented May 31, 2020

Thanks for this! I agree with @tsibley that behavior 3 is preferable.

This slightly expands the function's capabilities: any "X" character
anywhere in the string will be disambiguated, rather than just at the
least significant positions.
augur.utils.ambiguous_date_to_date_range resolves ambiguous dates into
tuples of the min and max possible dates. Uncertainty is conveyed by
using the "X" character in the date pattern.

But it usually only makes sense for uncertainty to be in the
least-significant places. For example, assuming %Y-%m-%d, a pattern of
2000-XX-05 would mean the 5th day of any month in 2000. This is almost
certainly erroneous data and undesired behavior.

Therefore, DateDisambiguator now raises an error when a constrained
digit appears in a less-significant place than an uncertain digit.
Ongoing refactoring efforts have been trying to introduce new packages.
All packages must be listed in the packages argument to
`setuptools.setup()`.

`setuptools.find_packages()` provides a convenient way to list all
packages without manually updating a list.
@elebow elebow force-pushed the break-date-disambiguaution-out-to-new-class branch from b7bb8f0 to 4f6f238 Compare June 7, 2020 02:38
@elebow
Copy link
Contributor Author

elebow commented Jun 7, 2020

Added a check that raises a ValueError when a constrained digit appears in a less-significant place than an uncertain digit. (This is the same error class raised for malformed dates like 2000-02-02-02)

Also rebased on latest master.

Note that this still includes the cherry-picked commit from #527.

@rneher
Copy link
Member

rneher commented Jun 16, 2020

this looks good to me. @tsibley, do you have additional comments?

@tsibley
Copy link
Member

tsibley commented Jun 18, 2020

LGTM!

@rneher rneher merged commit 6548568 into nextstrain:master Jun 19, 2020
@elebow elebow deleted the break-date-disambiguaution-out-to-new-class branch July 4, 2020 04:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants