Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

filter: Support relative dates for --min-date and --max-date #740

Merged
merged 5 commits into from
Apr 11, 2022

Conversation

benjaminotter
Copy link
Contributor

@benjaminotter benjaminotter commented Jun 24, 2021

Description of proposed changes

The relative dates are parsed by numeric_date which uses datetime.date.today() to translate the relative date to an absolute date.

Relative dates are positive duration values following the ISO 8601 duration syntax
e.g. --min-date 1Y2W5D for 1 year, 2 weeks and 5 days ago or --max-date 1D for yesterday

This also adds a package dependency isodate to parse the duration string.

Related issue(s)

Fixes #721

Testing

Added pytests.

@codecov
Copy link

codecov bot commented Jun 24, 2021

Codecov Report

Merging #740 (0dbb930) into master (688348b) will increase coverage by 0.70%.
The diff coverage is 100.00%.

❗ Current head 0dbb930 differs from pull request most recent head 29a5a65. Consider uploading reports for the commit 29a5a65 to get more accurate results

@@            Coverage Diff             @@
##           master     #740      +/-   ##
==========================================
+ Coverage   34.15%   34.86%   +0.70%     
==========================================
  Files          41       41              
  Lines        5979     6041      +62     
  Branches     1534     1552      +18     
==========================================
+ Hits         2042     2106      +64     
- Misses       3855     3862       +7     
+ Partials       82       73       -9     
Impacted Files Coverage Δ
augur/filter.py 71.60% <100.00%> (+3.32%) ⬆️
augur/utils.py 43.83% <0.00%> (+0.57%) ⬆️
augur/parse.py 65.81% <0.00%> (+3.31%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 688348b...29a5a65. Read the comment docs.

@huddlej
Copy link
Contributor

huddlej commented Jun 25, 2021

Thank you for catching this issue, @benjaminotter. Reading the ISO 8601 standard more closely (page 19), I see that we should provide a leading "P" to the duration specification like "P2Y" for 2 years. In this case, pandas fails to parse that interval reporting:

ValueError: Invalid ISO 8601 Duration format - P2Y

In this case, I believe pandas is wrong and the ISO spec is right, but that doesn't help us fix the problem. I'll have to think about this a bit more and look for alternative solutions.

@huddlej
Copy link
Contributor

huddlej commented Jun 25, 2021

For what it's worth, the isodate package has the correct behavior with the above example:

import isodate

>>> isodate.parse_duration("P2Y")
isodate.duration.Duration(0, 0, 0, years=2, months=0)

For some reason, I have this package installed already, but I don't know yet what other Nextstrain dependency installs it.

One annoying aspect of the ISO 8601 format is the requirement to provide a leading "P" for the date period values. For Augur, I don't see us needing to support the time period values of hours, minutes, and seconds, so maybe we can automatically prefix a user's given interval string with a "P" if they don't provide that as part of the string.

@benjaminotter
Copy link
Contributor Author

ISO8601 duration string parsing with isodate does the job, but leaves a Deprecation Warning:

DeprecationWarning: an integer is required (got type decimal.Decimal).  
Implicit conversion to integers using __int__ is deprecated, and may be 
removed in a future version of Python.
    newdt = other.replace(year=newyear, month=newmonth, day=newday)

Also, the isodate package was by default not installed on my local Nextstrain build.

I added a check to make sure the max-date-offset and min_date_offset strings start with a "P" to prevent failing when the user does not provide the prefix letter.

@huddlej
Copy link
Contributor

huddlej commented Dec 30, 2021

I wanted to revisit this issue, since we still need to add this date offsets feature for ncov and seasonal-flu workflows. I looked into several alternatives for representing date durations with and without support for ISO-8601 parsing including Arrow, Python’s built-in datetime timedelta, isodate, isoduration, and pandas. Below are some example outputs from each of these for durations from common use cases of 2 years, 6 months, and 4 weeks.

import arrow
import datetime
import isodate
import isoduration
import pandas as pd

# Examples of how to get time delta objects for:
# - 2 years
# - 6 months
# - 4 weeks

# datetime built-in only supports weeks or smaller units
# of time. Years and months have to be converted to the
# closest equivalent in weeks.
#
# 2 years
datetime.timedelta(weeks=52*2)
# datetime.timedelta(days=728)
# 6 months
datetime.timedelta(weeks=int(52*0.5))
# datetime.timedelta(days=182)
# 4 weeks
datetime.timedelta(weeks=4)
# datetime.timedelta(days=28)

# pandas
# 2 years
pd.Timedelta("-2Y")
# FutureWarning: Units 'M', 'Y' and 'y' do not represent unambiguous timedelta values and will be removed in a future version
# Timedelta('-731 days +12:21:36')
#
# pandas generally does not want to support these relative
# offset types. For more details see:
# https://github.com/pandas-dev/pandas/issues/16344
#
# 6 months
pd.Timedelta("-6M")
# FutureWarning: Units 'M', 'Y' and 'y' do not represent unambiguous timedelta values and will be removed in a future version
# Timedelta('-1 days +23:54:00')
# 4 weeks
pd.Timedelta("-4W")
# Timedelta('-28 days +00:00:00')

# arrow does not have a representation of a duration type,
# so we need to create a date object and then "shift" it.
d = arrow.now()
d
# <Arrow [2021-12-29T13:32:46.801604-08:00]>
#
# 2 years
d.shift(years=-2)
# <Arrow [2019-12-29T13:32:46.801604-08:00]>
# 6 months
d.shift(months=-6)
# <Arrow [2021-06-29T13:32:46.801604-07:00]>
# 4 weeks
d.shift(weeks=-4)
# <Arrow [2021-12-01T13:32:46.801604-08:00]>

# isodate was originally developed by one person and has
# had 18 total contributors. It has also been forked into
# a GitHub organization called "isodate" without a clear
# reason. It provides a `parse_duration` function that 
# returns a `Duration` instance that can be added to other
# standard datetime objects. Its parser tries to adhere to
# the ISO-8601 standard completely, such that "2Y" is not
# valid but "P2Y" is. This library only depends on Python's
# datetime library.
#
# 2 years
isodate.parse_duration("-P2Y")
# isodate.duration.Duration(0, 0, 0, years=-2, months=0)
# 6 months
isodate.parse_duration("-P6M")
# isodate.duration.Duration(0, 0, 0, years=0, months=-6)
# 4 weeks
isodate.parse_duration("-P4W")
# datetime.timedelta(days=-28)

# isoduration reimplements the logic from isodate using
# Python's datetime library and the third-party arrow library.
# isoduration claims to address limitations of isodate. Its 
# `parse_duration` function returns a `Duration` instance.
#
# 2 years
isoduration.parse_duration("-P2Y")
# Duration(DateDuration(years=Decimal('-2'), months=Decimal('0'), days=Decimal('0'), weeks=Decimal('0')), TimeDuration(hours=Decimal('0'), minutes=Decimal('0'), seconds=Decimal('0')))
# 6 months
isoduration.parse_duration("-P6M")
# Duration(DateDuration(years=Decimal('0'), months=Decimal('-6'), days=Decimal('0'), weeks=Decimal('0')), TimeDuration(hours=Decimal('0'), minutes=Decimal('0'), seconds=Decimal('0')))
# 4 weeks
isoduration.parse_duration("-P4W")
# Duration(DateDuration(years=Decimal('0'), months=Decimal('0'), days=Decimal('0'), weeks=Decimal('-4')), TimeDuration(hours=Decimal('0'), minutes=Decimal('0'), seconds=Decimal('0')))

If we implemented our own minimal duration parsing for Augur to support year, month, week, and day offsets, we would still likely need to implement much of the algorithm to add durations to datetimes that isodate already implements. There is also always the option of straight-up copying the isodate logic into Augur and using the bits we need.

Ultimately, we want to support an interface like this:

# Select strains collected no earlier than 2 months ago and no later than 1 day ago (for SARS-CoV-2 analysis).
augur filter --min-date-offset "2M" --max-date-offset "1D"

# Select strains collected no earlier than 2 years ago (for influenza A/H3N2 analysis).
augur filter --min-date-offset "2Y"

We don’t especially care about the full ISO-8601 duration implementation (e.g., a duration like “-P2Y3M4D” is unlikely to be useful). Given this simpler interface we need, we could implement durations with datetime.timedelta using built-in support for weeks and days and defining our own approximations of years and months. This would get messy, though, and produce odd results for larger durations like 12 years or 11 months where the mismatch between number of weeks in a year or month could add up (52 weeks = 364 days in Python’s timedelta logic!). For example:

now = datetime.datetime.now()
# datetime.datetime(2021, 12, 29, 13, 51, 52, 541769)

# Subtract 1 year in weeks.
now - datetime.timedelta(weeks=52)
# datetime.datetime(2020, 12, 30, 13, 51, 52, 541769)
# Result is off by 1 day.

# Subtract 2 years in weeks.
now - datetime.timedelta(weeks=104)
# datetime.datetime(2020, 1, 1, 13, 51, 52, 541769)
# Result is off by 3 days.

# Subtract 12 years in weeks.
now - datetime.timedelta(weeks=12*52)
# datetime.datetime(2010, 1, 13, 13, 51, 52, 541769)
# Result is off by 2 weeks!

Maybe we don’t care that the results are so far off, though? In the course of 12 years, what is 2 weeks in error for the resulting analysis?

After playing with these different implementations a bit, I’m leaning toward adding isodate as a dependency and rebasing the code from this into the new filter interface.

@victorlin
Copy link
Member

@huddlej if using isodate/datetime, this won't work with negative dates, for example in this discussion post. Would that be appropriate if documented as a feature gap, or should we rethink implementation for this edge case?

@huddlej
Copy link
Contributor

huddlej commented Jan 13, 2022

@victorlin Good point. I'm still not convinced we need to treat negative dates as a first-class use case (as I've mentioned in our Slack discussion). This feature in particular supports the kind of filtering for recent samples we need in the seasonal flu and ncov workflows and that might be strange to implement in an ancient pathogen analysis.

That said, I don't think the proposed interface prevents us from adding support for negative dates in the future, if it becomes necessary. For example, you could run the following to filter to samples collected 10,000 years ago:

augur filter --min-date-offset "10000Y"

Then it would be on us to update the augur filter internals to support this logic.

If folks are generally on board with the proposed UI, we can get a lot of value now from the isodate implementation.

@victorlin
Copy link
Member

I can look into updating this, since I'll need to incorporate it into #854.

@victorlin
Copy link
Member

There were merge conflicts with 4877de0 and 87ca73c. Force-pushing a rebase on latest master which puts the current changes in the right places. Still need to add testing.

@victorlin victorlin force-pushed the date-offsets branch 4 times, most recently from 428fdd6 to 413a13e Compare March 29, 2022 21:47
@victorlin victorlin marked this pull request as ready for review March 29, 2022 21:55
@trvrb
Copy link
Member

trvrb commented Mar 29, 2022

Really nice to have this implemented @victorlin. I believe you were working with --min-date-offset and --max-date-offset as initial proposal from @benjaminotter? I would suggest different syntax. If I didn't come in with context that these are meant as dates relative to today I wouldn't know what this refers to. My guess would have actually been that if you supply say --min-date of 2021-01-01 that then --min-date-offset of 1W would apply on top of 2021-01-01 resulting in 2021-01-08 (yes, forwards in time would be the expectation).

I don't know if this creates too much ambiguity in terms of parsing, but my preferred user behavior (or at least what seems the most obvious to me) is that there is only --min-date that can take 2021-01-01 to have minimal date set to Jan 1, 2021 or could take 1Y to set minimal date back 1 year from today for 2021-03-29.

Maximum dates would work similarly. You could have a pair that specifies --min-date of 2021-01-01 and --max-date of 1W to give a span from 2021-01-01 to 2022-03-22.

@victorlin
Copy link
Member

@trvrb: I believe you were working with --min-date-offset and --max-date-offset as initial proposal from @benjaminotter?

Yes, most of the implementation was already done, I just fixed the merge conflicts and added tests. I also found user experience a bit odd, in that it is a positive duration translating to a negative offset from date.today(). But positive offset from date.today() (a future time) wouldn't make sense so I figured this is fine, if we want to keep it as offsets from date.today() strictly.

So I think there's a question of if we want to support offsets from a certain date, for example:

  1. --min-date 2021-01-01 --min-date-offset 1W == --min-date 2021-01-08
  2. --min-date 2021-01-01 --min-date-offset -1W == --min-date 2020-12-25

If not (keeping offsets strictly from date.today()), then it might make more sense to overload --min-date to support ISO durations in addition to ISO dates, numeric dates, ambiguous dates, etc.

I can imagine most use cases would offset from date.today(). There's no point in hardcoding --min-date 2021-01-01 --min-date-offset 1W since that obviously translates to --min-date 2021-01-08. If we do want to support offsets from a certain date, it would be up to the user or encompassing workflow to fill in today for --min-date <today> --min-date-offset 1W.

@trvrb
Copy link
Member

trvrb commented Mar 29, 2022

If not (keeping offsets strictly from date.today()), then it might make more sense to overload --min-date to support ISO durations in addition to ISO dates, numeric dates, ambiguous dates, etc.

This would be my preference. The example I had above of --min-date of 2021-01-01 with --min-date-offset of 1W resulting in 2021-01-08 was my guess at how the command line interface with this naming would behave, not desired behavior.

There's no point in hardcoding --min-date 2021-01-01 --min-date-offset 1W since that obviously translates to --min-date 2021-01-08.

I agree. I can't see myself as a user ever wanting to do something like this.

I think "offsets" are just fine only subtracting from date.today() and overloading --min-date and --max-date to accomplish this.

@jameshadfield
Copy link
Member

I'll chime in with my 2 cents 😂

I think "offsets" are just fine only subtracting from date.today() and overloading --min-date and --max-date to accomplish this.

This is the behavior which makes sense to me. You could have a separate argument, which would be mutually exclusive with the corresponding date argument, but it's much nicer to overload them.

@huddlej
Copy link
Contributor

huddlej commented Mar 30, 2022

Thank you for the comments, all! The original UI I envisioned for this was the overloaded min/max date arguments. When it came time to implement that overloading, we ran into the minor issue that min/max date arguments are implemented in argparse as a custom numeric_date type that would need to be extended or replaced to support the overloading we'd want.

Replacing numeric_date felt like asking too much of Benjamin at the time (getting into the weeds of Augur's inconsistent date handling), when the goal of this PR was a quick feature prototype. If the consensus is for the overloaded min/max args (which it seems to be, so far), we just need to handle that numeric_date type update.

Regarding this issue:

If I didn't come in with context that these are meant as dates relative to today I wouldn't know what this refers to.

I don't think any of the proposed syntaxes so far address this potential user confusion. Even the overloaded min/max arguments don't explicitly communicate that the offsets are relative to the current date. The original plan to handle this by clearly describing the offsets in the help text. I also thought the separate offset args might help call out the different behavior from min/max date arguments, but maybe there's a better way...

@trvrb
Copy link
Member

trvrb commented Mar 30, 2022

Thanks for context here @huddlej. Very helpful.

I still think that --max-date 1W is pretty intuitive in wanting the maximum date to be 1 week from today, but I take your point.

The original plan to handle this by clearly describing the offsets in the help text. I also thought the separate offset args might help call out the different behavior from min/max date arguments, but maybe there's a better way...

I think I prefer overloading --min-date and --max-date, but it was really just the word "offset" in --min-date-offset that was throwing me above. If the separate commands were --min-date-from-today and --max-date-from-today I wouldn't have had this confusion.

@victorlin
Copy link
Member

For --max-date-from-today I still feel it might take a bit of intuition to realize that it's before today, not after. Ideally it should read something like "max date is 1 week ago". What about --max-date-ago 1W to imply temporal direction?

@trvrb
Copy link
Member

trvrb commented Mar 30, 2022

I see. In this case I'd do --max-date-before-today, but again I still prefer just overloading --max-date.

@trvrb
Copy link
Member

trvrb commented Mar 30, 2022

Also, as @huddlej notes here #721 (comment), we should plan to use the same interface for augur frequencies, which also currently takes --min-date and --max-date. I obviously wouldn't wrap changes to augur frequencies into this PR, but writing here as a reminder about this. For the planned ncov partitioning into 6m and all-time targets, it would be useful to have this interface also in place for augur frequencies.

@victorlin
Copy link
Member

Ok, I'll update PR to overload --min-date and --max-date.

augur/filter.py Outdated Show resolved Hide resolved
@victorlin victorlin force-pushed the date-offsets branch 2 times, most recently from 5a1ef13 to a462043 Compare April 6, 2022 22:27
benjaminotter and others added 3 commits April 6, 2022 15:45
The relative dates are parsed by `numeric_date` which uses datetime.date.today() to translate the relative date to an absolute date.

Relative dates are positive duration values following the ISO 8601 duration syntax
e.g. `--min-date 1Y2W5D` for 1 year, 2 weeks and 5 days ago or `--max-date 1D` for yesterday

This also adds a package dependency `isodate` to parse the duration string.
The previous change to support relative dates also refactored so that an invalid date does not raise any error, which is unwanted.
Prior to that change, there was one try/catch so treetime.utils.numeric_date would handle all these invalid dates.

Explicitly raising from augur.filter.numeric_date is one step in a better direction, though still not a good solution since argparse does not expose these errors.
More: https://stackoverflow.com/q/38340252
augur/filter.py Outdated Show resolved Hide resolved
- Expose the invalid date error message using argparse.ArgumentTypeError
- Split out the text describing supported dates into a constant variable, since it is used in 3 places now
tests/test_filter.py Outdated Show resolved Hide resolved
Copy link
Contributor

@joverlee521 joverlee521 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Thanks for addressing my sporadic comments!

@victorlin
Copy link
Member

Thanks for the review @joverlee521! Force-pushing a small test order fix then merging.

Previously, some tests for --max-date were not done for --min-date and vice-versa.
Added tests and re-ordered so that similar tests are adjacent.
@victorlin victorlin merged commit b464692 into master Apr 11, 2022
@victorlin victorlin deleted the date-offsets branch April 11, 2022 23:48
victorlin added a commit that referenced this pull request Apr 12, 2022
With #740, filter's numeric_date() provides support for 3 date formats.
However, supporting various date formats is not specific to filter (e.g. frequencies has a separate numeric_date() which is now out-of-sync with this filter's numeric_date()).

This commit:

1. Moves numeric_date() to a new submodule augur.utils.date_parsing
2. Moves the related SUPPORTED_DATE_HELP_TEXT to augur.utils.date_parsing
3. Updates numeric_date() to raise a TypeError rather than argparse.ArgumentTypeError so it can be generalized to non-argparse usage
4. Adds a new function numeric_date_type() which wraps numeric_date() and raises an argparse.ArgumentTypeError per #740 (comment)
victorlin added a commit that referenced this pull request Apr 12, 2022
With #740, filter's numeric_date() provides support for 3 date formats.
However, supporting various date formats is not specific to filter (e.g. frequencies has a separate numeric_date() which is now out-of-sync with this filter's numeric_date()).

This commit:

1. Moves numeric_date() to a new submodule augur.utils.date_parsing
2. Moves the related SUPPORTED_DATE_HELP_TEXT to augur.utils.date_parsing
3. Updates numeric_date() to raise a TypeError rather than argparse.ArgumentTypeError so it can be generalized to non-argparse usage
4. Adds a new function numeric_date_type() which wraps numeric_date() and raises an argparse.ArgumentTypeError per #740 (comment)
victorlin added a commit that referenced this pull request Apr 12, 2022
With #740, filter's numeric_date() provides support for 3 date formats.
However, supporting various date formats is not specific to filter (e.g. frequencies has a separate numeric_date() which is now out-of-sync with this filter's numeric_date()).

This commit:

1. Moves numeric_date() to a new submodule augur.utils.date_parsing
2. Moves the related SUPPORTED_DATE_HELP_TEXT to augur.utils.date_parsing
3. Updates numeric_date() to raise a TypeError rather than argparse.ArgumentTypeError so it can be generalized to non-argparse usage
4. Adds a new function numeric_date_type() which wraps numeric_date() and raises an argparse.ArgumentTypeError per #740 (comment)
victorlin added a commit that referenced this pull request Apr 15, 2022
With #740, filter's numeric_date() provides support for 3 date formats.
However, supporting various date formats is not specific to filter (e.g. frequencies has a separate numeric_date() which is now out-of-sync with this filter's numeric_date()).

This commit:

1. Moves numeric_date() to a new submodule augur.dates
2. Moves the related SUPPORTED_DATE_HELP_TEXT to augur.dates
3. Updates numeric_date() to raise a TypeError rather than argparse.ArgumentTypeError so it can be generalized to non-argparse usage
4. Adds a new function numeric_date_type() which wraps numeric_date() and raises an argparse.ArgumentTypeError per #740 (comment)
@victorlin victorlin added this to the Next release X.X.X milestone Apr 15, 2022
victorlin added a commit to nextstrain/bioconda-recipes that referenced this pull request Apr 15, 2022
BiocondaBot pushed a commit to bioconda/bioconda-recipes that referenced this pull request Apr 16, 2022
Merge PR #34344, commits were: 
 * Add isodate dependency to augur

Added in nextstrain/augur#740
 * Update augur to 15.0.0
@victorlin
Copy link
Member

I'm deleting an old branch filter-date-offsets which strays from master by 1 commit: 00ec319

It isn't associated with any closed PR and looks like an earlier implementation of what got merged in this PR, so I'm linking it here for posterity.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
easy problem Requires less work than most issues enhancement New feature or request priority: moderate To be resolved after high priority issues
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

filter: Support date offsets for min/max date thresholds
6 participants