Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

filter: support filtering of ambiguous dates #602

Closed
huddlej opened this issue Aug 12, 2020 · 9 comments · Fixed by #623
Closed

filter: support filtering of ambiguous dates #602

huddlej opened this issue Aug 12, 2020 · 9 comments · Fixed by #623
Assignees
Labels
easy problem Requires less work than most issues enhancement New feature or request good first issue A relatively isolated issue appropriate for first-time contributors priority: low To be resolved after high and moderate priority issues

Comments

@huddlej
Copy link
Contributor

huddlej commented Aug 12, 2020

Context
The augur filter command currently supports filtering by min and max date, but it does so by converting any potentially ambiguous dates to reasonable date ranges before applying the filter. A common use case with augur is to exclude these ambiguous dates completely from an analysis (c.f., ncov and seasonal flu workflows).

Ambiguous dates can cause more harm than good if the missing date information inferred by TreeTime is unrealistic (far in the future from other closely related strains). Additionally, maintaining a list of dates to exclude by "XX" per month (as in the ncov workflow) is laborious and error-prone.

Description
I propose a new --exclude-ambiguous-dates flag for augur filter that will exclude records with ambiguous dates before the min/max date filters are applied. The statistics output from augur filter should clearly indicate how many records were excluded for this reason.

@huddlej huddlej added the enhancement New feature or request label Aug 12, 2020
@huddlej huddlej added easy problem Requires less work than most issues good first issue A relatively isolated issue appropriate for first-time contributors priority: low To be resolved after high and moderate priority issues labels Aug 12, 2020
@jtboing
Copy link

jtboing commented Aug 19, 2020

Hello. May I ask what constitutes ambiguous dates? Are these just dates with missing values or just wrongly formatted dates?

@jameshadfield
Copy link
Member

👍 I'd be in favor of exposing arguments here, e.g. --exclude-ambiguous-dates months or --exclude-ambiguous-dates days as often I want to avoid the former and include the latter, depending on the dataset.

@emmahodcroft
Copy link
Member

emmahodcroft commented Aug 20, 2020

@jtboing What we normally mean with ambiguous dates are dates like the following (if a 'correct' date is like this: 2020-05-30):

  • 20XX-XX-XX
  • 2020-XX-XX
  • 2020-06-XX

These are all format we accept, and will 100% work in the pipeline (we'll try to infer such dates). For many organisms we really need to include them, as we'd have almost no sequences without them! But for some analyses (like SARS-CoV-2, we do not want to include them.

To expand on what @jameshadfield says above a full range of options for --exclude-ambiguous-dates might be all (exclude any date that are ambiguous in any way) day (only those 2020-06-XX), and month (only those 2020-XX-XX - this would implicitly include day as well, as we really never have sequences where we have day but not month). Alternatively all could be called year, as they're essentially the same, but I think all is a bit more intuitive.

I do not believe we handle incorrectly formatted dates - not 100% sure what happens but I think we do expect them to be in YYYY-MM-DD format. However, it might be worth checking if we support the 'alternative' to our 'XX' format:

  • 2020
  • 2020-06
    And if we want to implicitly convert these to 'ambiguous' to then exclude (if wanted). Or if we prefer not to risk ourselves here and just throw these out (maybe we don't want to assume any 4-digit number is a year?). Would be good to check existing behaviour!

@saikirankv
Copy link
Contributor

@huddlej @emmahodcroft can I take this issue?

@emmahodcroft
Copy link
Member

Yes! I don't believe there's been any progress on it but it would be great to address!

@huddlej
Copy link
Contributor Author

huddlej commented Oct 27, 2020

Thank you for volunteering to work on this issue, @saikirankv! We have not discussed this further as a team since @emmahodcroft's last comment, but the solution she describes would be a great one to implement.

This would also be a great opportunity to do some test-driven development by adding appropriate tests to tests/test_filter.py for the different scenarios @emmahodcroft describes above and then adding the code to make them pass. The filter module is somewhat monolithic and confusing to modify, so having tests defined in advance should help give you some confidence that you're achieving what you want with the code you write.

You may want to check out our development docs sections on writing tests and running tests. Running just the tests you're interested in with the -k flag can speed up the test-driven development process a lot.

@saikirankv
Copy link
Contributor

@emmahodcroft @huddlej how can I get sample data files?

@saikirankv
Copy link
Contributor

nvm, I found the test file, started working on it.

@saikirankv
Copy link
Contributor

@emmahodcroft @huddlej can you review the PR, I'll add tests for them. I made changes to get_numerical_dates util method.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
easy problem Requires less work than most issues enhancement New feature or request good first issue A relatively isolated issue appropriate for first-time contributors priority: low To be resolved after high and moderate priority issues
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants