-
Notifications
You must be signed in to change notification settings - Fork 129
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Set date
column to be string
#1235
Conversation
If all dates are year only dates, Pandas infers their type as int, which leads to unexpected error in filter.
Moving in preparation to set dtype for the column in `read_metadata` in following commit. This is also a constant value that should apply to all uses of metadata, not just within augur.filter.
Avoid Pandas type inference for the date column because it could contain incomplete year only dates that get inferred as int.
Codecov ReportPatch coverage:
Additional details and impacted files@@ Coverage Diff @@
## master #1235 +/- ##
=======================================
Coverage 68.87% 68.88%
=======================================
Files 64 64
Lines 6937 6939 +2
Branches 1693 1693
=======================================
+ Hits 4778 4780 +2
Misses 1854 1854
Partials 305 305
☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Test works as expected for me locally (failing then passing).
This suppresses the `DtypeWarnings` messages from pandas when it infers different dtypes for a column in the metadata. We do not need pandas to internally parse files in chunks since we already surface the `chunksize` parameter to control memory usage. This change was motivated by internal discussion on Slack about how these warning messages overwhelm the logs of the ncov builds and make debugging a pain.¹ I have seen surprising memory usage in the past with `low_memory=False` within ncov-ingest². However that was due to the unexpected interaction with the `usecols` parameter, where the entire file was read before being subset to the columns provided. In the future, we may want to explicitly set the dtype to `string` for all columns in the metadata as suggested by @tsibley in a separate PR.³ However, that will require wider changes throughout Augur where uses of the metadata may be expecting the inferred dtypes (such as in augur export⁴). ¹ https://bedfordlab.slack.com/archives/C0K3GS3J8/p1686671582331959?thread_ts=1685568402.393599&cid=C0K3GS3J8 ² nextstrain/ncov-ingest@7bde90a ³ #1235 (comment) ⁴ https://github.com/nextstrain/augur/blob/b61e3e7e969ff1b82fce5f2e2f388a10e6f3c305/augur/export_v2.py#L239-L245
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did a post-merge review, looks good 👍
This suppresses the `DtypeWarnings` messages from pandas when it infers different dtypes for a column in the metadata. We do not need pandas to internally parse files in chunks since we already surface the `chunksize` parameter to control memory usage. This change was motivated by internal discussion on Slack about how these warning messages overwhelm the logs of the ncov builds and make debugging a pain.¹ I have seen surprising memory usage in the past with `low_memory=False` within ncov-ingest². However that was due to the unexpected interaction with the `usecols` parameter, where the entire file was read before being subset to the columns provided. In the future, we may want to explicitly set the dtype to `string` for all columns in the metadata as suggested by @tsibley in a separate PR.³ However, that will require wider changes throughout Augur where uses of the metadata may be expecting the inferred dtypes (such as in augur export⁴). ¹ https://bedfordlab.slack.com/archives/C0K3GS3J8/p1686671582331959?thread_ts=1685568402.393599&cid=C0K3GS3J8 ² nextstrain/ncov-ingest@7bde90a ³ #1235 (comment) ⁴ https://github.com/nextstrain/augur/blob/b61e3e7e969ff1b82fce5f2e2f388a10e6f3c305/augur/export_v2.py#L239-L245
Description of proposed changes
Set the dtype for the
date
column to bestring
withinread_metadata
so that we don't run into unexpected errors due to Panda's type inference in any downstream uses of the metadata.Motivated by recent error in ncov workflow (Slack thread)
Testing
Added new functional test for
filter
where we initially saw the type error.Checklist