Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Set date column to be string #1235

Merged
merged 4 commits into from
May 26, 2023
Merged

Set date column to be string #1235

merged 4 commits into from
May 26, 2023

Conversation

joverlee521
Copy link
Contributor

Description of proposed changes

Set the dtype for the date column to be string within read_metadata so that we don't run into unexpected errors due to Panda's type inference in any downstream uses of the metadata.

Motivated by recent error in ncov workflow (Slack thread)

Testing

Added new functional test for filter where we initially saw the type error.

Checklist

  • Add a message in CHANGES.md summarizing the changes in this PR that are end user focused. Keep headers and formatting consistent with the rest of the file.

If all dates are year only dates, Pandas infers their type as int, which
leads to unexpected error in filter.
Moving in preparation to set dtype for the column in `read_metadata`
in following commit.

This is also a constant value that should apply to all uses of metadata,
not just within augur.filter.
Avoid Pandas type inference for the date column because it could contain
incomplete year only dates that get inferred as int.
@joverlee521 joverlee521 requested a review from a team May 26, 2023 20:20
@codecov
Copy link

codecov bot commented May 26, 2023

Codecov Report

Patch coverage: 100.00% and no project coverage change.

Comparison is base (b61e3e7) 68.87% compared to head (def0840) 68.88%.

❗ Current head def0840 differs from pull request most recent head eff0328. Consider uploading reports for the commit eff0328 to get more accurate results

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #1235   +/-   ##
=======================================
  Coverage   68.87%   68.88%           
=======================================
  Files          64       64           
  Lines        6937     6939    +2     
  Branches     1693     1693           
=======================================
+ Hits         4778     4780    +2     
  Misses       1854     1854           
  Partials      305      305           
Impacted Files Coverage Δ
augur/filter/constants.py 100.00% <ø> (ø)
augur/filter/__init__.py 100.00% <100.00%> (ø)
augur/filter/include_exclude_rules.py 97.74% <100.00%> (+0.01%) ⬆️
augur/filter/subsample.py 98.57% <100.00%> (+0.01%) ⬆️
augur/io/metadata.py 96.29% <100.00%> (+0.02%) ⬆️

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

Copy link
Member

@tsibley tsibley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Test works as expected for me locally (failing then passing).

augur/io/metadata.py Show resolved Hide resolved
@joverlee521 joverlee521 merged commit fe920c7 into master May 26, 2023
@joverlee521 joverlee521 deleted the set-date-col-dtype branch May 26, 2023 21:46
joverlee521 added a commit that referenced this pull request Jun 13, 2023
This suppresses the `DtypeWarnings` messages from pandas when it infers
different dtypes for a column in the metadata. We do not need pandas to
internally parse files in chunks since we already surface the `chunksize`
parameter to control memory usage. This change was motivated by internal
discussion on Slack about how these warning messages overwhelm the logs
of the ncov builds and make debugging a pain.¹

I have seen surprising memory usage in the past with `low_memory=False`
within ncov-ingest². However that was due to the unexpected interaction
with the `usecols` parameter, where the entire file was read before
being subset to the columns provided.

In the future, we may want to explicitly set the dtype to `string` for
all columns in the metadata as suggested by @tsibley in a separate PR.³
However, that will require wider changes throughout Augur where uses of
the metadata may be expecting the inferred dtypes (such as in
augur export⁴).

¹ https://bedfordlab.slack.com/archives/C0K3GS3J8/p1686671582331959?thread_ts=1685568402.393599&cid=C0K3GS3J8
² nextstrain/ncov-ingest@7bde90a
³ #1235 (comment)https://github.com/nextstrain/augur/blob/b61e3e7e969ff1b82fce5f2e2f388a10e6f3c305/augur/export_v2.py#L239-L245
Copy link
Member

@victorlin victorlin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did a post-merge review, looks good 👍

joverlee521 added a commit that referenced this pull request Jul 7, 2023
This suppresses the `DtypeWarnings` messages from pandas when it infers
different dtypes for a column in the metadata. We do not need pandas to
internally parse files in chunks since we already surface the `chunksize`
parameter to control memory usage. This change was motivated by internal
discussion on Slack about how these warning messages overwhelm the logs
of the ncov builds and make debugging a pain.¹

I have seen surprising memory usage in the past with `low_memory=False`
within ncov-ingest². However that was due to the unexpected interaction
with the `usecols` parameter, where the entire file was read before
being subset to the columns provided.

In the future, we may want to explicitly set the dtype to `string` for
all columns in the metadata as suggested by @tsibley in a separate PR.³
However, that will require wider changes throughout Augur where uses of
the metadata may be expecting the inferred dtypes (such as in
augur export⁴).

¹ https://bedfordlab.slack.com/archives/C0K3GS3J8/p1686671582331959?thread_ts=1685568402.393599&cid=C0K3GS3J8
² nextstrain/ncov-ingest@7bde90a
³ #1235 (comment)https://github.com/nextstrain/augur/blob/b61e3e7e969ff1b82fce5f2e2f388a10e6f3c305/augur/export_v2.py#L239-L245
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants