Set `date` column to be `string` #1235

joverlee521 · 2023-05-26T20:20:35Z

Description of proposed changes

Set the dtype for the date column to `string` within `read_metadata` so that downstream uses of the metadata don't run into unexpected errors caused by pandas' type inference.

Motivated by a recent error in the ncov workflow (Slack thread).
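For illustration, a minimal sketch of the inference behavior described above. It calls `pd.read_csv` directly rather than Augur's `read_metadata`, and the two-row metadata table is made up for the example.

```python
import io

import pandas as pd

# Hypothetical metadata in which every date is a year-only value.
METADATA_TSV = "strain\tdate\nA\t2020\nB\t2021\n"

# Default behavior: pandas infers an integer dtype for the date column.
inferred = pd.read_csv(io.StringIO(METADATA_TSV), sep="\t")

# Explicit dtype: the date column is read as strings regardless of its content.
explicit = pd.read_csv(
    io.StringIO(METADATA_TSV),
    sep="\t",
    dtype={"date": "string"},
)

print(inferred["date"].dtype)
print(explicit["date"].dtype)
print(explicit["date"].tolist())
```

With the explicit dtype, the values stay as the strings `"2020"` and `"2021"`, so code that expects date strings sees the same type whether or not the dates are complete.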

Testing

Added a new functional test for `augur filter`, where we initially saw the type error.

Checklist

Add a message in CHANGES.md summarizing the changes in this PR that are end user focused. Keep headers and formatting consistent with the rest of the file.

codecov · 2023-05-26T20:37:46Z

Codecov Report

Patch coverage: 100.00% and no project coverage change.

Comparison is base (b61e3e7) 68.87% compared to head (def0840) 68.88%.

❗ Current head def0840 differs from pull request most recent head eff0328. Consider uploading reports for the commit eff0328 to get more accurate results

Additional details and impacted files

@@           Coverage Diff           @@
##           master    #1235   +/-   ##
=======================================
  Coverage   68.87%   68.88%           
=======================================
  Files          64       64           
  Lines        6937     6939    +2     
  Branches     1693     1693           
=======================================
+ Hits         4778     4780    +2     
  Misses       1854     1854           
  Partials      305      305

Impacted Files	Coverage Δ
augur/filter/constants.py	`100.00% <ø> (ø)`
augur/filter/__init__.py	`100.00% <100.00%> (ø)`
augur/filter/include_exclude_rules.py	`97.74% <100.00%> (+0.01%)`	⬆️
augur/filter/subsample.py	`98.57% <100.00%> (+0.01%)`	⬆️
augur/io/metadata.py	`96.29% <100.00%> (+0.02%)`	⬆️

tsibley

LGTM. Test works as expected for me locally (failing then passing).

augur/io/metadata.py

@tsibley

This suppresses the `DtypeWarning` messages from pandas when it infers different dtypes for a column in the metadata. We do not need pandas to internally parse files in chunks since we already surface the `chunksize` parameter to control memory usage.

This change was motivated by internal discussion on Slack about how these warning messages overwhelm the logs of the ncov builds and make debugging a pain.¹

I have seen surprising memory usage in the past with `low_memory=False` within ncov-ingest.² However, that was due to an unexpected interaction with the `usecols` parameter, where the entire file was read before being subset to the columns provided.

In the future, we may want to explicitly set the dtype to `string` for all columns in the metadata, as suggested by @tsibley in a separate PR.³ However, that will require wider changes throughout Augur, where uses of the metadata may be expecting the inferred dtypes (such as in augur export⁴).

¹ https://bedfordlab.slack.com/archives/C0K3GS3J8/p1686671582331959?thread_ts=1685568402.393599&cid=C0K3GS3J8
² nextstrain/ncov-ingest@7bde90a
³ #1235 (comment)
⁴ https://github.com/nextstrain/augur/blob/b61e3e7e969ff1b82fce5f2e2f388a10e6f3c305/augur/export_v2.py#L239-L245
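A rough sketch of the combination discussed in this comment, assuming the change amounts to passing `low_memory=False` to `pd.read_csv` alongside a caller-supplied `chunksize`. The `clade` column and its values are hypothetical and exist only to show that dtypes are still inferred independently for each chunk.

```python
import io
import warnings

import pandas as pd

# Hypothetical metadata: early rows of "clade" look numeric, later rows do not.
METADATA_TSV = "strain\tclade\nA\t1\nB\t2\nC\t20A\nD\t20B\n"

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    reader = pd.read_csv(
        io.StringIO(METADATA_TSV),
        sep="\t",
        chunksize=2,       # caller-controlled chunking bounds memory use
        low_memory=False,  # turn off pandas' internal chunked parsing
    )
    # Each chunk is a separate DataFrame with its own inferred dtypes.
    chunk_dtypes = [str(chunk["clade"].dtype) for chunk in reader]

# Collect any DtypeWarning that pandas emitted while reading.
dtype_warnings = [
    w for w in caught if issubclass(w.category, pd.errors.DtypeWarning)
]

print(chunk_dtypes)
print(len(dtype_warnings))
```

The first chunk is inferred as integers and the second is not, with no `DtypeWarning` recorded. This is the kind of chunk-to-chunk inconsistency that setting an explicit dtype for a column avoids.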

victorlin

Did a post-merge review, looks good 👍

joverlee521 added 3 commits May 26, 2023 13:04

tests: filter: add failing test with year only dates

4e1a830

If all dates are year-only dates, pandas infers their type as int, which leads to an unexpected error in filter.

Move METADATA_DATE_COLUMN to augur.io.metadata

bd4b5d8

Moving in preparation to set the dtype for the column in `read_metadata` in the following commit. This is also a constant value that should apply to all uses of metadata, not just within augur.filter.

read_metadata: add dtype for METADATA_DATE_COLUMN

def0840

Avoid pandas type inference for the date column because it could contain incomplete year-only dates that get inferred as int.
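Taken together, the three commits could look roughly like the sketch below. The constant name comes from the commit title; its value `"date"` is assumed from the PR title, and `read_metadata_sketch` is a hypothetical stand-in, not Augur's actual `read_metadata`. The `.str` failure at the end only illustrates why an int column can break string-based date handling; it is not necessarily the error seen in filter.

```python
import io

import pandas as pd

# Name from the commit title; the value "date" is an assumption.
METADATA_DATE_COLUMN = "date"


def read_metadata_sketch(handle, chunksize=None):
    """Hypothetical reader that pins the date column to the string dtype."""
    return pd.read_csv(
        handle,
        sep="\t",
        dtype={METADATA_DATE_COLUMN: "string"},
        chunksize=chunksize,
    )


metadata = read_metadata_sketch(io.StringIO("strain\tdate\nA\t2020\nB\t2021\n"))

# String methods work because the column was never inferred as int.
years = metadata[METADATA_DATE_COLUMN].str.slice(0, 4).tolist()

# For contrast: an integer Series does not support the .str accessor at all.
try:
    pd.Series([2020, 2021]).str
    int_column_supports_str = True
except AttributeError:
    int_column_supports_str = False

print(years)
print(int_column_supports_str)
```

Keeping the column name in one shared constant means every reader and every downstream consumer refers to the same column, which matches the reason given for moving it out of `augur.filter`.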

joverlee521 requested a review from a team May 26, 2023 20:20

tsibley approved these changes May 26, 2023

Update changelog

eff0328

joverlee521 merged commit fe920c7 into master May 26, 2023

joverlee521 deleted the set-date-col-dtype branch May 26, 2023 21:46

joverlee521 mentioned this pull request Jun 13, 2023

io/read_metadata: ignore pandas DtypeWarning #1238

Merged

1 task

victorlin reviewed Jun 14, 2023

victorlin mentioned this pull request Jul 6, 2023

Use less dtype inference when reading metadata into DataFrames #1252

Merged

3 tasks
