io/read_metadata: ignore pandas DtypeWarning #1238
Conversation
Added comments
I haven't tested, but does this actually suppress the `DtypeWarning`s when `chunksize` is specified? I'd think that inferred types would still be mixed as long as the data is read in chunks that split it in a way that yields differently inferred types. Or does pandas do something before reading the chunks, such as loading all values in a column at once to infer the data type?
From my understanding, when `chunksize` is specified, each chunk is read separately, so pandas does not compare dtypes across chunks. If you specified `chunksize=1`, you would never see any dtype warnings.

This is different from pandas' internal chunking with `low_memory=True`, where it does compare dtypes across chunks because they get loaded into the same dataframe.
As you've noted, `read_metadata` surfaces the `chunksize` parameter (via `chunk_size`). However, it's only surfaced at the CLI level by `augur filter --metadata-chunk-size`. Does this mean other subcommands are prone to out-of-memory errors?

I haven't tested, but an example would be using a large metadata file with `augur frequencies`. With the default `low_memory=True`, pandas would still make this work by splitting into chunks under the hood. With `low_memory=False`, that wouldn't be the case.

Potential ways to address this:

- Surface `--metadata-chunk-size` in all subcommands that call `read_metadata`.
- Force `read_metadata` to always return a chunk iterator.
> Does this mean other subcommands are prone to out of memory errors?
Maybe? As it stands, the other subcommands already read the entire metadata file into memory. The memory errors would come from pandas having to do type inference across the whole column rather than in chunks.
I see – if there is any file that would give memory issues with `low_memory=False`, then it would already have memory issues with the current approach of reading it in its entirety.

@corneliusroemer I don't think `low_memory=False` will cause OOM crashes. Either (1) the data is already read in chunks to prevent OOM crashes (as done in `augur filter`), or (2) OOM crashes would already be happening due to the absence of chunking.
> Bit worried this could cause OOM crashes. Can we do this in a backwards compatible manner, or allow opt-out?
Force-pushed from 73508e8 to 6a5ac00.
Commit `low_memory=False`: "Since there are concerns of memory errors with setting …"
Codecov Report — Patch coverage:
Additional details and impacted files@@ Coverage Diff @@
## master #1238 +/- ##
==========================================
+ Coverage 68.88% 68.95% +0.06%
==========================================
Files 64 64
Lines 6939 6964 +25
Branches 1693 1696 +3
==========================================
+ Hits 4780 4802 +22
- Misses 1854 1856 +2
- Partials 305 306 +1
After clarification in #1238 (comment), I'd prefer the original 73508e8 over 6a5ac00.
This suppresses the `DtypeWarning` messages from pandas when it infers different dtypes for a column in the metadata. We do not need pandas to internally parse files in chunks since we already surface the `chunksize` parameter to control memory usage. This change was motivated by internal discussion on Slack about how these warning messages overwhelm the logs of the ncov builds and make debugging a pain.¹

I have seen surprising memory usage in the past with `low_memory=False` within ncov-ingest.² However, that was due to the unexpected interaction with the `usecols` parameter, where the entire file was read before being subset to the columns provided.

In the future, we may want to explicitly set the dtype to `string` for all columns in the metadata as suggested by @tsibley in a separate PR.³ However, that will require wider changes throughout Augur where uses of the metadata may be expecting the inferred dtypes (such as in augur export⁴).

¹ https://bedfordlab.slack.com/archives/C0K3GS3J8/p1686671582331959?thread_ts=1685568402.393599&cid=C0K3GS3J8
² nextstrain/ncov-ingest@7bde90a
³ #1235 (comment)
⁴ https://github.com/nextstrain/augur/blob/b61e3e7e969ff1b82fce5f2e2f388a10e6f3c305/augur/export_v2.py#L239-L245
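The suppression pattern can be sketched as follows. This is a simplified illustration, not the exact Augur code; the synthetic data merely mimics a column that is numeric at the top and textual at the bottom, which can make pandas' internal chunks disagree on the inferred dtype.

```python
import io
import warnings

import pandas as pd
from pandas.errors import DtypeWarning

# Synthetic mixed-type column: many numeric rows followed by many text rows.
csv = "col\n" + "\n".join(["1"] * 200_000 + ["text"] * 200_000)

# Keep low_memory=True so pandas still parses internally in chunks, but
# silence the DtypeWarning it may raise when those chunks disagree.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", DtypeWarning)
    df = pd.read_csv(io.StringIO(csv), low_memory=True)

print(len(df), df["col"].dtype)  # 400000 object
```

Scoping the filter with `warnings.catch_warnings()` keeps the suppression local to the read rather than muting `DtypeWarning` process-wide.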
Force-pushed from 6a5ac00 to 7582db7.
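As a sketch of the dtype-to-`string` direction mentioned in the description (illustrative only, not part of this PR):

```python
import io

import pandas as pd

# Force every metadata column to the pandas "string" dtype so no type
# inference happens at all; "2" stays a string rather than becoming an int.
csv = "strain\tdate\ncase1\t2021-01-01\n2\t2021-02-03\n"
df = pd.read_csv(io.StringIO(csv), sep="\t", dtype="string")

print(df.dtypes.tolist())  # both columns use the string dtype
```

With uniform string dtypes there is nothing for pandas to infer, so `DtypeWarning` can never fire; the cost is that downstream code expecting inferred numeric or datetime dtypes (such as augur export) would need updating.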
Testing
What steps should be taken to test the changes you've proposed?
If you added or changed behavior in the codebase, did you update the tests, or do you need help with this?
Checklist