io/read_metadata: ignore pandas DtypeWarning #1238

Merged 1 commit on Jul 7, 2023
1 change: 1 addition & 0 deletions augur/io/metadata.py
Member:

I haven't tested, but does this actually suppress the DtypeWarnings when chunksize is specified? I'd think the inferred types could still end up mixed as long as the data is read in chunks and the split happens to yield differently inferred types per chunk. Or does pandas do something before reading the chunks, loading all values in a column at once to infer the data type?

Contributor Author:

From my understanding, when chunksize is specified, each chunk is read separately, so pandas does not compare dtypes across chunks. If you specified chunksize=1, you would never see any dtype warnings.

This is different from pandas' internal chunking with low_memory=True, where it does compare dtypes across chunks because they get loaded into the same dataframe.
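
For illustration, a minimal sketch of the two behaviors described above (the metadata.tsv path is a placeholder, not a file from this repo):

```python
import pandas as pd

# Explicit chunking: read_csv returns an iterator of DataFrames. Dtypes are
# inferred per chunk and never compared across chunks, so no DtypeWarning.
total_rows = 0
for chunk in pd.read_csv("metadata.tsv", sep="\t", chunksize=10_000):
    total_rows += len(chunk)

# Internal chunking (the default low_memory=True): pandas parses the file in
# pieces but assembles them into a single DataFrame, and warns when a column's
# inferred dtype differs between those pieces.
df = pd.read_csv("metadata.tsv", sep="\t")

# low_memory=False infers each column's dtype in one pass: higher peak memory,
# consistent inference, and no DtypeWarning.
df = pd.read_csv("metadata.tsv", sep="\t", low_memory=False)
```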

Member:

As you've noted, read_metadata surfaces the chunksize parameter (via chunk_size). However, it's only exposed at the CLI level by augur filter --metadata-chunk-size. Does this mean other subcommands are prone to out-of-memory errors?

I haven't tested, but an example would be using a large metadata file with augur frequencies. With the default low_memory=True, pandas would still make this work by splitting the read into chunks under the hood. With low_memory=False, that would no longer be the case.

Potential ways to address this:

  1. Surface --metadata-chunk-size in all subcommands that call read_metadata (see the sketch after this list).
  2. Force read_metadata to always return a chunk iterator.
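
A rough sketch of option 1, assuming augur's usual register_arguments/run subcommand hooks and an existing --metadata argument; the names and default value here are illustrative, not the actual implementation:

```python
from augur.io.metadata import read_metadata


def register_arguments(parser):
    # Mirrors the existing `augur filter --metadata-chunk-size` flag
    # (the default shown here is only a placeholder).
    parser.add_argument(
        "--metadata-chunk-size",
        type=int,
        default=100000,
        help="number of metadata records to read into memory at a time",
    )


def run(args):
    # With chunk_size set, read_metadata returns an iterator of DataFrames
    # rather than one large DataFrame, keeping peak memory bounded.
    for chunk in read_metadata(args.metadata, chunk_size=args.metadata_chunk_size):
        ...  # per-chunk processing specific to the subcommand
```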

Contributor Author:

> Does this mean other subcommands are prone to out-of-memory errors?

Maybe? As it stands, the other subcommands already read the entire metadata file into memory. Any additional memory pressure would come from pandas doing type inference across a whole column at once rather than in chunks.

Member:

I see: if there is any file that would cause memory issues with low_memory=False, then it would already have memory issues with the current approach of reading it in its entirety.

@corneliusroemer I don't think low_memory=False will cause OOM crashes. Either (1) the data is already read in chunks to prevent OOM crashes (as done in augur filter), or (2) OOM crashes would already be happening due to the absence of chunking.

victorlin marked this conversation as resolved.
@@ -80,6 +80,7 @@ def read_metadata(metadata_file, delimiters=DEFAULT_DELIMITERS, id_columns=DEFAU
         "engine": "c",
         "skipinitialspace": True,
         "na_filter": False,
+        "low_memory": False,
     }

     if chunk_size:
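
For context, a self-contained sketch of the warning this change suppresses; the data is synthetic and whether the warning fires under the defaults depends on pandas' internal chunk size, so treat it as illustrative rather than a guaranteed test:

```python
import io
import warnings

import pandas as pd

# One column whose first half parses as integers and second half as strings,
# so different internal chunks can infer different dtypes.
values = [str(i) for i in range(1_000_000)] + ["x"] * 1_000_000
text = "col\n" + "\n".join(values)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    pd.read_csv(io.StringIO(text))  # default low_memory=True
print([w.category.__name__ for w in caught])  # expect DtypeWarning here

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    pd.read_csv(io.StringIO(text), low_memory=False)  # this PR's setting
print([w.category.__name__ for w in caught])  # expect no DtypeWarning
```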