From 7582db730ccda3ddcf28ee186b4febe11e0687ca Mon Sep 17 00:00:00 2001
From: Jover
Date: Tue, 13 Jun 2023 10:41:21 -0700
Subject: [PATCH] io/read_metadata: set `low_memory=False`
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

This suppresses the `DtypeWarning` messages that pandas emits when it
infers different dtypes for a column in the metadata. We do not need
pandas to internally parse files in chunks since we already surface the
`chunksize` parameter to control memory usage.

This change was motivated by internal discussion on Slack about how
these warning messages overwhelm the logs of the ncov builds and make
debugging a pain.¹

I have seen surprising memory usage in the past with `low_memory=False`
within ncov-ingest.² However, that was due to an unexpected interaction
with the `usecols` parameter, where the entire file was read before
being subset to the columns provided.

In the future, we may want to explicitly set the dtype to `string` for
all columns in the metadata, as suggested by @tsibley in a separate PR.³
However, that will require wider changes throughout Augur, where uses of
the metadata may be expecting the inferred dtypes (such as in
augur export⁴).
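
As a minimal sketch of the behavior described above (not the actual
Augur code), the following reads a small in-memory TSV with the same
parser options plus `low_memory=False` and a caller-supplied
`chunksize`. The sample data, column names, `sep` value, and chunk size
are all hypothetical and chosen only for illustration.

```python
import io
import warnings

import pandas as pd

# Hypothetical in-memory metadata; a real build would pass a file path.
tsv = "strain\tage\nA\t12\nB\tunknown\nC\t7\n"

kwargs = {
    "sep": "\t",            # assumed delimiter for this sketch
    "engine": "c",
    "skipinitialspace": True,
    "na_filter": False,
    "low_memory": False,    # parse each returned chunk in one pass
    "chunksize": 2,         # caller-controlled chunking bounds memory use
}

with warnings.catch_warnings():
    # Escalate DtypeWarning to an error so it cannot pass silently.
    warnings.simplefilter("error", pd.errors.DtypeWarning)
    chunks = list(pd.read_csv(io.StringIO(tsv), **kwargs))

row_counts = [len(chunk) for chunk in chunks]
```

Dtypes are still inferred separately for each chunk returned via
`chunksize`, so a column may come back with different dtypes in
different chunks unless a dtype is set explicitly.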

¹ https://bedfordlab.slack.com/archives/C0K3GS3J8/p1686671582331959?thread_ts=1685568402.393599&cid=C0K3GS3J8
² https://github.com/nextstrain/ncov-ingest/pull/386/commits/7bde90a992e30c8b745c5d82ee1ce51bba742e8b
³ https://github.com/nextstrain/augur/pull/1235#discussion_r1207327871
⁴ https://github.com/nextstrain/augur/blob/b61e3e7e969ff1b82fce5f2e2f388a10e6f3c305/augur/export_v2.py#L239-L245
---
 augur/io/metadata.py | 1 +
 1 file changed, 1 insertion(+)

diff --git a/augur/io/metadata.py b/augur/io/metadata.py
index ded54ff80..17277d850 100644
--- a/augur/io/metadata.py
+++ b/augur/io/metadata.py
@@ -80,6 +80,7 @@ def read_metadata(metadata_file, delimiters=DEFAULT_DELIMITERS, id_columns=DEFAU
         "engine": "c",
         "skipinitialspace": True,
         "na_filter": False,
+        "low_memory": False,
     }
 
     if chunk_size: