From 7582db730ccda3ddcf28ee186b4febe11e0687ca Mon Sep 17 00:00:00 2001
From: Jover <joverlee521@gmail.com>
Date: Tue, 13 Jun 2023 10:41:21 -0700
Subject: [PATCH] io/read_metadata: set `low_memory=False`
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

This suppresses the `DtypeWarning` messages that pandas emits when it
infers different dtypes for a column across its internal chunks of the
metadata. We do not need pandas to parse files in chunks internally
since we already surface the `chunksize` parameter to control memory
usage. This change was motivated by internal discussion on Slack about
how these warning messages overwhelm the logs of the ncov builds and
make debugging difficult.¹
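
As a rough sketch of the resulting behavior (not the Augur code itself),
the snippet below passes `low_memory=False` together with `chunksize` to
`pandas.read_csv`. The `read_chunks` helper, the column names, and the
inline data are hypothetical and exist only for illustration.

```python
import io

import pandas as pd

# Hypothetical metadata: five records with made-up column names.
tsv = "strain\tclade\n" + "".join(f"s{i}\t{i}\n" for i in range(5))


def read_chunks(text, chunk_size):
    """Read tab-delimited text in caller-sized chunks.

    `low_memory=False` turns off pandas' internal chunked parsing, while
    `chunksize` still bounds how many rows are returned at a time.
    """
    reader = pd.read_csv(
        io.StringIO(text),
        sep="\t",
        engine="c",
        skipinitialspace=True,
        na_filter=False,
        low_memory=False,
        chunksize=chunk_size,
    )
    return list(reader)


chunks = read_chunks(tsv, 2)
```

Each element of `chunks` is a DataFrame of at most `chunk_size` rows, so
the caller still decides how much of the file is held in memory at once.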

I have seen surprising memory usage in the past with `low_memory=False`
within ncov-ingest². However, that was due to an unexpected interaction
with the `usecols` parameter, where the entire file was read before
being subset to the columns provided.

In the future, we may want to explicitly set the dtype to `string` for
all columns in the metadata as suggested by @tsibley in a separate PR.³
However, that will require wider changes throughout Augur where uses of
the metadata may be expecting the inferred dtypes (such as in
augur export⁴).

¹ https://bedfordlab.slack.com/archives/C0K3GS3J8/p1686671582331959?thread_ts=1685568402.393599&cid=C0K3GS3J8
² https://github.com/nextstrain/ncov-ingest/pull/386/commits/7bde90a992e30c8b745c5d82ee1ce51bba742e8b
³ https://github.com/nextstrain/augur/pull/1235#discussion_r1207327871
⁴ https://github.com/nextstrain/augur/blob/b61e3e7e969ff1b82fce5f2e2f388a10e6f3c305/augur/export_v2.py#L239-L245
---
 augur/io/metadata.py | 1 +
 1 file changed, 1 insertion(+)

diff --git a/augur/io/metadata.py b/augur/io/metadata.py
index ded54ff80..17277d850 100644
--- a/augur/io/metadata.py
+++ b/augur/io/metadata.py
@@ -80,6 +80,7 @@ def read_metadata(metadata_file, delimiters=DEFAULT_DELIMITERS, id_columns=DEFAU
         "engine": "c",
         "skipinitialspace": True,
         "na_filter": False,
+        "low_memory": False,
     }
 
     if chunk_size: