Use io.read_metadata during export #909

huddlej · 2022-04-28T22:02:26Z

Description of proposed changes

Replaces a call to the older utils.read_metadata function with the newer io.read_metadata function while processing metadata for export to an Auspice JSON. This new function returns a pandas DataFrame indexed by the first viable strain name column found in the metadata file (removing this column from the data itself), while the original function returns a dictionary indexed by strain name (keeping the original named column like strain or name in the data).

To avoid changing the downstream code that consumes the metadata, this commit converts the pandas DataFrame to a dictionary that matches the output of the original function. The main advantage here is that the calling code does not need to know what the id column is named, since io.read_metadata handles this and indexed the data frame by that column.

Importantly, the io.read_metadata function defines a list of accepted id columns to look for and will raise an exception with a helpful error message when the input metadata has no valid id columns. Commit 45df6bd adds a check for this exception in the augur export v2 command such that the user will get a helpful error message on the command line instead of an uncaught exception. An example from a new functional test demonstrates this behavior:

$ augur export v2 \
  --tree tree.nwk \
  --metadata dataset1_metadata_without_valid_id.tsv \
  --node-data div_node-data.json location_node-data.json \
  --auspice-config auspice_config1.json \
  --maintainers "Nextstrain Team" \
  --output test.json
ERROR: None of the possible id columns (('strain', 'name')) were found in the metadata's columns ('invalid_id', 'div', 'mutation_length')
$ echo $?
1

I preferred this implementation over the check for valid id columns in #906 because:

it replaces an unmaintained read_metadata function with a maintained version,
it does not duplicate information about valid id columns in the export module (this information is left to read_metadata),
and it provides the user with a helpful error message when there are no valid id columns.

Related issue(s)

Fixes #905

Testing

Adds functional tests for the expected behavior of export v2 with metadata inputs.
Tested by CI

Replaces a call to the older `utils.read_metadata` function with the newer `io.read_metadata` function while processing metadata for export to an Auspice JSON. This new function returns a pandas DataFrame indexed by the first viable strain name column found in the metadata file (removing this column from the data itself), while the original function returns a dictionary indexed by strain name (keeping the original named column like `strain` or `name` in the data). To avoid changing the downstream code that consumes the metadata, this commit converts the pandas DataFrame to a dictionary that matches the output of the original function. The main advantage here is that the calling code does not need to know what the id column is named, since `io.read_metadata` handles this and indexed the data frame by that column. This commit also adds functional tests for the expected behavior of export v2 with metadata inputs. Fixes #905

Splits up the first load of the metadata (where we check for available id columns) into separate variables for metadata and the first chunk and then explicitly closes the metadata file after the read. This close call is required to prevent seemingly random "unclosed file" warnings from appearing in stderr.

codecov · 2022-05-04T22:51:41Z

Codecov Report

Merging #909 (45df6bd) into master (4b71e7d) will increase coverage by 2.18%.
The diff coverage is 38.46%.

@@            Coverage Diff             @@
##           master     #909      +/-   ##
==========================================
+ Coverage   34.63%   36.81%   +2.18%     
==========================================
  Files          42       42              
  Lines        6006     6886     +880     
  Branches     1538     1959     +421     
==========================================
+ Hits         2080     2535     +455     
- Misses       3853     4221     +368     
- Partials       73      130      +57

Impacted Files	Coverage Δ
augur/export_v2.py	`12.58% <20.00%> (+1.47%)`	⬆️
augur/io.py	`100.00% <100.00%> (ø)`
augur/distance.py	`46.21% <0.00%> (-2.15%)`	⬇️
augur/tree.py	`13.31% <0.00%> (+0.09%)`	⬆️
augur/filter.py	`71.02% <0.00%> (+0.64%)`	⬆️
augur/align.py	`90.16% <0.00%> (+1.17%)`	⬆️
augur/titer_model.py	`23.48% <0.00%> (+3.51%)`	⬆️
augur/utils.py	`50.00% <0.00%> (+6.16%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4b71e7d...45df6bd. Read the comment docs.

Adds a functional test for how we'd like augur export v2 to behave when the user provides metadata without a valid id and then updates the try/except block in export_v2 to catch and print the error from `read_metadata` when it occurs. While we're at it, this commit also prints the error for missing metadata file to stderr instead of stdout.

jameshadfield

👍

For our internal code it seems clear that we should move to the augur.io function. Currently export v1, refine, frequencies, titers and traits still use read_metadata. Each of them look simple to update using a similar pattern to what you've used here, e.g.:

# untested
metadata_df = read_metadata(args.metadata)
columns = metadata_df.columns
meta_tsv = metadata_df.to_dict(orient="index")
for strain in meta_tsv.keys():
    meta_tsv[strain]["strain"] = strain

We could then add a deprecation warning to the util function. In a future major release we can then remove:

augur.utils -> read_metadata
all of augur.util_support.metadata_file
tests/util_support/test_metadata_file.py

If this seems sensible we can add the above to a new issue to tackle.

jameshadfield · 2022-05-05T00:10:52Z

augur/export_v2.py

            sys.exit(2)
+        except Exception as error:


I'd add this to a list (or somesuch) as this shouldn't be caught once #903 is implemented. Maybe tagging the issue here is enough as it'll now show up on that issue...

Yeah, I exercised restraint by not just coding up #903 in this PR 😄 My issue with the implementation here is that the exception raised by read_metadata is an "expected" exception, but it isn't specific enough to catch by itself here without catching all other exceptions. We should really raise a custom IOException or similar instead.

huddlej added 3 commits April 28, 2022 14:39

Add missing files for functional tests

50c12a7

huddlej marked this pull request as ready for review May 4, 2022 23:13

huddlej requested review from corneliusroemer and jameshadfield May 4, 2022 23:13

jameshadfield approved these changes May 5, 2022

View reviewed changes

huddlej merged commit c610cf2 into master May 5, 2022

huddlej deleted the fix-metadata-in-export branch May 5, 2022 21:31

huddlej added this to the Patch release 15.0.2 milestone May 5, 2022

jameshadfield mentioned this pull request Jun 16, 2022

Document how to upgrade read_metadata #970

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use io.read_metadata during export #909

Use io.read_metadata during export #909

huddlej commented Apr 28, 2022 •

edited

Loading

codecov bot commented May 4, 2022 •

edited

Loading

jameshadfield left a comment

jameshadfield May 5, 2022

huddlej May 5, 2022

Use io.read_metadata during export #909

Use io.read_metadata during export #909

Conversation

huddlej commented Apr 28, 2022 • edited Loading

Description of proposed changes

Related issue(s)

Testing

codecov bot commented May 4, 2022 • edited Loading

Codecov Report

jameshadfield left a comment

Choose a reason for hiding this comment

jameshadfield May 5, 2022

Choose a reason for hiding this comment

huddlej May 5, 2022

Choose a reason for hiding this comment

huddlej commented Apr 28, 2022 •

edited

Loading

codecov bot commented May 4, 2022 •

edited

Loading