Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use io.read_metadata during export #909

Merged
merged 4 commits into from
May 5, 2022
Merged

Conversation

huddlej
Copy link
Contributor

@huddlej huddlej commented Apr 28, 2022

Description of proposed changes

Replaces a call to the older utils.read_metadata function with the newer io.read_metadata function while processing metadata for export to an Auspice JSON. This new function returns a pandas DataFrame indexed by the first viable strain name column found in the metadata file (removing this column from the data itself), while the original function returns a dictionary indexed by strain name (keeping the original named column like strain or name in the data).

To avoid changing the downstream code that consumes the metadata, this commit converts the pandas DataFrame to a dictionary that matches the output of the original function. The main advantage here is that the calling code does not need to know what the id column is named, since io.read_metadata handles this and indexed the data frame by that column.

Importantly, the io.read_metadata function defines a list of accepted id columns to look for and will raise an exception with a helpful error message when the input metadata has no valid id columns. Commit 45df6bd adds a check for this exception in the augur export v2 command such that the user will get a helpful error message on the command line instead of an uncaught exception. An example from a new functional test demonstrates this behavior:

$ augur export v2 \
  --tree tree.nwk \
  --metadata dataset1_metadata_without_valid_id.tsv \
  --node-data div_node-data.json location_node-data.json \
  --auspice-config auspice_config1.json \
  --maintainers "Nextstrain Team" \
  --output test.json
ERROR: None of the possible id columns (('strain', 'name')) were found in the metadata's columns ('invalid_id', 'div', 'mutation_length')
$ echo $?
1

I preferred this implementation over the check for valid id columns in #906 because:

  1. it replaces an unmaintained read_metadata function with a maintained version,
  2. it does not duplicate information about valid id columns in the export module (this information is left to read_metadata),
  3. and it provides the user with a helpful error message when there are no valid id columns.

Related issue(s)

Fixes #905

Testing

  • Adds functional tests for the expected behavior of export v2 with metadata inputs.
  • Tested by CI

Replaces a call to the older `utils.read_metadata` function with the
newer `io.read_metadata` function while processing metadata for export
to an Auspice JSON. This new function returns a pandas DataFrame indexed
by the first viable strain name column found in the metadata
file (removing this column from the data itself), while the original
function returns a dictionary indexed by strain name (keeping the
original named column like `strain` or `name` in the data). To avoid
changing the downstream code that consumes the metadata, this commit
converts the pandas DataFrame to a dictionary that matches the output of
the original function. The main advantage here is that the calling code
does not need to know what the id column is named, since
`io.read_metadata` handles this and indexed the data frame by that
column.

This commit also adds functional tests for the expected behavior of
export v2 with metadata inputs.

Fixes #905
Splits up the first load of the metadata (where we check for available
id columns) into separate variables for metadata and the first chunk and
then explicitly closes the metadata file after the read. This close call
is required to prevent seemingly random "unclosed file" warnings from
appearing in stderr.
@codecov
Copy link

codecov bot commented May 4, 2022

Codecov Report

Merging #909 (45df6bd) into master (4b71e7d) will increase coverage by 2.18%.
The diff coverage is 38.46%.

@@            Coverage Diff             @@
##           master     #909      +/-   ##
==========================================
+ Coverage   34.63%   36.81%   +2.18%     
==========================================
  Files          42       42              
  Lines        6006     6886     +880     
  Branches     1538     1959     +421     
==========================================
+ Hits         2080     2535     +455     
- Misses       3853     4221     +368     
- Partials       73      130      +57     
Impacted Files Coverage Δ
augur/export_v2.py 12.58% <20.00%> (+1.47%) ⬆️
augur/io.py 100.00% <100.00%> (ø)
augur/distance.py 46.21% <0.00%> (-2.15%) ⬇️
augur/tree.py 13.31% <0.00%> (+0.09%) ⬆️
augur/filter.py 71.02% <0.00%> (+0.64%) ⬆️
augur/align.py 90.16% <0.00%> (+1.17%) ⬆️
augur/titer_model.py 23.48% <0.00%> (+3.51%) ⬆️
augur/utils.py 50.00% <0.00%> (+6.16%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4b71e7d...45df6bd. Read the comment docs.

Adds a functional test for how we'd like augur export v2 to behave when
the user provides metadata without a valid id and then updates the
try/except block in export_v2 to catch and print the error from
`read_metadata` when it occurs.

While we're at it, this commit also prints the error for missing
metadata file to stderr instead of stdout.
@huddlej huddlej marked this pull request as ready for review May 4, 2022 23:13
Copy link
Member

@jameshadfield jameshadfield left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

For our internal code it seems clear that we should move to the augur.io function. Currently export v1, refine, frequencies, titers and traits still use read_metadata. Each of them look simple to update using a similar pattern to what you've used here, e.g.:

# untested
metadata_df = read_metadata(args.metadata)
columns = metadata_df.columns
meta_tsv = metadata_df.to_dict(orient="index")
for strain in meta_tsv.keys():
    meta_tsv[strain]["strain"] = strain

We could then add a deprecation warning to the util function. In a future major release we can then remove:

  • augur.utils -> read_metadata
  • all of augur.util_support.metadata_file
  • tests/util_support/test_metadata_file.py

If this seems sensible we can add the above to a new issue to tackle.

sys.exit(2)
except Exception as error:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd add this to a list (or somesuch) as this shouldn't be caught once #903 is implemented. Maybe tagging the issue here is enough as it'll now show up on that issue...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I exercised restraint by not just coding up #903 in this PR 😄 My issue with the implementation here is that the exception raised by read_metadata is an "expected" exception, but it isn't specific enough to catch by itself here without catching all other exceptions. We should really raise a custom IOException or similar instead.

@huddlej huddlej merged commit c610cf2 into master May 5, 2022
@huddlej huddlej deleted the fix-metadata-in-export branch May 5, 2022 21:31
@huddlej huddlej added this to the Patch release 15.0.2 milestone May 5, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

ENH: Provide helpful error message when metadata file doesn't contain "strain" column
2 participants