Use io.read_metadata during export #909

Merged 4 commits on May 5, 2022
14 changes: 11 additions & 3 deletions augur/export_v2.py
@@ -9,7 +9,9 @@
 import numbers
 import re
 from Bio import Phylo
-from .utils import read_metadata, read_node_data, write_json, read_config, read_lat_longs, read_colors
+
+from .io import read_metadata
+from .utils import read_node_data, write_json, read_config, read_lat_longs, read_colors
 from .validate import export_v2 as validate_v2, auspice_config_v2 as validate_auspice_config_v2, ValidateError
 
 # Set up warnings & exceptions
@@ -992,10 +994,16 @@ def run_v2(args):
 
     if args.metadata is not None:
         try:
-            metadata_file, _ = read_metadata(args.metadata)
+            metadata_file = read_metadata(args.metadata).to_dict(orient="index")
+            for strain in metadata_file.keys():
+                if "strain" not in metadata_file[strain]:
+                    metadata_file[strain]["strain"] = strain
         except FileNotFoundError:
-            print(f"ERROR: meta data file ({args.metadata}) does not exist")
+            print(f"ERROR: meta data file ({args.metadata}) does not exist", file=sys.stderr)
             sys.exit(2)
+        except Exception as error:
Review comment (Member):

I'd add this to a list (or somesuch) as this shouldn't be caught once #903 is implemented. Maybe tagging the issue here is enough as it'll now show up on that issue...

Review comment (Contributor Author):

Yeah, I exercised restraint by not just coding up #903 in this PR 😄 My issue with the implementation here is that the exception raised by read_metadata is an "expected" exception, but it isn't specific enough to catch by itself here without catching all other exceptions. We should really raise a custom IOException or similar instead.
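
The suggestion in this thread could look roughly like the sketch below. It is only an illustration, not Augur's code: `MetadataReadError`, `load_metadata`, and `main` are hypothetical names introduced here, and `load_metadata` is a deliberately simplified stand-in for `read_metadata`. The point is that the reader raises a dedicated exception type for expected failures, so the caller can report those without a blanket `except Exception`.

```python
import sys


class MetadataReadError(Exception):
    """Hypothetical exception type for expected, user-facing metadata problems."""


def load_metadata(columns):
    """Hypothetical stand-in for read_metadata: only checks for a "strain" column."""
    if "strain" not in columns:
        raise MetadataReadError(f"no id column found in {tuple(columns)!r}")
    return "strain"


def main(columns):
    """Return an exit code instead of calling sys.exit, to keep the sketch testable."""
    try:
        load_metadata(columns)
    except MetadataReadError as error:
        # Only the expected failure is turned into a plain error message;
        # any other exception still propagates with its traceback.
        print(f"ERROR: {error}", file=sys.stderr)
        return 1
    return 0
```

With this shape, a programming error such as passing `None` for `columns` surfaces as a `TypeError` rather than being reported as a metadata problem.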

+            print(f"ERROR: {error}", file=sys.stderr)
+            sys.exit(1)
     else:
         metadata_file = {}
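
The added loop is needed because `DataFrame.to_dict(orient="index")` keys each record by its index value and leaves the index column out of the record itself. A small sketch of that behavior, assuming a DataFrame already indexed by its id column (which is what the diff implies `read_metadata` returns); the inline data is made up for illustration:

```python
import io

import pandas as pd

# Made-up metadata that uses "name" rather than "strain" as its id column.
tsv = "name\tdiv\tmutation_length\ntipA\t1\t1\ntipB\t3\t1\n"
metadata = pd.read_csv(io.StringIO(tsv), sep="\t", index_col="name")

# orient="index" yields {index value: {column: value}}; the index column
# itself is not part of the inner dictionaries.
records = metadata.to_dict(orient="index")

# Mirror the loop from the diff: make sure every record carries a "strain" key.
for strain in records:
    if "strain" not in records[strain]:
        records[strain]["strain"] = strain
```

Without the loop, downstream code that looks up `record["strain"]` would fail for metadata whose id column has a different name.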

6 changes: 4 additions & 2 deletions augur/io.py
@@ -91,11 +91,13 @@ def read_metadata(metadata_file, id_columns=("strain", "name"), chunk_size=None)
         kwargs["chunksize"] = chunk_size
 
     # Inspect the first chunk of the metadata, to find any valid index columns.
-    chunk = pd.read_csv(
+    metadata = pd.read_csv(
         metadata_file,
         iterator=True,
         **kwargs,
-    ).read(nrows=1)
+    )
+    chunk = metadata.read(nrows=1)
+    metadata.close()
 
     id_columns_present = [
         id_column
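
The change in this hunk keeps a handle on the reader so that it can be closed after peeking at the first row. A minimal sketch of the same pattern on an in-memory file follows; the sample data is invented, and the `try`/`finally` is an addition of this sketch rather than something the diff does. Only `pd.read_csv(..., iterator=True)`, `read(nrows=...)`, and `close()` are taken from the diff.

```python
import io

import pandas as pd

tsv = "strain\tdiv\ntipA\t1\ntipB\t3\n"

# With iterator=True, read_csv returns a reader object instead of a DataFrame.
reader = pd.read_csv(io.StringIO(tsv), sep="\t", iterator=True)
try:
    # Read only the first row, which is enough to inspect the column names.
    chunk = reader.read(nrows=1)
finally:
    # Release the reader's resources even if the read fails.
    reader.close()

columns = list(chunk.columns)
```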
46 changes: 46 additions & 0 deletions tests/functional/export_v2.t
@@ -65,3 +65,49 @@ Export with auspice config JSON with an extensions block
   $ python3 "$TESTDIR/../../scripts/diff_jsons.py" export_v2/dataset2.json "$TMP/dataset3.json" \
   > --exclude-paths "root['meta']['updated']"
   {}

+Run export with metadata using the default id column of "strain".
+
+  $ ${AUGUR} export v2 \
+  > --tree export_v2/tree.nwk \
+  > --metadata export_v2/dataset1_metadata_with_strain.tsv \
+  > --node-data export_v2/div_node-data.json export_v2/location_node-data.json \
+  > --auspice-config export_v2/auspice_config1.json \
+  > --maintainers "Nextstrain Team" \
+  > --output "$TMP/dataset1.json" > /dev/null
+
+  $ python3 "$TESTDIR/../../scripts/diff_jsons.py" export_v2/dataset1.json "$TMP/dataset1.json" \
+  > --exclude-paths "root['meta']['updated']" "root['meta']['maintainers']"
+  {}
+  $ rm -f "$TMP/dataset1.json"

+Run export with metadata that uses a different id column other than "strain".
+In this case, the column is "name" (one of the default columns expected by Augur's `io.read_metadata` function).
+
+  $ ${AUGUR} export v2 \
+  > --tree export_v2/tree.nwk \
+  > --metadata export_v2/dataset1_metadata_with_name.tsv \
+  > --node-data export_v2/div_node-data.json export_v2/location_node-data.json \
+  > --auspice-config export_v2/auspice_config1.json \
+  > --maintainers "Nextstrain Team" \
+  > --output "$TMP/dataset1.json" > /dev/null
+
+  $ python3 "$TESTDIR/../../scripts/diff_jsons.py" export_v2/dataset1.json "$TMP/dataset1.json" \
+  > --exclude-paths "root['meta']['updated']" "root['meta']['maintainers']"
+  {}
+  $ rm -f "$TMP/dataset1.json"

+Run export with metadata that uses an invalid id column.
+This should fail with a helpful error message.
+
+  $ ${AUGUR} export v2 \
+  > --tree export_v2/tree.nwk \
+  > --metadata export_v2/dataset1_metadata_without_valid_id.tsv \
+  > --node-data export_v2/div_node-data.json export_v2/location_node-data.json \
+  > --auspice-config export_v2/auspice_config1.json \
+  > --maintainers "Nextstrain Team" \
+  > --output "$TMP/dataset1.json" > /dev/null
+  ERROR: None of the possible id columns (('strain', 'name')) were found in the metadata's columns ('invalid_id', 'div', 'mutation_length')
+  [1]
+
+  $ popd > /dev/null
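
The three cases exercised by these tests can be mimicked with a small standalone sketch. `pick_id_column` is a hypothetical helper written for this illustration; it is not Augur's `read_metadata`, and the error wording simply copies the expected test output above.

```python
import csv
import io


def pick_id_column(tsv_text, id_columns=("strain", "name")):
    """Return the first candidate id column present in the TSV header."""
    header = next(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
    for id_column in id_columns:
        if id_column in header:
            return id_column
    # No candidate matched: fail with a message that lists both the
    # candidates and the columns that were actually found.
    raise ValueError(
        f"None of the possible id columns ({id_columns!r}) were found "
        f"in the metadata's columns {tuple(header)!r}"
    )
```

Candidates are tried in order, so a file containing both `strain` and `name` columns would be keyed by `strain` in this sketch.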
7 changes: 7 additions & 0 deletions tests/functional/export_v2/dataset1_metadata_with_name.tsv
@@ -0,0 +1,7 @@
+name	div	mutation_length
+tipA	1	1
+tipB	3	1
+tipC	3	1
+tipD	8	3
+tipE	9	4
+tipF	6	1
7 changes: 7 additions & 0 deletions tests/functional/export_v2/dataset1_metadata_with_strain.tsv
@@ -0,0 +1,7 @@
+strain	div	mutation_length
+tipA	1	1
+tipB	3	1
+tipC	3	1
+tipD	8	3
+tipE	9	4
+tipF	6	1
7 changes: 7 additions & 0 deletions tests/functional/export_v2/dataset1_metadata_without_valid_id.tsv
@@ -0,0 +1,7 @@
+invalid_id	div	mutation_length
+tipA	1	1
+tipB	3	1
+tipC	3	1
+tipD	8	3
+tipE	9	4
+tipF	6	1