ENH: Provide helpful error message when metadata file doesn't contain "strain" column #905

Closed
corneliusroemer opened this issue Apr 26, 2022 · 2 comments · Fixed by #909
Labels
documentation: Improvements or additions to documentation
easy problem: Requires less work than most issues
enhancement: New feature or request
good first issue: A relatively isolated issue appropriate for first-time contributors
help wanted

Comments

@corneliusroemer
Member

A lot of users seem to get the following type of error:

Job 3: Exporting data files for for auspice


        augur export v2             --tree results/global/tree.nwk             --metadata data/metadata.tsv
    --node-data results/global/branch_lengths.json results/global/nt_muts.json results/global/aa_muts.json results/global/subclades.json results/global/clades.json results/global/recency.json results/global/traits.json             --auspice-config my_profiles/covid/my_auspice_config.json             --include-root-sequence             --colors results/global/colors.tsv             --lat-longs defaults/lat_longs.tsv             --title 'Genomic epidemiology of novel coronavirus - Global subsampling'             --description my_profiles/covid/my_description.md             --output results/global/ncov_with_accessions.json 2>&1 | tee logs/export_global.txt

    Validating schema of 'results/global/aa_muts.json'...
    Traceback (most recent call last):
      File "/home/charbel/miniconda3/envs/nextstrain/bin/augur", line 10, in <module>
        sys.exit(main())
      File "/home/charbel/miniconda3/envs/nextstrain/lib/python3.8/site-packages/augur/__main__.py", line 10, in main
        return augur.run( argv[1:] )
      File "/home/charbel/miniconda3/envs/nextstrain/lib/python3.8/site-packages/augur/__init__.py", line 75, in run
        return args.__command__.run(args)
      File "/home/charbel/miniconda3/envs/nextstrain/lib/python3.8/site-packages/augur/export.py", line 22, in run
        return run_v2(args)
      File "/home/charbel/miniconda3/envs/nextstrain/lib/python3.8/site-packages/augur/export_v2.py", line 903, in run_v2
        node_data, node_attrs, node_data_names, metadata_names = parse_node_data_and_metadata(T, args.node_data, args.metadata)
      File "/home/charbel/miniconda3/envs/nextstrain/lib/python3.8/site-packages/augur/export_v2.py", line 863, in parse_node_data_and_metadata
        if node["strain"] in node_attrs: # i.e. this node name is in the tree
    KeyError: 'strain'

https://discussion.nextstrain.org/t/error-in-job-3-exporting-data-files-for-for-auspice/493/4

It's a common discussion topic on our forum and also in the emails we get at hello@nextstrain.org.

I think it would help users a lot if we raised a more informative error so that users know directly how to fix it.

Also, we don't seem to have documented the requirement that the metadata must contain a column called `strain` holding the strain names.

Both should be addressed.
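As a rough illustration of the kind of message that could replace the bare `KeyError`, here is a minimal sketch using only the standard library. The names `MissingStrainColumnError` and `check_strain_column` are hypothetical and are not part of augur; the wording of the message is only a suggestion.

```python
import csv
import io


class MissingStrainColumnError(ValueError):
    """Hypothetical error type for a metadata table without the id column."""


def check_strain_column(fieldnames, required="strain", source="metadata"):
    """Fail early with an actionable message instead of a bare KeyError."""
    columns = list(fieldnames or [])
    if required not in columns:
        raise MissingStrainColumnError(
            f"{source} has no column named {required!r}. "
            f"Columns found: {', '.join(columns) or '(none)'}. "
            f"Add or rename a column so that it is called {required!r}."
        )


# Example: an in-memory TSV whose id column is called `name`, not `strain`.
sample = io.StringIO("name\tcountry\nA/1\tLebanon\n")
reader = csv.DictReader(sample, delimiter="\t")
try:
    check_strain_column(reader.fieldnames, source="data/metadata.tsv")
    message = None
except MissingStrainColumnError as error:
    message = str(error)
print(message)
```

Listing the columns that were actually found should make it easier for users to spot a misnamed or misdelimited header on their own.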

@corneliusroemer added the documentation, easy problem, enhancement, good first issue, and help wanted labels Apr 26, 2022
@corneliusroemer
Member Author

Interestingly, when reading in a metadata file we seem to accept either `name` or `strain`, but in export we suddenly no longer accept `name`. That's strange.

Should we remove support for `name`, or make export accept `name` to be in line with `metadata_file.py`? See:

    class MetadataFile:
        """
        Represents a CSV or TSV file containing metadata
        The file must contain exactly one of a column named `strain` or `name`,
        which is used to match metadata with samples.
        """

@huddlej
Contributor

huddlej commented Apr 26, 2022

We do support searching for multiple arbitrary strain ids when reading in metadata with the `read_metadata` function in the `io` module. This function returns a data frame indexed by the first requested id column that exists in the input. As a result, the calling code can consume the data frame without needing to know what the name of the id column is.
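The "first requested id column that exists" behaviour can be sketched without pandas as follows. This is not augur's implementation: `read_metadata_by_id` is a hypothetical stand-in that returns a plain dict instead of a data frame, and the default `id_columns=("strain", "name")` is an assumption made for the example.

```python
import csv
import io


def read_metadata_by_id(handle, id_columns=("strain", "name"), delimiter="\t"):
    """Key each row by the first column in `id_columns` present in the header.

    The id column is removed from the row values, so callers never need to
    know what it was called in the file.
    """
    reader = csv.DictReader(handle, delimiter=delimiter)
    columns = reader.fieldnames or []
    id_column = next((c for c in id_columns if c in columns), None)
    if id_column is None:
        raise ValueError(
            f"None of the id columns {id_columns!r} were found. "
            f"Columns found: {columns!r}"
        )
    records = {}
    for row in reader:
        key = row.pop(id_column)  # use the id as the key, not as a value
        records[key] = row
    return records


# Example: the file uses `name`, yet the caller only deals with keys.
tsv = "name\tcountry\nA/1\tLebanon\nB/2\tChile\n"
records = read_metadata_by_id(io.StringIO(tsv))
print(records)
```

If both `strain` and `name` are present, the order of `id_columns` decides which one becomes the key.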

An alternate solution to #906 is to use `io.read_metadata` in the export module instead of the current call to `utils.read_metadata`. We could cast the data frame to a dict to avoid changing other code in the module, or we could update the logic in `parse_node_data_and_metadata` to use the data frame. We should really deprecate the `utils.read_metadata` function anyway, since `io.read_metadata` was written to replace it eventually.
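To illustrate the "cast the data frame to a dict" option, here is a small pure-Python sketch of the reshaping step. It assumes the indexed table has already been turned into a mapping of id to row values (with pandas, `DataFrame.to_dict(orient="index")` produces that shape). The helper `to_legacy_records` is hypothetical, and always exposing the id under the key `strain` is an assumption made for this example, chosen because the traceback above fails on `node["strain"]`.

```python
def to_legacy_records(indexed, id_column="strain"):
    """Re-insert the id into each row so lookups like row["strain"] work.

    `indexed` maps id -> {column: value} and does not contain the id column.
    """
    return {key: {id_column: key, **row} for key, row in indexed.items()}


# Example: rows keyed by id, as an index-oriented dict would look.
indexed = {
    "A/1": {"country": "Lebanon"},
    "B/2": {"country": "Chile"},
}
legacy = to_legacy_records(indexed)
print(legacy["A/1"]["strain"])
```

Because the input rows are copied rather than modified, the indexed mapping stays usable by any code that has already moved to the index-based form.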

huddlej added a commit that referenced this issue Apr 28, 2022
Replaces a call to the older `utils.read_metadata` function with the
newer `io.read_metadata` function while processing metadata for export
to an Auspice JSON. This new function returns a pandas DataFrame indexed
by the first viable strain name column found in the metadata
file (removing this column from the data itself), while the original
function returns a dictionary indexed by strain name (keeping the
original named column like `strain` or `name` in the data). To avoid
changing the downstream code that consumes the metadata, this commit
converts the pandas DataFrame to a dictionary that matches the output of
the original function. The main advantage here is that the calling code
does not need to know what the id column is named, since
`io.read_metadata` handles this and indexes the data frame by that
column.

This commit also adds functional tests for the expected behavior of
export v2 with metadata inputs.

Fixes #905