read_metadata delimiter detection is fragile #574
Labels
bug
Something isn't working
easy problem
Requires less work than most issues
good first issue
A relatively isolated issue appropriate for first-time contributors
priority: moderate
To be resolved after high priority issues
Current Behavior
read_metadata tries to load a given file as tab-delimited data only if the file extension is .tsv. Any other extensions get parsed as comma-separated values.
Expected behavior
read_metadata should provide a more user-friendly interface by detecting the metadata delimiter more intelligently. Users should be able to provide a tab-delimited metadata file like metadata.txt and have this data loaded properly. Similarly, the .tab extension is a common one for tab-delimited data, but files with this extension currently get parsed as CSV files.
How to reproduce
Steps to reproduce the current behavior:
1. Create a tab-delimited metadata file named metadata.txt
2. Run augur filter --sequences sequences.fasta --metadata metadata.txt --min-date 2010-01-01 --output filtered.fasta
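The underlying misparse can be illustrated with pandas alone. This is a minimal sketch, not augur's actual code: the sample columns (strain, date) and their values are made up for illustration, and the in-memory string stands in for metadata.txt.

```python
import io

import pandas as pd

# Hypothetical tab-delimited metadata, standing in for the contents of metadata.txt.
tsv_text = "strain\tdate\nA/1\t2009-05-01\nB/2\t2011-03-15\n"

# Parse with a comma separator, as described above for any extension other than .tsv.
as_csv = pd.read_csv(io.StringIO(tsv_text), sep=",")

# Parse the same text with an explicit tab separator for comparison.
as_tsv = pd.read_csv(io.StringIO(tsv_text), sep="\t")

# The comma parse collapses every row into one column whose name still contains the tab.
print(list(as_csv.columns))
# The tab parse recovers the two intended columns.
print(list(as_tsv.columns))
```

With the comma separator there is no separate date column, so any downstream step that looks up a column by name, such as a date filter, has nothing to work with.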
Possible solution
Consider using pandas's built-in delimiter detection by setting sep=None and engine="python" when reading in metadata. Remove extension-based attempts to infer delimiters.
Using the python engine to load data is slightly slower than the C engine, but the user interface improvement is worth the performance cost.
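A minimal sketch of this suggestion follows. The helper name read_metadata_sketch and the sample data are hypothetical and only illustrate the sep=None, engine="python" combination; they are not augur's implementation.

```python
import io

import pandas as pd


def read_metadata_sketch(handle):
    """Hypothetical helper: let pandas infer the delimiter instead of
    relying on the file extension."""
    # sep=None asks pandas to sniff the delimiter; this requires the python engine.
    return pd.read_csv(handle, sep=None, engine="python")


# Made-up sample data in both formats, with identical content.
tab_text = "strain\tdate\nA/1\t2009-05-01\nB/2\t2011-03-15\n"
comma_text = "strain,date\nA/1,2009-05-01\nB/2,2011-03-15\n"

from_tab = read_metadata_sketch(io.StringIO(tab_text))
from_comma = read_metadata_sketch(io.StringIO(comma_text))

# Both inputs should yield the same two columns regardless of delimiter.
print(list(from_tab.columns))
print(list(from_comma.columns))
```

According to the pandas documentation, sep=None makes the python engine detect the separator with Python's csv.Sniffer, using only the first valid row of the file, so an unusual header line could still be misdetected.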
Additional context
This issue was originally raised on the Nextstrain discussion board.