read_metadata delimiter detection is fragile #574

huddlej · 2020-06-26T18:06:24Z

Current Behavior

read_metadata tries to load a given file as tab-delimited data only if the file extension is .tsv. Any other extensions get parsed as comma-separated values.
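To make the failure mode concrete, here is a minimal sketch of extension-based delimiter selection as described above. It is not augur's actual code; the function name `delimiter_from_extension` is hypothetical and only mirrors the behavior stated in this report.

```python
import os


def delimiter_from_extension(path):
    """Hypothetical helper: pick a delimiter from the file extension alone.

    Mirrors the behavior described in this issue: only ".tsv" is treated
    as tab-delimited, and every other extension falls back to a comma.
    """
    extension = os.path.splitext(path)[1]
    return "\t" if extension == ".tsv" else ","
```

Under this assumption, tab-delimited files named metadata.txt or metadata.tab are read with a comma delimiter, so each row ends up in a single column.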

Expected behavior

read_metadata should provide a more user-friendly interface by detecting the metadata delimiter from the file contents rather than from the file name. Users should be able to provide a tab-delimited metadata file like metadata.txt and have this data loaded properly. Similarly, .tab is a common extension for tab-delimited data, but files with this extension are currently parsed as CSV.

How to reproduce

Steps to reproduce the current behavior:

1. Create a tab-delimited metadata file named metadata.txt
2. Run any augur command that loads metadata with this file (e.g., augur filter --sequences sequences.fasta --metadata metadata.txt --min-date 2010-01-01 --output filtered.fasta)

Possible solution

Consider using pandas's built-in delimiter detection by setting sep=None and engine="python" when reading in metadata. Remove the extension-based attempts to infer delimiters.

Using the python engine to load data is slightly slower than the C engine, but the user interface improvement is worth the performance cost.
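A minimal sketch of the proposed approach, assuming pandas is installed. The function name `load_metadata` is hypothetical and is not part of augur; only the `sep=None` and `engine="python"` arguments come from the suggestion above.

```python
import io

import pandas as pd


def load_metadata(path_or_buffer):
    """Hypothetical loader that lets pandas detect the delimiter.

    With sep=None, pandas cannot use the C engine, so engine="python"
    is passed explicitly; the Python engine then sniffs the delimiter
    from the file contents instead of relying on the file extension.
    """
    return pd.read_csv(path_or_buffer, sep=None, engine="python")


# The same call handles tab- and comma-delimited input.
tsv_text = "strain\tdate\tregion\nA/1\t2010-01-01\tEurope\nB/2\t2011-05-03\tAsia\n"
csv_text = "strain,date,region\nA/1,2010-01-01,Europe\nB/2,2011-05-03,Asia\n"

tsv_metadata = load_metadata(io.StringIO(tsv_text))
csv_metadata = load_metadata(io.StringIO(csv_text))
```

Both data frames should end up with the same three columns, regardless of which delimiter the input used.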

Additional context

This issue was originally raised on the Nextstrain discussion board.

swarris · 2020-06-29T13:12:30Z

It took me about a day and an inspection of the source code to find out why it would not load my tab-delimited file... Apparently because the code assumed a comma, not a tab, without informing me. Also, the documentation makes no mention of this. Please remove this separator switch, or at least inform the user about it.

huddlej · 2020-07-07T17:30:17Z

We're sorry about the headaches caused by the original delimiter detection, @swarris! This issue should now be resolved in the master branch and will be part of our next release (v 10.0.0). If you find that these changes don't fix the problem for your data, you can reopen this issue and we'll follow up with you.

huddlej added the bug (Something isn't working) label Jun 26, 2020

huddlej added the easy problem (Requires less work than most issues), good first issue (A relatively isolated issue appropriate for first-time contributors), and priority: moderate (To be resolved after high priority issues) labels Jun 26, 2020

This was referenced Jun 30, 2020

Break out utils.read_metadata to new class #584

Merged

Metadata delimiter autodetection #587

Merged

huddlej closed this as completed in #587 Jul 7, 2020

huddlej mentioned this issue Feb 2, 2022

io: Parse metadata with C engine, restrict to either CSV or TSV #812

Merged
