read_metadata delimiter detection is fragile #574

Closed
huddlej opened this issue Jun 26, 2020 · 2 comments · Fixed by #587
Labels
bug (Something isn't working), easy problem (Requires less work than most issues), good first issue (A relatively isolated issue appropriate for first-time contributors), priority: moderate (To be resolved after high priority issues)

Comments


huddlej commented Jun 26, 2020

Current Behavior

read_metadata tries to load a given file as tab-delimited data only if the file extension is .tsv. Any other extensions get parsed as comma-separated values.
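
To make the rule concrete, here is a minimal sketch of the extension-based choice as described above. It is not copied from the augur source; the helper name `delimiter_from_extension` is hypothetical and only mirrors the behavior stated in this report.

```python
import os


def delimiter_from_extension(path):
    """Hypothetical helper: pick a delimiter from the file extension alone.

    Mirrors the behavior described in this issue, not augur's actual code.
    """
    _, ext = os.path.splitext(path)
    # Only ".tsv" is treated as tab-delimited; everything else falls back to a comma.
    return "\t" if ext == ".tsv" else ","


tsv_delim = delimiter_from_extension("metadata.tsv")
txt_delim = delimiter_from_extension("metadata.txt")
tab_delim = delimiter_from_extension("metadata.tab")
```

Under this rule, `metadata.txt` and `metadata.tab` both get a comma, regardless of what the file actually contains.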

Expected behavior

read_metadata should provide a more user-friendly interface by detecting the metadata delimiter more intelligently. Users should be able to provide a tab-delimited metadata file like metadata.txt and have this data loaded properly. Similarly, .tab is a common extension for tab-delimited data, but files with this extension are currently parsed as CSV files.

How to reproduce

Steps to reproduce the current behavior:

  1. Create a tab-delimited metadata file named metadata.txt
  2. Run any augur command that loads metadata with this file (e.g., augur filter --sequences sequences.fasta --metadata metadata.txt --min-date 2010-01-01 --output filtered.fasta)
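
The symptom can also be seen in isolation with pandas, without running augur. The sketch below assumes the metadata ends up in `pandas.read_csv` with `sep=","` for a `.txt` file, as this report describes; the sample rows and column names are made up for illustration.

```python
import io

import pandas as pd

# Made-up tab-delimited metadata, standing in for the contents of metadata.txt.
tab_text = "strain\tdate\tregion\nA/1\t2010-01-01\tasia\n"

# Parsing tab-delimited text as CSV: no commas are found, so each line
# becomes a single field and the header collapses into one column name.
parsed_as_csv = pd.read_csv(io.StringIO(tab_text), sep=",")

# Parsing with the correct delimiter yields the three expected columns.
parsed_as_tsv = pd.read_csv(io.StringIO(tab_text), sep="\t")

print(list(parsed_as_csv.columns))
print(list(parsed_as_tsv.columns))
```

No exception is raised in the mis-parsed case, which is why the failure only surfaces later when a column such as the strain or date field cannot be found.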

Possible solution

Consider using pandas's built-in delimiter detection by setting sep=None and engine="python" when reading in metadata. Remove the extension-based attempts to infer delimiters.

Using the python engine to load data is slightly slower than the C engine, but the user interface improvement is worth the performance cost.
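
A minimal sketch of the suggestion above, assuming nothing about how augur wires it in: the wrapper name `read_table_autodetect` and the sample data are hypothetical, while `sep=None` and `engine="python"` are the pandas options named in this issue.

```python
import io

import pandas as pd


def read_table_autodetect(handle):
    """Hypothetical wrapper: let pandas sniff the delimiter from the data."""
    # sep=None asks pandas to detect the separator; this is only supported
    # by the Python parsing engine, not the faster C engine.
    return pd.read_csv(handle, sep=None, engine="python")


# The same made-up records, once tab-delimited and once comma-delimited.
tab_text = "strain\tdate\tregion\nA/1\t2010-01-01\tasia\nB/2\t2011-05-03\teurope\n"
comma_text = tab_text.replace("\t", ",")

tab_df = read_table_autodetect(io.StringIO(tab_text))
comma_df = read_table_autodetect(io.StringIO(comma_text))

print(list(tab_df.columns))
print(tab_df.equals(comma_df))
```

Because detection is based on the file contents, the file extension no longer matters, so metadata.txt and metadata.tab would load the same way as metadata.tsv.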

Additional context

This issue was originally raised on the Nextstrain discussion board.


swarris commented Jun 29, 2020

It took me about a day and an inspection of the source code to find out why it would not load my tab-delimited file... Apparently because the code assumed a comma, not a tab... without informing me. Also, the documentation makes no mention of this. Please remove this separator switch, or at least inform the user about it.


huddlej commented Jul 7, 2020

We're sorry about the headaches caused by the original delimiter detection, @swarris! This issue should now be resolved in the master branch and will be part of our next release (v 10.0.0). If you find that these changes don't fix the problem for your data, you can reopen this issue and we'll follow up with you.
