-
Notifications
You must be signed in to change notification settings - Fork 129
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve curate metadata parser #1110
Conversation
Add an additional "authors" field with comma separated values to the testing metadata TSV file, showing how this causes the csv sniffer to fail to identify the delimiter for the file.
Only use the first line or header of the CSV/TSV file to determine the delimiter of the file. This prevents csv.Sniffer from failing to determine the delimiter when the data values include commas or tabs.
Codecov ReportBase: 63.37% // Head: 63.37% // No change to project coverage 👍
Additional details and impacted files@@ Coverage Diff @@
## master #1110 +/- ##
=======================================
Coverage 63.37% 63.37%
=======================================
Files 57 57
Lines 6638 6638
Branches 1632 1632
=======================================
Hits 4207 4207
Misses 2147 2147
Partials 284 284
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. ☔ View full report at Codecov. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This LGTM. Dare I ask what happens if some of your TSV column names contain commas?
@@ -138,7 +138,7 @@ def read_table_to_dict(table, duplicate_reporting=DataErrorMethod.ERROR_FIRST, i | |||
duplicate_ids = set() | |||
with open_file(table) as handle: | |||
# Get sample to determine delimiter | |||
table_sample = handle.read(1024) | |||
table_sample = handle.readline() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh and also, I don't think it's necessary since we don't limit reads later in this function, but we could preserve the 1KiB read limit with:
… = handle.readline(1024)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I didn't think it would be necessary to add a size limit here since we read full records later anyways.
I dare not 😅 But if this actually becomes an issue, maybe we should add a |
In the end I dared… and it goes about as well as you expect (not well). So yeah, we'll need to either make the sniffer smarter (look at proportions of one delimiter to another) or allow manual override (probably best). |
Description of proposed changes
Improves the curate metadata parser by only using the first line or header of the CSV/TSV file to determine the delimiter of the file.
Prior to this change, the csv.Sniffer would fail when the data values in a TSV file include commas, such as when a metadata TSV file includes a column of comma separated author names.
Testing
Updated the curate metadata-input test to include an "authors" field in the testing TSV file.
Checklist