Improve curate metadata parser #1110

joverlee521 · 2022-12-10T03:05:15Z

Description of proposed changes

Improves the curate metadata parser by only using the first line or header of the CSV/TSV file to determine the delimiter of the file.

Prior to this change, csv.Sniffer would fail when the data values in a TSV file included commas, such as when a metadata TSV file includes a column of comma-separated author names.
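As a rough illustration of the approach (not the actual augur code), the sketch below sniffs only the header line and then parses the whole file with the detected delimiter. The helper name `sniff_delimiter_from_header`, the sample rows, and the restriction to comma and tab as candidate delimiters are assumptions made for this example.

```python
import csv
import io


def sniff_delimiter_from_header(handle, valid_delimiters=",\t"):
    """Guess the delimiter from the first line only, then rewind the handle.

    Hypothetical helper for illustration; assumes a seekable text handle.
    """
    header = handle.readline()
    handle.seek(0)
    # Restrict the sniffer to the candidate delimiters (an assumption here).
    dialect = csv.Sniffer().sniff(header, delimiters=valid_delimiters)
    return dialect.delimiter


# Made-up TSV whose "authors" column contains commas in the data values.
tsv = io.StringIO(
    "strain\tdate\tauthors\n"
    "A/1\t2020-01-01\tSmith, Jones, Lee\n"
    "B/2\t2020-01-02\tDoe\n"
)

delimiter = sniff_delimiter_from_header(tsv)
rows = list(csv.DictReader(tsv, delimiter=delimiter))
print(repr(delimiter), rows[0]["authors"])
```

Because the sniffer never sees the data rows, commas inside the author lists cannot influence the guess.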

Testing

Updated the curate metadata-input test to include an "authors" field in the testing TSV file.

Checklist

Add a message in CHANGES.md summarizing the changes in this PR that are end user focused. Keep headers and formatting consistent with the rest of the file.


codecov · 2022-12-10T03:17:24Z

Codecov Report

Base: 63.37% // Head: 63.37% // No change to project coverage 👍

Coverage data is based on head (0f740f3) compared to base (029c5bd).
Patch coverage: 100.00% of modified lines in pull request are covered.

Additional details and impacted files

@@           Coverage Diff           @@
##           master    #1110   +/-   ##
=======================================
  Coverage   63.37%   63.37%           
=======================================
  Files          57       57           
  Lines        6638     6638           
  Branches     1632     1632           
=======================================
  Hits         4207     4207           
  Misses       2147     2147           
  Partials      284      284

Impacted Files	Coverage Δ
augur/io/metadata.py	`96.02% <100.00%> (ø)`


tsibley

This LGTM. Dare I ask what happens if some of your TSV column names contain commas?

tsibley · 2022-12-12T20:40:19Z

augur/io/metadata.py

@@ -138,7 +138,7 @@ def read_table_to_dict(table, duplicate_reporting=DataErrorMethod.ERROR_FIRST, i
    duplicate_ids = set()
    with open_file(table) as handle:
        # Get sample to determine delimiter
-        table_sample = handle.read(1024)
+        table_sample = handle.readline()


Oh and also, I don't think it's necessary since we don't limit reads later in this function, but we could preserve the 1KiB read limit with:

… = handle.readline(1024)

Yeah, I didn't think it would be necessary to add a size limit here since we read full records later anyways.
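To make the trade-off concrete, here is a minimal sketch of how the optional size argument to `readline` caps the sample. It uses an in-memory `io.StringIO` as a stand-in for the real file handle, which is an assumption; the made-up header is deliberately very long.

```python
import io

# Stand-in for a file whose first line is far longer than 1 KiB.
handle = io.StringIO("x" * 5000 + "\n" + "second line\n")

capped = handle.readline(1024)  # returns at most 1024 characters
handle.seek(0)
full = handle.readline()        # returns the entire first line, newline included

print(len(capped), len(full))
```

With the cap, a very wide header is truncated before sniffing; without it, the whole header is read, which matches how full records are read later anyway.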

joverlee521 · 2022-12-12T20:55:51Z

Dare I ask what happens if some of your TSV column names contain commas?

I dare not 😅 But if this actually becomes an issue, maybe we should add a --metadata-format option to allow users to override the csv.Sniffer.

tsibley · 2022-12-12T23:06:14Z

In the end I dared… and it goes about as well as you expect (not well). So yeah, we'll need to either make the sniffer smarter (look at proportions of one delimiter to another) or allow manual override (probably best).
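A minimal sketch of the two options mentioned above, under stated assumptions: `guess_delimiter` is a hypothetical helper, not part of augur or the `csv` module, and the `override` argument only stands in for whatever a manual option such as the proposed `--metadata-format` might eventually pass along.

```python
def guess_delimiter(header, override=None, candidates=("\t", ",")):
    """Pick a delimiter for a header line (hypothetical helper).

    A manual override always wins. Otherwise the candidate that occurs
    most often in the header is chosen; ties go to the earlier candidate.
    """
    if override is not None:
        return override
    counts = {d: header.count(d) for d in candidates}
    best = max(candidates, key=lambda d: counts[d])
    if counts[best] == 0:
        raise ValueError("no candidate delimiter found in header")
    return best


# A TSV header in which one column name itself contains a comma.
header = "strain\tdate\tlocation, country\tauthors"
print(repr(guess_delimiter(header)))                 # tabs outnumber commas
print(repr(guess_delimiter(header, override=",")))   # manual override wins
```

The proportion heuristic can still guess wrong when a header has more commas inside column names than tabs between them, which is why a manual override is the safer fallback.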

joverlee521 added 2 commits December 9, 2022 18:50

tests: Failing test for comma separated data in curate metadata TSV

da845f4

Add an additional "authors" field with comma separated values to the testing metadata TSV file, showing how this causes the csv sniffer to fail to identify the delimiter for the file.

read_table_to_dict: only sample the first line

04f1e4c

Only use the first line or header of the CSV/TSV file to determine the delimiter of the file. This prevents csv.Sniffer from failing to determine the delimiter when the data values include commas or tabs.

joverlee521 requested a review from a team December 10, 2022 03:05

Update changelog

0f740f3

tsibley approved these changes Dec 12, 2022

tsibley reviewed Dec 12, 2022

joverlee521 merged commit e5d7679 into master Dec 12, 2022

joverlee521 deleted the curate-metadata-input branch December 12, 2022 21:27

joverlee521 mentioned this pull request Dec 16, 2022

augur db with import/export of metadata #1094

Draft

2 tasks
