Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve curate metadata parser #1110

Merged
merged 3 commits into from
Dec 12, 2022
Merged

Improve curate metadata parser #1110

merged 3 commits into from
Dec 12, 2022

Conversation

joverlee521
Copy link
Contributor

@joverlee521 joverlee521 commented Dec 10, 2022

Description of proposed changes

Improves the curate metadata parser by only using the first line or header of the CSV/TSV file to determine the delimiter of the file.

Prior to this change, the csv.Sniffer would fail when the data values in a TSV file include commas, such as when a metadata TSV file includes a column of comma separated author names.

Testing

Updated the curate metadata-input test to include an "authors" field in the testing TSV file.

Checklist

  • Add a message in CHANGES.md summarizing the changes in this PR that are end user focused. Keep headers and formatting consistent with the rest of the file.

Add an additional "authors" field with comma separated values to the
testing metadata TSV file, showing how this causes the csv sniffer
to fail to identify the delimiter for the file.
Only use the first line or header of the CSV/TSV file to determine the
delimiter of the file. This prevents csv.Sniffer from failing to
determine the delimiter when the data values include commas or tabs.
@joverlee521 joverlee521 requested a review from a team December 10, 2022 03:05
@codecov
Copy link

codecov bot commented Dec 10, 2022

Codecov Report

Base: 63.37% // Head: 63.37% // No change to project coverage 👍

Coverage data is based on head (0f740f3) compared to base (029c5bd).
Patch coverage: 100.00% of modified lines in pull request are covered.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #1110   +/-   ##
=======================================
  Coverage   63.37%   63.37%           
=======================================
  Files          57       57           
  Lines        6638     6638           
  Branches     1632     1632           
=======================================
  Hits         4207     4207           
  Misses       2147     2147           
  Partials      284      284           
Impacted Files Coverage Δ
augur/io/metadata.py 96.02% <100.00%> (ø)

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

Copy link
Member

@tsibley tsibley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This LGTM. Dare I ask what happens if some of your TSV column names contain commas?

@@ -138,7 +138,7 @@ def read_table_to_dict(table, duplicate_reporting=DataErrorMethod.ERROR_FIRST, i
duplicate_ids = set()
with open_file(table) as handle:
# Get sample to determine delimiter
table_sample = handle.read(1024)
table_sample = handle.readline()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh and also, I don't think it's necessary since we don't limit reads later in this function, but we could preserve the 1KiB read limit with:

= handle.readline(1024)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I didn't think it would be necessary to add a size limit here since we read full records later anyways.

@joverlee521
Copy link
Contributor Author

Dare I ask what happens if some of your TSV column names contain commas?

I dare not 😅 But if this actually becomes an issue, maybe we should add a --metadata-format option to allow users to override the csv.Sniffer.

@joverlee521 joverlee521 merged commit e5d7679 into master Dec 12, 2022
@joverlee521 joverlee521 deleted the curate-metadata-input branch December 12, 2022 21:27
@tsibley
Copy link
Member

tsibley commented Dec 12, 2022

In the end I dared… and it goes about as well as you expect (not well). So yeah, we'll need to either make the sniffer smarter (look at proportions of one delimiter to another) or allow manual override (probably best).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

2 participants