Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

io: Parse metadata with C engine, restrict to either CSV or TSV #812

Merged
merged 5 commits into from
Mar 31, 2023

Conversation

victorlin
Copy link
Member

@victorlin victorlin commented Dec 10, 2021

Description of proposed changes

See commit messages.

Related issue(s)

Thinking about this since I'm making similar changes for the augur filter database implementation.

Testing

  • Test added
  • Checks pass

Checklist

  • Add a message in CHANGES.md summarizing the changes in this PR that are end user focused. Keep headers and formatting consistent with the rest of the file.

@fanninpm
Copy link

fanninpm commented Jan 5, 2022

You could also use csv.Sniffer.sniff() to return a Dialect, which can be passed in to pandas.read_csv() as the dialect keyword argument.

@tsibley
Copy link
Member

tsibley commented Feb 2, 2022

The only concern would be if anyone is leveraging the Python parser for a non-tab-delimited file (e.g. metadata.csv). This would be a breaking change.

I'm not sure if this is an acceptable trade off. Diving into history, Augur's supported CSV for metadata since at least 2018 and delimiter sniffing since mid-2020.

@huddlej may have thoughts here too.

@huddlej
Copy link
Contributor

huddlej commented Feb 2, 2022

I agree, @tsibley, that we can't drop CSV support. The original context for the current implementation is described in #574. There are two separate problems:

  1. determining the delimiter of an input file
  2. efficiently parsing a file given its delimiter

In the older Augur implementations, we addressed problem 1 by inspecting the extension of the input filename. This led to the problems in #574. We opted for the convenience of pandas's delimiter sniffer in Python parser mode, to solve this problem at the expense of a slower solution to problem 2.

As @fanninpm points out, we could use csv.Sniffer directly on the first line of the input file in read_metadata. Once we know the delimiter, we could call read_csv with that delimiter and use the C engine to parse. This would solve both problems without breaking backward compatibility.

@victorlin victorlin marked this pull request as draft March 30, 2022 18:56
@victorlin victorlin force-pushed the victorlin/io/use-c-engine-parsing branch from af42673 to 585c671 Compare March 29, 2023 22:20
@victorlin victorlin changed the title io: Parse metadata with C engine and tab separator io: Parse metadata with C engine, restrict to either CSV or TSV Mar 29, 2023
@victorlin victorlin force-pushed the victorlin/io/use-c-engine-parsing branch from 585c671 to 44ee1ba Compare March 29, 2023 22:41
@victorlin
Copy link
Member Author

Thanks @fanninpm @tsibley @huddlej for the comments and suggestions. I finally updated this PR to use csv.Sniffer.

@victorlin victorlin marked this pull request as ready for review March 29, 2023 22:48
@victorlin victorlin requested a review from a team March 29, 2023 22:48
@codecov
Copy link

codecov bot commented Mar 29, 2023

Codecov Report

Patch coverage: 100.00% and project coverage change: +0.02 🎉

Comparison is base (1dacac1) 68.39% compared to head (9f48ff2) 68.42%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #812      +/-   ##
==========================================
+ Coverage   68.39%   68.42%   +0.02%     
==========================================
  Files          63       63              
  Lines        6812     6818       +6     
  Branches     1671     1672       +1     
==========================================
+ Hits         4659     4665       +6     
  Misses       1843     1843              
  Partials      310      310              
Impacted Files Coverage Δ
augur/io/metadata.py 96.17% <100.00%> (+0.15%) ⬆️

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

Copy link
Contributor

@joverlee521 joverlee521 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@victorlin
Copy link
Member Author

victorlin commented Mar 31, 2023

@joverlee521 I pushed a couple touch-ups (70d8150...740a330), do those LGTY?

@victorlin victorlin force-pushed the victorlin/io/use-c-engine-parsing branch from 2783ef8 to 740a330 Compare March 31, 2023 17:20
@joverlee521
Copy link
Contributor

@victorlin Changes look good! (Ignoring the unrelated failing Cram test)

Previously, the delimiter could be anything arbitrary. However, all
Augur subcommands that use this function only advertise compatibility
with CSV and TSV. I don't think there's a good reason to support
arbitrary delimiters.
The python engine was only used to detect the delimiter. Now that the
delimiter is detected separately, use the C engine since it is faster.
Avoids re-defining this list at each use case and prevents them from
getting out of sync.
@victorlin victorlin force-pushed the victorlin/io/use-c-engine-parsing branch from 740a330 to 9f48ff2 Compare March 31, 2023 19:18
@victorlin victorlin merged commit 73aad80 into master Mar 31, 2023
@victorlin victorlin deleted the victorlin/io/use-c-engine-parsing branch March 31, 2023 20:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

5 participants