io: Parse metadata with C engine, restrict to either CSV or TSV #812

victorlin · 2021-12-10T20:40:24Z

Description of proposed changes

See commit messages.

Related issue(s)

Thinking about this since I'm making similar changes for the augur filter database implementation.

Testing

Test added
Checks pass

Checklist

Add a message in CHANGES.md summarizing the changes in this PR that are end user focused. Keep headers and formatting consistent with the rest of the file.

fanninpm · 2022-01-05T19:29:01Z

You could also use csv.Sniffer.sniff() to return a Dialect, which can be passed in to pandas.read_csv() as the dialect keyword argument.
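
A minimal sketch of that suggestion, assuming a small in-memory tab-delimited sample (the column names and values are made up for illustration). `csv.Sniffer().sniff()` infers a dialect from the text, and the result is handed to `pandas.read_csv()` through its `dialect` argument:

```python
import csv
import io

import pandas as pd

# Hypothetical metadata sample; real input would come from a file.
sample = (
    "strain\tdate\tregion\n"
    "A/1\t2020-01-01\tEurope\n"
    "B/2\t2020-02-01\tAsia\n"
)

# Infer the dialect (delimiter, quoting, ...) from the text itself.
dialect = csv.Sniffer().sniff(sample)

# Pass the sniffed dialect to pandas instead of hard-coding a separator.
metadata = pd.read_csv(io.StringIO(sample), dialect=dialect)
```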

tsibley · 2022-02-02T19:10:56Z

The only concern would be if anyone is leveraging the Python parser for a non-tab-delimited file (e.g. metadata.csv). This would be a breaking change.

I'm not sure this is an acceptable trade-off. Digging into the history, Augur has supported CSV for metadata since at least 2018 and delimiter sniffing since mid-2020.

@huddlej may have thoughts here too.

huddlej · 2022-02-02T23:04:37Z

I agree, @tsibley, that we can't drop CSV support. The original context for the current implementation is described in #574. There are two separate problems:

1. determining the delimiter of an input file
2. efficiently parsing a file given its delimiter

In older Augur implementations, we addressed problem 1 by inspecting the extension of the input filename, which led to the problems in #574. To solve problem 1 instead, we opted for the convenience of pandas's delimiter sniffer in Python parser mode, at the expense of a slower solution to problem 2.

As @fanninpm points out, we could use csv.Sniffer directly on the first line of the input file in read_metadata. Once we know the delimiter, we could call read_csv with that delimiter and use the C engine to parse. This would solve both problems without breaking backward compatibility.
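
The approach described here could look roughly like the sketch below. It is not the code from this PR: `read_table` is a hypothetical stand-in for `read_metadata`, and the temporary file and its contents are invented for the example. Only the first line is sniffed, and the detected delimiter is then passed to `pandas.read_csv()` with `engine="c"`:

```python
import csv
import os
import tempfile

import pandas as pd


def read_table(path):
    """Hypothetical helper: sniff the delimiter, then parse with the C engine."""
    # Problem 1: detect the delimiter from the header line only.
    with open(path, newline="") as handle:
        header = handle.readline()
    delimiter = csv.Sniffer().sniff(header).delimiter

    # Problem 2: parse quickly now that the delimiter is known.
    return pd.read_csv(path, sep=delimiter, engine="c")


# Write a small CSV file with a ".txt" extension to show that detection
# relies on the file contents, not on the filename.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "metadata.txt")
    with open(path, "w") as handle:
        handle.write("strain,date,region\nA/1,2020-01-01,Europe\n")
    metadata = read_table(path)
```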

victorlin · 2023-03-29T22:46:19Z

Thanks @fanninpm @tsibley @huddlej for the comments and suggestions. I finally updated this PR to use csv.Sniffer.

codecov · 2023-03-29T22:57:20Z

Codecov Report

Patch coverage: 100.00% and project coverage change: +0.02 🎉

Comparison is base (1dacac1) 68.39% compared to head (9f48ff2) 68.42%.

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #812      +/-   ##
==========================================
+ Coverage   68.39%   68.42%   +0.02%     
==========================================
  Files          63       63              
  Lines        6812     6818       +6     
  Branches     1671     1672       +1     
==========================================
+ Hits         4659     4665       +6     
  Misses       1843     1843              
  Partials      310      310

Impacted Files	Coverage Δ
augur/io/metadata.py	`96.17% <100.00%> (+0.15%)`	⬆️

joverlee521

LGTM!

victorlin · 2023-03-31T17:14:10Z

@joverlee521 I pushed a couple touch-ups (70d8150...740a330), do those LGTY?

joverlee521 · 2023-03-31T18:21:03Z

@victorlin Changes look good! (Ignoring the unrelated failing Cram test)

victorlin requested a review from huddlej December 10, 2021 20:40

victorlin assigned huddlej and victorlin Dec 10, 2021

victorlin unassigned huddlej Jan 20, 2022

victorlin marked this pull request as draft March 30, 2022 18:56

victorlin force-pushed the victorlin/io/use-c-engine-parsing branch from af42673 to 585c671 Compare March 29, 2023 22:20

victorlin changed the title from "io: Parse metadata with C engine and tab separator" to "io: Parse metadata with C engine, restrict to either CSV or TSV" Mar 29, 2023

victorlin force-pushed the victorlin/io/use-c-engine-parsing branch from 585c671 to 44ee1ba Compare March 29, 2023 22:41

victorlin marked this pull request as ready for review March 29, 2023 22:48

victorlin requested a review from a team March 29, 2023 22:48

joverlee521 approved these changes Mar 31, 2023

victorlin force-pushed the victorlin/io/use-c-engine-parsing branch from 2783ef8 to 740a330 Compare March 31, 2023 17:20

victorlin added 5 commits March 31, 2023 12:18

filter: Add test to show existing support for metadata delimiters

c0b13b4

read_metadata: Restrict possible delimiters when reading

74e7fe9

Previously, the delimiter could be anything arbitrary. However, all Augur subcommands that use this function only advertise compatibility with CSV and TSV. I don't think there's a good reason to support arbitrary delimiters.
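
One way to restrict detection to CSV and TSV is the `delimiters` argument of `csv.Sniffer.sniff()`, which limits the candidate characters. The sketch below is an assumption about how that could be done, not the code from this commit; `VALID_DELIMITERS` and `detect_delimiter` are hypothetical names:

```python
import csv

# Hypothetical constant: the only delimiters accepted (CSV and TSV).
VALID_DELIMITERS = (",", "\t")


def detect_delimiter(header_line):
    """Return the delimiter of a header line, or raise ValueError."""
    candidates = "".join(VALID_DELIMITERS)
    try:
        # sniff() only considers the characters passed in `delimiters`.
        dialect = csv.Sniffer().sniff(header_line, delimiters=candidates)
    except csv.Error as error:
        raise ValueError(
            f"Could not determine a delimiter; expected one of {VALID_DELIMITERS!r}."
        ) from error
    return dialect.delimiter
```

With this sketch, a header line delimited by anything other than a comma or tab (a semicolon, for example) raises `ValueError` rather than being accepted.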

read_metadata: Use the C engine for pandas.read_csv()

a90f0a5

The python engine was only used to detect the delimiter. Now that the delimiter is detected separately, use the C engine since it is faster.

Update changelog

04489f7

Use constant for valid delimiters

9f48ff2

Avoids re-defining this list at each use case and prevents them from getting out of sync.

victorlin force-pushed the victorlin/io/use-c-engine-parsing branch from 740a330 to 9f48ff2 Compare March 31, 2023 19:18

victorlin merged commit 73aad80 into master Mar 31, 2023

victorlin deleted the victorlin/io/use-c-engine-parsing branch March 31, 2023 20:37

victorlin mentioned this pull request Apr 6, 2023

Allow customization of input metadata delimiter #1196

Merged

3 tasks

victorlin mentioned this pull request Aug 9, 2024

Speed up augur filter without replacing Pandas #1573

Open

5 tasks
