Enable reading metadata as a pandas DataFrame #743

huddlej · 2021-07-06T19:27:32Z

Description of proposed changes

This PR modifies the current read_metadata function in utils.py and the associated metadata_file.py module to allow users to read metadata as a pandas DataFrame instead of as a Python dictionary. We have used pandas to read our metadata for a long time now, but we have converted the intermediate DataFrame into a dictionary for backwards compatibility across Augur. Since Python dictionaries use memory less efficiently than DataFrames for this type of data, we would like the option to return the intermediate DataFrame as the final representation of the metadata.

Most changes in this PR involve updating the filter.py module to use this new DataFrame interface for its downstream filtering logic. The most complicated change here involves a refactor of the get_numerical_dates utility function to support dates from a DataFrame. While we could rewrite this function to use less redundant logic, we can leave that work for a later PR.

Note that this PR is the first step toward a refactor of filter.py described in #699. By using pandas DataFrames in the filter logic now, we can later rely on the DataFrame chunking interface to avoid reading all metadata into memory and still benefit from pandas' query, indexing, and data type-related functionality.

Related issue(s)

Related to #699

Testing

Tested by CI

codecov · 2021-07-06T19:28:50Z

Codecov Report

Merging #743 (bb81de2) into master (89bc028) will decrease coverage by 0.00%.
The diff coverage is 44.44%.

@@            Coverage Diff             @@
##           master     #743      +/-   ##
==========================================
- Coverage   31.06%   31.05%   -0.01%     
==========================================
  Files          41       41              
  Lines        5710     5731      +21     
  Branches     1389     1394       +5     
==========================================
+ Hits         1774     1780       +6     
- Misses       3861     3874      +13     
- Partials       75       77       +2

Impacted Files	Coverage Δ
augur/utils.py	`40.17% <37.14%> (-1.32%)`	⬇️
augur/filter.py	`46.69% <40.90%> (ø)`
augur/util_support/metadata_file.py	`100.00% <100.00%> (ø)`

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 89bc028...bb81de2. Read the comment docs.

Allows metadata to be returned as a pandas DataFrame instead of a dictionary and uses this API to load metadata as a DataFrame in augur filter. Updates the corresponding metadata references in augur filter and related function calls to work with this new data structure. This change allows us to eventually switch to chunking metadata input to avoid reading it all into memory.

pylint complains about several stylistic issues with our code that are always low priority (e.g., import order). This commit ignores some of the most annoying warnings, so built-in linters (in emacs, etc.) are more useful for catching real issues.

tsibley

@huddlej and I talked about this offline. Summary is 👍.

huddlej requested review from joverlee521 and tsibley July 6, 2021 21:25

huddlej modified the milestone: Feature release 13.0.0 Jul 6, 2021

This was referenced Jul 6, 2021

Reduce metadata memory usage in augur filter #699

Closed

Refactor filters into separate functions #745

Merged

Refactor subsampling into its own function #746

Merged

huddlej added 2 commits July 15, 2021 12:03

Ignore spurious pylint warnings

bb81de2

pylint complains about several stylistic issues with our code that are always low priority (e.g., import order). This commit ignores some of the most annoying warnings, so built-in linters (in emacs, etc.) are more useful for catching real issues.

huddlej force-pushed the pandas-metadata branch from 76b2c86 to bb81de2 Compare July 15, 2021 19:03

tsibley approved these changes Jul 15, 2021

View reviewed changes

huddlej added this to the Feature release 13.0.0 milestone Jul 16, 2021

huddlej merged commit 990e111 into master Jul 16, 2021

huddlej deleted the pandas-metadata branch July 16, 2021 01:56

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Enable reading metadata as a pandas DataFrame #743

Enable reading metadata as a pandas DataFrame #743

huddlej commented Jul 6, 2021

codecov bot commented Jul 6, 2021 •

edited

Loading

tsibley left a comment

Enable reading metadata as a pandas DataFrame #743

Enable reading metadata as a pandas DataFrame #743

Conversation

huddlej commented Jul 6, 2021

Description of proposed changes

Related issue(s)

Testing

codecov bot commented Jul 6, 2021 • edited Loading

Codecov Report

tsibley left a comment

Choose a reason for hiding this comment

codecov bot commented Jul 6, 2021 •

edited

Loading