Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable reading metadata as a pandas DataFrame #743

Merged
merged 2 commits into from
Jul 16, 2021
Merged

Conversation

huddlej
Copy link
Contributor

@huddlej huddlej commented Jul 6, 2021

Description of proposed changes

This PR modifies the current read_metadata function in utils.py and the associated metadata_file.py module to allow users to read metadata as a pandas DataFrame instead of as a Python dictionary. We have used pandas to read our metadata for a long time now, but we have converted the intermediate DataFrame into a dictionary for backwards compatibility across Augur. Since Python dictionaries use memory less efficiently than DataFrames for this type of data, we would like the option to return the intermediate DataFrame as the final representation of the metadata.

Most changes in this PR involve updating the filter.py module to use this new DataFrame interface for its downstream filtering logic. The most complicated change here involves a refactor of the get_numerical_dates utility function to support dates from a DataFrame. While we could rewrite this function to use less redundant logic, we can leave that work for a later PR.

Note that this PR is the first step toward a refactor of filter.py described in #699. By using pandas DataFrames in the filter logic now, we can later rely on the DataFrame chunking interface to avoid reading all metadata into memory and still benefit from pandas' query, indexing, and data type-related functionality.

Related issue(s)

Related to #699

Testing

  • Tested by CI

@codecov
Copy link

codecov bot commented Jul 6, 2021

Codecov Report

Merging #743 (bb81de2) into master (89bc028) will decrease coverage by 0.00%.
The diff coverage is 44.44%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #743      +/-   ##
==========================================
- Coverage   31.06%   31.05%   -0.01%     
==========================================
  Files          41       41              
  Lines        5710     5731      +21     
  Branches     1389     1394       +5     
==========================================
+ Hits         1774     1780       +6     
- Misses       3861     3874      +13     
- Partials       75       77       +2     
Impacted Files Coverage Δ
augur/utils.py 40.17% <37.14%> (-1.32%) ⬇️
augur/filter.py 46.69% <40.90%> (ø)
augur/util_support/metadata_file.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 89bc028...bb81de2. Read the comment docs.

Allows metadata to be returned as a pandas DataFrame instead of a
dictionary and uses this API to load metadata as a DataFrame in augur
filter. Updates the corresponding metadata references in augur filter
and related function calls to work with this new data structure.

This change allows us to eventually switch to chunking metadata input to
avoid reading it all into memory.
pylint complains about several stylistic issues with our code that are
always low priority (e.g., import order). This commit ignores some of
the most annoying warnings, so built-in linters (in emacs, etc.) are
more useful for catching real issues.
Copy link
Member

@tsibley tsibley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@huddlej and I talked about this offline. Summary is 👍.

@huddlej huddlej added this to the Feature release 13.0.0 milestone Jul 16, 2021
@huddlej huddlej merged commit 990e111 into master Jul 16, 2021
@huddlej huddlej deleted the pandas-metadata branch July 16, 2021 01:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants