Enable reading metadata as a pandas DataFrame #743
Merged
Description of proposed changes
This PR modifies the current read_metadata function in utils.py and the associated metadata_file.py module to allow users to read metadata as a pandas DataFrame instead of as a Python dictionary. We have used pandas to read our metadata for a long time now, but we have converted the intermediate DataFrame into a dictionary for backwards compatibility across Augur. Since Python dictionaries use memory less efficiently than DataFrames for this type of data, we would like the option to return the intermediate DataFrame as the final representation of the metadata.

Most changes in this PR involve updating the filter.py module to use this new DataFrame interface for its downstream filtering logic. The most complicated change here involves a refactor of the get_numerical_dates utility function to support dates from a DataFrame. While we could rewrite this function to use less redundant logic, we can leave that work for a later PR.

Note that this PR is the first step toward a refactor of filter.py described in #699. By using pandas DataFrames in the filter logic now, we can later rely on the DataFrame chunking interface to avoid reading all metadata into memory and still benefit from pandas' query, indexing, and data type-related functionality.

Related issue(s)
Related to #699
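
For illustration only, here is a minimal sketch of the chunked filtering that the description anticipates for the later refactor. It is not code from this PR: the function name filter_metadata_in_chunks, the sample columns (strain, region, date), and the chunk size are all hypothetical. It relies only on the chunksize argument of pandas.read_csv and on DataFrame.query.

```python
import io

import pandas as pd

# Hypothetical metadata; the column names are illustrative, not Augur's schema.
METADATA_TSV = (
    "strain\tregion\tdate\n"
    "A\tAsia\t2020-01-01\n"
    "B\tEurope\t2020-02-15\n"
    "C\tAsia\t2020-03-10\n"
    "D\tAfrica\t2020-04-05\n"
)


def filter_metadata_in_chunks(handle, query, chunksize=2):
    """Return the strain names that match `query`, reading `chunksize` rows at a time.

    Hypothetical helper: only one chunk of the metadata is held in memory
    at once, while each chunk still supports the DataFrame query interface.
    """
    matching = []
    reader = pd.read_csv(
        handle,
        sep="\t",
        index_col="strain",
        dtype=str,
        chunksize=chunksize,  # makes read_csv return an iterator of DataFrames
    )
    for chunk in reader:
        # Each chunk is an ordinary DataFrame indexed by strain name.
        matching.extend(chunk.query(query).index)
    return matching


asia_strains = filter_metadata_in_chunks(io.StringIO(METADATA_TSV), "region == 'Asia'")
print(asia_strains)
```

Only the names of matching records are accumulated here, so peak memory depends on the chunk size rather than on the size of the metadata file.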
Testing