Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

filter: --query fails when numerical comparisons are used on columns with missing values #1269

Closed
victorlin opened this issue Jul 28, 2023 · 1 comment · Fixed by #1268
Closed
Assignees
Labels
bug Something isn't working

Comments

@victorlin
Copy link
Member

victorlin commented Jul 28, 2023

Current Behavior

Given a metadata file with an "optional" numerical column coverage, any numerical query using coverage results in an error.

cat >metadata.tsv <<~~
strain	coverage
SEQ_1	0.94
SEQ_2	0.95
SEQ_3	0.96
SEQ_4	
~~

augur filter \
  --metadata metadata.tsv \
  --query "coverage >= 0.95" \
  --output-strains filtered_strains.txt
# ERROR: Internal Pandas error when applying query:
# 	'>=' not supported between instances of 'str' and 'float'
# Ensure the syntax is valid per <https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#indexing-query>.

This is because the dtype inference in augur.io.read_metadata does not support numerical columns with empty values, which is because it calls pandas.read_csv with na_filter=False.

Expected behavior

Missing values should be dropped in the filter.

augur filter \
  --metadata metadata.tsv \
  --query "coverage >= 0.95" \
  --output-strains filtered_strains.txt
# 2 strains were dropped during filtering
# 	2 of these were filtered out by the query: "coverage >= 0.95"
# 2 strains passed all filters

Possible solutions

  1. Call pandas.read_csv with na_filter=True. This introduces other issues as noted by @huddlej in Slack.
  2. Convert the numerical columns before applying the query: filter: Try converting all queried columns to numerical type #1268
@victorlin victorlin added the bug Something isn't working label Jul 28, 2023
@victorlin victorlin self-assigned this Jul 28, 2023
@victorlin
Copy link
Member Author

There was some discussion in Slack around which solution to use.

@huddlej noted that na_filter=False was introduced as an alternative to setting the dtype of the date column to be string. The latter was recently added so it might be possible to revert back to the default behavior of na_filter=True. However, that was tested and requires additional tweaking in date parsing to get things working properly.

The general consensus was that we should use the "string" dtype for everything to avoid having to worry about how pandas automatically handles dtypes under the hood. With na_filter=False, numerical columns with missing values are already read as string, and #1268 works with that. #1252 is also reaching for the same goal.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
No open projects
Development

Successfully merging a pull request may close this issue.

1 participant