filter: `--query` fails when numerical comparisons are used on columns with missing values #1269

victorlin · 2023-07-28T19:37:28Z

Current Behavior

Given a metadata file with an "optional" numerical column coverage (one where some rows have no value), any numerical query on coverage fails with an error.

cat >metadata.tsv <<~~
strain	coverage
SEQ_1	0.94
SEQ_2	0.95
SEQ_3	0.96
SEQ_4	
~~

augur filter \
  --metadata metadata.tsv \
  --query "coverage >= 0.95" \
  --output-strains filtered_strains.txt
# ERROR: Internal Pandas error when applying query:
# 	'>=' not supported between instances of 'str' and 'float'
# Ensure the syntax is valid per <https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#indexing-query>.

This happens because augur.io.read_metadata calls pandas.read_csv with na_filter=False. With that setting, empty values are kept as empty strings, so dtype inference cannot treat a numerical column with missing values as numeric and reads it as strings instead.
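The effect of na_filter on dtype inference can be reproduced outside of augur. The following is a minimal sketch, assuming only pandas and the same four-row table as above; the helper names `read_table`, `coverage_is_numeric`, and `comparison_fails` are made up for illustration and are not part of augur.

```python
import io

import pandas as pd

# Same content as metadata.tsv above; SEQ_4 has an empty coverage value.
TSV = "strain\tcoverage\nSEQ_1\t0.94\nSEQ_2\t0.95\nSEQ_3\t0.96\nSEQ_4\t\n"


def read_table(na_filter):
    """Read the table with or without pandas' missing-value detection."""
    return pd.read_csv(io.StringIO(TSV), sep="\t", na_filter=na_filter)


def coverage_is_numeric(na_filter):
    """Report whether pandas inferred a numeric dtype for coverage."""
    df = read_table(na_filter)
    return pd.api.types.is_numeric_dtype(df["coverage"])


def comparison_fails(na_filter):
    """Report whether a numerical comparison on coverage raises TypeError."""
    df = read_table(na_filter)
    try:
        df["coverage"] >= 0.95
    except TypeError:
        return True
    return False


print("na_filter=False numeric:", coverage_is_numeric(False))
print("na_filter=True numeric:", coverage_is_numeric(True))
print("na_filter=False comparison fails:", comparison_fails(False))
```

With na_filter=True the empty cell becomes NaN and the column is numeric, which is why that option is listed as a possible solution below.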

Expected behavior

Rows with a missing value in the queried column should be dropped by the filter.

augur filter \
  --metadata metadata.tsv \
  --query "coverage >= 0.95" \
  --output-strains filtered_strains.txt
# 2 strains were dropped during filtering
# 	2 of these were filtered out by the query: "coverage >= 0.95"
# 2 strains passed all filters

Possible solutions

Call pandas.read_csv with na_filter=True. This introduces other issues as noted by @huddlej in Slack.
Convert the numerical columns before applying the query, as proposed in "filter: Try converting all queried columns to numerical type" (#1268).
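The second option can be sketched as follows. This is an illustration under stated assumptions, not the code from #1268: the function name `convert_queried_columns` is hypothetical, the list of queried columns is passed in by hand, and empty strings are treated as missing before conversion.

```python
import pandas as pd


def convert_queried_columns(df, columns):
    """Return a copy of df in which each named column is converted to a
    numeric dtype when all of its non-empty values parse as numbers."""
    out = df.copy()
    for col in columns:
        values = out[col]
        # Treat empty strings as missing so they do not block conversion.
        values = values.where(values != "")
        try:
            out[col] = pd.to_numeric(values, errors="raise")
        except (ValueError, TypeError):
            # Not a numerical column (for example free text): leave it as is.
            pass
    return out


# Metadata as it looks after being read with na_filter=False:
# every coverage value is a string, including the empty one.
metadata = pd.DataFrame(
    {
        "strain": ["SEQ_1", "SEQ_2", "SEQ_3", "SEQ_4"],
        "coverage": ["0.94", "0.95", "0.96", ""],
    }
)

converted = convert_queried_columns(metadata, ["coverage"])
passed = converted.query("coverage >= 0.95")["strain"].tolist()
print(passed)
```

After conversion the missing value is NaN, which never satisfies the comparison, so SEQ_4 is dropped together with SEQ_1, matching the expected behavior described above.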


victorlin · 2023-07-31T18:03:37Z

There was some discussion in Slack around which solution to use.

@huddlej noted that na_filter=False was introduced as an alternative to setting the dtype of the date column to string. Since the latter has recently been added, it might be possible to revert to the default behavior of na_filter=True. However, that was tested and would require additional tweaks to date parsing to work properly.

The general consensus was that we should use the "string" dtype for everything, to avoid having to worry about how pandas automatically handles dtypes under the hood. With na_filter=False, numerical columns with missing values are already read as string, and #1268 works with that. #1252 is working toward the same goal.
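As a rough illustration of the "string dtype for everything" idea, the sketch below reads the same table with an explicit dtype so that pandas infers nothing. It only assumes pandas itself; it is not how augur.io.read_metadata is implemented.

```python
import io

import pandas as pd

# Same content as metadata.tsv above; SEQ_4 has an empty coverage value.
TSV = "strain\tcoverage\nSEQ_1\t0.94\nSEQ_2\t0.95\nSEQ_3\t0.96\nSEQ_4\t\n"

# Read every column as the pandas "string" dtype and keep empty values
# as empty strings instead of converting them to missing values.
metadata = pd.read_csv(
    io.StringIO(TSV), sep="\t", dtype="string", na_filter=False
)

all_string = all(
    pd.api.types.is_string_dtype(metadata[col]) for col in metadata.columns
)
coverage_values = metadata["coverage"].tolist()

print(all_string)
print(coverage_values)
```

Every column then has the same dtype whether or not it contains missing values, and any numerical conversion becomes an explicit step applied only to the columns a query needs.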

victorlin added the bug label Jul 28, 2023

victorlin self-assigned this Jul 28, 2023

victorlin mentioned this issue Jul 28, 2023

filter: Try converting all queried columns to numerical type #1268

Merged


victorlin closed this as completed in #1268 Jul 31, 2023

huddlej mentioned this issue Jan 5, 2024

Use less dtype inference when reading metadata into DataFrames #1252

Merged

