-
Notifications
You must be signed in to change notification settings - Fork 129
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
filter: Migrating --query
syntax from pandas to SQLite
#920
Comments
Thinking some more on this, I'm developing a preference towards option 1 with the
I will start updating #854 with this unless anyone has other opinions here. |
@victorlin I like this idea of the |
Late to the party, but I concur with the analysis and comments above, particularly @victorlin's rationale in #920 (comment). While I don't think it's necessarily worth our time, I think an "even better" option would be introducing an engine-neutral query syntax that can be implemented by all engines easily and used by most |
We met a few months ago and the decision was to maintain a single implementation of For compatibility, |
[note] This issue reflects a specific concern in #854. Discussion here will be used to refine that PR.
Context
The introduction of
--query
in version 8.0.0 allowed for complex queries against metadata that aren't handled by built-in filtering options. Here's a simple GitHub search showing some current usage of the feature.In #854, the internals of
augur filter
are rewritten using SQLite. This makes it difficult to maintain compatibility for current usage of--query
, which uses pandas under the hood to evaluate the expression. We should aim to transition users from pandas to SQLite expression syntax.Background - difference between pandas and SQLite expression syntax
The transition may be easier if pandas expression syntax was distinct from SQLite expression syntax. However, this is not the case and many operators overlap with slight differences. So when current queries are run with the new SQLite implementation, they don't fail but instead give inaccurate results, which is worse.
Example from running the genomic surveillance tutorial
using pandas augur filter (expected behavior):
using SQLite augur filter (inaccurate behavior):
With SQLite, the pandas expression resolves to
FALSE
and the query is effectively--exclude-all
. This is due to differences in operator meaning and order of operations:&
/|
are logical operators whereas in SQLite,&
/|
are bitwise AND/OR respectively. The difference isn't a problem for this example since in SQL,TRUE & TRUE
==TRUE AND TRUE
.==
,>=
). This is the main reason for inaccurate results in this example.Click to expand detailed examples from a SQLite prompt.
There are more examples, but the point here is: we should draw a clear line between pandas and SQLite query, and ensure users do not mix the two.
For reference:
engine
param, default isengine='numexpr'
. See numexpr operators docTasks
Possible solutions
Note that in the long-term, we may also support other database systems with varying expression syntax.
--engine
). Keep the current pandas implementation as the default--engine pandas
, but in "maintenance mode" with aDeprecationWarning
. Only add new features to--engine sqlite
.--query-engine
with default aspandas
, with the option to opt into the more efficient--query-engine sqlite
. This would:pandas.eval()
does not have great documentation or a interface that can be used to parse tokens from expressions for easy conversion.pandas.eval()
relies on either numexpr or native Python expression compilation. The support is wide, so the converter will also need wide support.The text was updated successfully, but these errors were encountered: