Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CLP query parsing mistakenly treats static text with wildcards as int and float #457

Open
haiqi96 opened this issue Jun 24, 2024 · 0 comments
Labels
bug Something isn't working

Comments

@haiqi96
Copy link
Contributor

haiqi96 commented Jun 24, 2024

Bug

When CLP handles an ambigious token with wildcard attached, it seems to treat it as a float and int even if the token doesn't contain any numeric value.

After a quick investigation, it seems that the culprit is at

m_possible_types.push_back(Type::IntVar);

A fix that seems to work is to remove IntVar and FloatVar from possible type and set
m_cannot_convert_to_non_dict_var = true.
However, it would be nice to think carefully through the parsing logic to ensure there's no new corner cases introduced by the fix.

Some corner cases that we might need to check if they are properly handled

  1. float scientific notation
  2. hex (should have already been handled by CLP)
  3. -* and .*

Update 1:
CLP doesn't handle Scientific notation, so we don't need to consider this case.

is_var being false means that, the token doesn't contain any decimal number, nor it is a hex.
I would actually suggest renaming it to be is_definite_var

Here are the following corner case I can think of:

  1. "*text*.*"
  2. "*text*-*" (- can be a part of negative number)

Referring to the unit test, it looks like - and . are the only non-decimal chars that can be encoded.

Note, the . and - must be between two *. otherwise, they will be directly connected to a byte that is non-decimal(without a delimiter in between) and can only be treated as dictionary var.

In addition, CLP doesn't treat "-" + hex as encodedable int, so "text*-deadbeef" is not considered to contain an int.

In my opinion, to add a quick fix for this bug, we can add an extra check to see if the token contains any . or - surrounded by two *. if no, then it can be treated as a logtype or vardict only.

Update2:

The statement made in update 1 is not entirely correct, because we seem to not consider if a token containing a wildcard shall be splitted into two tokens. CLP always assume that there's only 1 token and uses a fallback mechanism (

} else if (query_token.has_greedy_wildcard_in_middle()) {
).

So perhaps the only corner case we need to consider is
"*-*.*"

CLP version

628ba3d

Environment

ubuntu 22.04

Reproduction steps

In CLP, instrument clg so it prints out all log types in its subqueries.

Compress any reasonable large logs.
Search for "FlowMonitor", or some random string (without any numeric value) surrounded by wildcard

Regardless of the matching results, you will notice CLP prints out every log type with a \INT in it.

@haiqi96 haiqi96 added the bug Something isn't working label Jun 24, 2024
@haiqi96 haiqi96 changed the title CLP query parsing mistakenly treats static text with wildcards as int CLP query parsing mistakenly treats static text with wildcards as int and float Jun 27, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

1 participant