feat: remove default exist condition for groupby values in favour of … #5420

Open · wants to merge 2 commits into base: develop

Conversation

nityanandagohain
Member

Fixes #5136

It removes the default exists clauses added for the group by values.

@github-actions github-actions bot added the enhancement New feature or request label Jul 3, 2024
@srikanthccv
Member

LGTM, can you make the changes for traces as well?

@nityanandagohain
Member Author

Have removed it from traces as well.

srikanthccv previously approved these changes Jul 3, 2024
Member

Thanks

@ankitnayan
Collaborator

The PR should have included these 2 points too IMO

  1. Why was this there in the 1st place? I guess perf would be impacted now. If so, we must have some numbers with the PR
  2. Is the _exist column not needed anymore? I see them being dereferenced

@srikanthccv
Member

Why was this there in the 1st place?

It was an incorrect assumption when the original code was written that group by keys must exist #2571 (comment). However, it returns results that don't make sense. For example, if we group by service.name and see the result in the table, all those logs that don't have the service.name set are not shown in the result table which is misleading.

Is the _exist column not needed anymore? I see them being dereferenced

These columns mainly exist to support the EXISTS and NOT EXISTS operators.
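
To make this concrete, here is a hedged sketch of how an EXISTS filter can translate at the SQL level. The first form uses the key arrays visible in the queries later in this thread; the `code_function_exists` column name in the comment is hypothetical, standing in for the per-attribute `_exists` columns mentioned above:

```sql
-- Illustration only: one way a `code.function EXISTS` filter can translate.
-- For attributes stored in the key/value arrays, existence is a membership
-- check on the key array:
SELECT count(*)
FROM signoz_logs.distributed_logs
WHERE has(attributes_string_key, 'code.function');

-- For an attribute materialized into its own column, a companion boolean
-- exists-column distinguishes "key absent" from a legitimate zero value:
--   WHERE code_function_exists = true
```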

@srikanthccv srikanthccv dismissed their stale review July 5, 2024 03:47

can produce incorrect results for legitimate zero values, which the exists condition was originally intended to handle

@srikanthccv
Member

I see two things we can do to address the problem. This is only an issue for COUNT aggregations, since COUNT doesn't require an aggregate attribute.

  1. Run two queries
    • the first one will be the same as we currently do
    • the second one will query for the data excluding the first one (the actual query details can be figured out) and show them under something like "n/a" or "empty" to indicate there are entries in the result that don't account for any of the selected group by combinations.

This will ensure we don't overcount things for legitimate zero-value scenarios and at the same time don't produce misleading results when the group key doesn't exist.

  2. Also, nudge users in the product to make clear that COUNT without an aggregate attribute is COUNT(*), which is different from COUNT(key).
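
One possible shape of the two-query approach (a sketch only; it reuses the log schema from the queries quoted later in this thread, and the exact query details still need to be figured out):

```sql
-- Query 1: the current behaviour, restricted to rows where the
-- group-by key exists:
SELECT
    attributes_string_value[indexOf(attributes_string_key, 'code.function')] AS `code.function`,
    toFloat64(count(*)) AS value
FROM signoz_logs.distributed_logs
WHERE severity_text = 'ERROR'
  AND has(attributes_string_key, 'code.function')
GROUP BY `code.function`;

-- Query 2: everything the first query excluded, reported as a single
-- "n/a" bucket so no rows silently disappear from the result table:
SELECT
    'n/a' AS `code.function`,
    toFloat64(count(*)) AS value
FROM signoz_logs.distributed_logs
WHERE severity_text = 'ERROR'
  AND NOT has(attributes_string_key, 'code.function');
```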

@ankitnayan
Collaborator

ankitnayan commented Jul 7, 2024

Run two queries

Inefficient?

Also nudging users in the product

Would it still be problematic for logs?

What about having a default integer value and removing that in the where clause of the query?

@nityanandagohain
Member Author

nityanandagohain commented Jul 8, 2024

@ankitnayan this is why we didn't go with default values last time

#2594 (comment)

#2594 (comment)

@srikanthccv
Member

I summarized it below.

Here's a scenario where our current query logic falls short. Imagine an application service that logs data with various attributes, including code and service resource information. The service always includes a 'service.name' by default, but 'code.function' only appears in certain parts of the codebase.

Now, when a user tries to COUNT logs WHERE severity_text = ERROR, grouped by service.name and code.function, we run into issues. The query looks something like this:

```sql
SELECT
    toStartOfInterval(fromUnixTimestamp64Nano(timestamp), toIntervalSecond(60)) AS ts,
    resources_string_value[indexOf(resources_string_key, 'service.name')] AS `service.name`,
    attributes_string_value[indexOf(attributes_string_key, 'code.function')] AS `code.function`,
    toFloat64(count(*)) AS value
FROM signoz_logs.distributed_logs
WHERE ((timestamp >= 1720511534000000000) AND (timestamp <= 1720513334000000000)) AND severity_text = 'ERROR' AND has(attributes_string_key, 'service.name') AND has(attributes_string_key, 'code.function')
GROUP BY
    `service.name`,
    `code.function`,
    ts
ORDER BY value DESC
```

The problem? The WHERE clause filters out all logs missing either service.name or code.function in their attribute keys. This leads to inaccurate results, because some ERROR logs might lack the 'code.function' attribute due to incomplete instrumentation or other reasons. Users expect to see these logs with the missing attribute as well; how best to surface them can be discussed.

So, does it make sense to exclude logs without certain attribute keys? Not really. This default behaviour skews results in common use cases, and there's no easy workaround except writing a custom ClickHouse query.

Would it be better to include these logs under a special 'no-value' (or similarly named) group? I'd argue yes. It shows users that some error logs exist without a code.function, giving a more complete picture.
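
For string attributes, this bucketing can even be done in a single query; a minimal sketch (the '<no-value>' label is a placeholder, not a settled name):

```sql
-- Sketch: bucket logs missing the attribute under a '<no-value>' label.
-- Works for string attributes; a label like this cannot be mixed into a
-- numeric or boolean column, which is the complication discussed below.
SELECT
    if(has(attributes_string_key, 'code.function'),
       attributes_string_value[indexOf(attributes_string_key, 'code.function')],
       '<no-value>') AS `code.function`,
    toFloat64(count(*)) AS value
FROM signoz_logs.distributed_logs
WHERE severity_text = 'ERROR'
GROUP BY `code.function`;
```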

Is this always the right approach? Probably not. There might be cases where a missing attribute means the log isn't relevant. In such scenarios, we'd need a way to explicitly exclude these logs using a <key> EXISTS filter.

Now, let's consider numeric and boolean attributes. Take HTTP server logs with an http.status_code or grpc.code for the response. When grouping by http.status_code, users typically only care about logs with this attribute. Logs without it are usually internal and irrelevant. Here, our current approach actually gives the expected result.

Clearly, we can't make a one-size-fits-all decision. The context and nature of the data being queried matter. So, removing the default has conditions from the WHERE clause seems logical.

But simply removing these conditions changes our query to:

```sql
SELECT
    toStartOfInterval(fromUnixTimestamp64Nano(timestamp), toIntervalSecond(60)) AS ts,
    resources_string_value[indexOf(resources_string_key, 'service.name')] AS `service.name`,
    attributes_string_value[indexOf(attributes_string_key, 'code.function')] AS `code.function`,
    toFloat64(count(*)) AS value
FROM signoz_logs.distributed_logs
WHERE ((timestamp >= 1720511534000000000) AND (timestamp <= 1720513334000000000)) AND severity_text = 'ERROR'
GROUP BY
    `service.name`,
    `code.function`,
    ts
ORDER BY value DESC
```

The consequence? For logs without a key in the attribute array, we get zero-values: an empty string for strings, 0 for numbers, and false for booleans. This creates a new problem: legitimate zero values in user attributes (like grpc.code) get mixed up with our placeholder zeros. We end up overcounting when trying to show the count for spans without grpc.code.
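
This is standard ClickHouse behaviour, not something specific to our schema: `indexOf` returns 0 when the element is not found, and indexing an array at 0 yields the default value of the element type. A tiny standalone illustration:

```sql
-- indexOf(..., 'missing') is 0, and arr[0] returns the type's default:
SELECT
    ['a', 'b'][indexOf(['k1', 'k2'], 'missing')] AS str_val,  -- '' (empty string)
    [1, 2][indexOf(['k1', 'k2'], 'missing')]     AS num_val   -- 0
```

So a log whose grpc.code attribute is genuinely 0 and a log with no grpc.code at all become indistinguishable in the grouped result.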

Can we just add our own default values for missing keys? Nope, because we risk colliding with users' actual values for numbers or booleans. We can't predict what values people might use in their attributes.

One of the suggestions was to run two queries, and the concern is that it reads twice the data. We can avoid two queries by doing conditional counting, something like the query below. We should check if there is any performance difference between this and the two-queries approach.

```sql
SELECT
    countIf(mapContains(numberTagMap, 'http.status_code')) AS real_count,
    count() AS total_count,
    numberTagMap['http.status_code'],
    total_count - real_count AS no_value_count
FROM signoz_traces.signoz_index_v2
WHERE timestamp >= (now() - toIntervalHour(10))
GROUP BY numberTagMap['http.status_code']
```

How do we make this backwards compatible? Since people have already set up dashboards/alerts, we mustn't change their existing behaviour. We need to write a data migration that adds a filter item to the panel WHERE clause for each group by key. For every new panel, users, who have more context, can make the judgement and add the <key> EXISTS filter as needed.

Overall, I believe the fix in general should be implemented but how we handle the legitimate zero-values is the only concern.

@nityanandagohain nityanandagohain mentioned this pull request Jul 25, 2024
Successfully merging this pull request may close these issues.

Remove the default exists condition for group by keys?
3 participants