Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ruler: Add timeout for remote rule evaluation queries. #2090

Merged
merged 4 commits into from
Jun 13, 2022

Conversation

stevesg
Copy link
Contributor

@stevesg stevesg commented Jun 13, 2022

During a rule group evaluation, if the remote query never completes for any
reason, then the rule group will become stuck and never evaluate again.

During a rule group evaluation, if the remote query never completes for any
reason, then the rule group will become stuck and never evaluate again.
@stevesg stevesg force-pushed the ruler-remote-query-timeout branch from e83ed94 to d4e5a3e Compare June 13, 2022 13:39
@stevesg stevesg marked this pull request as ready for review June 13, 2022 13:40
CHANGELOG.md Outdated Show resolved Hide resolved
Copy link
Member

@pstibrany pstibrany left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you. I was wondering if we should set timeout parameter when running the query, but setting context deadline works too thanks to using httpgrpc. Furthermore query-frontend doesn't even seem to be using timeout parameter (which looks like bug, but for separate PR).

CHANGELOG.md Outdated Show resolved Hide resolved
Copy link
Contributor

@gotjosh gotjosh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@@ -211,6 +222,9 @@ func (q *RemoteQuerier) query(ctx context.Context, query string, ts time.Time, l
}
}

ctx, cancel := context.WithTimeout(ctx, q.timeout)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would move this whole block :225-:236 into a new function and then call it from query() and Read(), otherwise this portion is duplicate and this would also make it easier to reason about where the timeout starts to apply.

Potentially the initialization of the middlewares could also be in that new function, but I'm not sure if this wouldn't dilute it's scope too much.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was actually how I had it to start with, but it felt over complicated. I'll discuss with @ortuman if we want to do any refactoring here in a follow up PR.

@stevesg stevesg merged commit e42496e into main Jun 13, 2022
@stevesg stevesg deleted the ruler-remote-query-timeout branch June 13, 2022 18:09
@@ -1400,6 +1400,10 @@ query_frontend:
# CLI flag: -ruler.query-frontend.address
[address: <string> | default = ""]

# The timeout for a rule query being evaluated by the query-frontend.
# CLI flag: -ruler.query-frontend.timeout
[timeout: <duration> | default = 2m]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not convinced about having yet another timeout config option. Why not reusing -querier.timeout, documenting that needs to be set for ruler too when remote rule evaluation is used? We already use it in query-frontend too.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Created an issue: #2129

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-querier.timeout is poorly named (perhaps "-query.timeout" would be better) but it would make sense to reuse it, since ruler already uses it when it's not configured to use query-frontend.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree it's poorly named, but unfortunately changing it would be a breaking change (so to rename we would have to go through a deprecation process keeping backward compatibility for 2 releases). For the sake of config reduction, I think just reusing it is enough.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants