Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ResponseOps] log error when ES Query rules find docs out of time range #186332

Merged
merged 6 commits into from
Jul 23, 2024

Conversation

pmuellr
Copy link
Member

@pmuellr pmuellr commented Jun 17, 2024

resolves #175980

Summary

Adds a check with logging if an ES Query rule returns hits which fall outside the time range it's searching. This shouldn't ever happen, but seems to be happening on rare occaisons, so we wanted to add some diagnostics to try to help narrow down the problem.

Note that the ES|QL flavor rule does not use this diagnostic, just search source (KQL) and query dsl.

We check 3 things:

  • ensure the dateStart sent to fetch was valid
  • ensure the dateEnd sent to fetch was valid
  • ensure the relevant time fields in hits are within the dateStart/dateEnd range

These produce three different error messages:

For rule '<rule-id>', hits were returned with invalid time range start date '<date>' from field '<field>' using query <query>

For rule '<rule-id>', hits were returned with invalid time range end date '<date>' from field '<field>' using query <query>

For rule '<rule-id>', the hit with date '<date>' from field '<field>' is outside the query time range. Query: <query>. Document: <document>

Each message has one tag on it: query-result-out-of-time-range

To Verify

To test invalid dateStart/dateEnd, hack the Kibana code to set the values to NaN's:

const epochStart = new Date(dateStart).getTime();
const epochEnd = new Date(dateEnd).getTime();

For instance, change that to:

const epochStart = new Date('x').getTime();
const epochEnd = new Date('y').getTime();

To test the invdivual document hits, first back out the change you made above - when those error, the checks we're testing below do not run. Hack the Kibana code to make the time out of range:

const epochDate = getEpochDateFromString(dateVal);

For instance, change that to:

const epochDate = epochStart - 100

For both tests, create an es query rule - kql or dsl - make the relevant changes, and arrange for the rule to get hits each time. The relevant messages should be logged in the Kibana console when the rule runs.

Checklist

Delete any items that are not applicable to this PR.

@pmuellr pmuellr added Feature:Alerting release_note:skip Skip the PR/issue when compiling release notes Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) Feature:Alerting/RulesFramework Issues related to the Alerting Rules Framework labels Jun 17, 2024
@pmuellr
Copy link
Member Author

pmuellr commented Jun 17, 2024

/ci

@pmuellr
Copy link
Member Author

pmuellr commented Jul 8, 2024

I think it might also be useful to add the calculated latestTimestamp to the action context, as used in the rule executor https://github.com/elastic/kibana/blob/main/x-pack/plugins/stack_alerts/server/rule_types/es_query/executor.ts

We're starting to see folks send us context variables, so we'd get the value used, via that method, without other debugging.

update: decided to not do this; I'm guessing we may need to tweak what we log, or find we need the info elsewhere, so I think we can defer doing this, till we find we need it.

resolves elastic#175980

Adds a check with logging if an ES Query rule returns hits which fall
outside the time range it's searching.  This shouldn't ever happen, but
seems to be happening on rare occaisons, so we wanted to add some
diagnostics to try to help narrow down the problem.
@pmuellr pmuellr force-pushed the 175980-esq-extra-date-checks branch from 1535a1b to dd6c861 Compare July 12, 2024 00:16
@pmuellr
Copy link
Member Author

pmuellr commented Jul 12, 2024

/ci

@pmuellr
Copy link
Member Author

pmuellr commented Jul 12, 2024

/ci

@pmuellr pmuellr marked this pull request as ready for review July 12, 2024 17:24
@pmuellr pmuellr requested a review from a team as a code owner July 12, 2024 17:24
@elasticmachine
Copy link
Contributor

Pinging @elastic/response-ops (Team:ResponseOps)

@pmuellr
Copy link
Member Author

pmuellr commented Jul 12, 2024

/ci

@ymao1
Copy link
Contributor

ymao1 commented Jul 16, 2024

@elasticmachine merge upstream

Copy link
Contributor

@ymao1 ymao1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Verified works as expected.

@elasticmachine
Copy link
Contributor

💛 Build succeeded, but was flaky

Failed CI Steps

Metrics [docs]

✅ unchanged

History

@pmuellr pmuellr requested a review from doakalexi July 22, 2024 14:53
@pmuellr
Copy link
Member Author

pmuellr commented Jul 23, 2024

@elasticmachine merge upstream

@pmuellr pmuellr added the backport:prev-minor Backport to (8.x) the previous minor version (i.e. one version back from main) label Jul 23, 2024
@kibana-ci
Copy link
Collaborator

💛 Build succeeded, but was flaky

Failed CI Steps

Test Failures

  • [job] [logs] FTR Configs #100 / Screenshots - serverless security UI response ops docs security cases security case settings case settings screenshot

Metrics [docs]

✅ unchanged

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

@pmuellr pmuellr merged commit e12e449 into elastic:main Jul 23, 2024
25 checks passed
@kibanamachine
Copy link
Contributor

💔 All backports failed

Status Branch Result
8.15 Backport failed because of merge conflicts

Manual backport

To create the backport manually run:

node scripts/backport --pr 186332

Questions ?

Please refer to the Backport tool documentation

@pmuellr
Copy link
Member Author

pmuellr commented Jul 23, 2024

💚 All backports created successfully

Status Branch Result
8.15

Note: Successful backport PRs will be merged automatically after passing CI.

Questions ?

Please refer to the Backport tool documentation

pmuellr added a commit to pmuellr/kibana that referenced this pull request Jul 23, 2024
…ge (elastic#186332)

resolves elastic#175980

## Summary

Adds a check with logging if an ES Query rule returns hits which fall
outside the time range it's searching. This shouldn't ever happen, but
seems to be happening on rare occaisons, so we wanted to add some
diagnostics to try to help narrow down the problem.

Note that the ES|QL flavor rule does not use this diagnostic, just
search source (KQL) and query dsl.

We check 3 things:
- ensure the `dateStart` sent to fetch was valid
- ensure the `dateEnd` sent to fetch was valid
- ensure the relevant time fields in hits are within the
dateStart/dateEnd range

These produce three different error messages:

`For rule '<rule-id>', hits were returned with invalid time range start
date '<date>' from field '<field>' using query <query>`

`For rule '<rule-id>', hits were returned with invalid time range end
date '<date>' from field '<field>' using query <query>`

`For rule '<rule-id>', the hit with date '<date>' from field '<field>'
is outside the query time range. Query: <query>. Document: <document>`

Each message has one tag on it: `query-result-out-of-time-range`

## To Verify

To test invalid dateStart/dateEnd, hack the Kibana code to set the
values to NaN's:

https://github.com/elastic/kibana/blob/d30da09707f85d84d7fd555733ba8e0cb595228b/x-pack/plugins/stack_alerts/server/rule_types/es_query/executor.ts#L263-L264

For instance, change that to:

    const epochStart = new Date('x').getTime();
    const epochEnd = new Date('y').getTime();

To test the invdivual document hits, first back out the change you made
above - when those error, the checks we're testing below do not run.
Hack the Kibana code to make the time out of range:

https://github.com/elastic/kibana/blob/d30da09707f85d84d7fd555733ba8e0cb595228b/x-pack/plugins/stack_alerts/server/rule_types/es_query/executor.ts#L294

For instance, change that to:

    const epochDate = epochStart - 100

For both tests, create an es query rule - kql or dsl - make the relevant
changes, and arrange for the rule to get hits each time. The relevant
messages should be logged in the Kibana console when the rule runs.

### Checklist

Delete any items that are not applicable to this PR.

- [x] [Unit or functional
tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)
were updated or added to match the most common scenarios

---------

Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
(cherry picked from commit e12e449)

# Conflicts:
#	x-pack/plugins/stack_alerts/server/rule_types/es_query/lib/fetch_search_source_query.ts
pmuellr added a commit that referenced this pull request Jul 24, 2024
…ime range (#186332) (#189019)

# Backport

This will backport the following commits from `main` to `8.15`:
- [[ResponseOps] log error when ES Query rules find docs out of time
range (#186332)](#186332)

<!--- Backport version: 8.9.8 -->

### Questions ?
Please refer to the [Backport tool
documentation](https://github.com/sqren/backport)

<!--BACKPORT [{"author":{"name":"Patrick
Mueller","email":"patrick.mueller@elastic.co"},"sourceCommit":{"committedDate":"2024-07-23T20:17:11Z","message":"[ResponseOps]
log error when ES Query rules find docs out of time range
(#186332)\n\nresolves
https://github.com/elastic/kibana/issues/175980\r\n\r\n##
Summary\r\n\r\nAdds a check with logging if an ES Query rule returns
hits which fall\r\noutside the time range it's searching. This shouldn't
ever happen, but\r\nseems to be happening on rare occaisons, so we
wanted to add some\r\ndiagnostics to try to help narrow down the
problem.\r\n\r\nNote that the ES|QL flavor rule does not use this
diagnostic, just\r\nsearch source (KQL) and query dsl.\r\n\r\nWe check 3
things:\r\n- ensure the `dateStart` sent to fetch was valid\r\n- ensure
the `dateEnd` sent to fetch was valid\r\n- ensure the relevant time
fields in hits are within the\r\ndateStart/dateEnd range\r\n\r\nThese
produce three different error messages:\r\n\r\n`For rule '<rule-id>',
hits were returned with invalid time range start\r\ndate '<date>' from
field '<field>' using query <query>`\r\n\r\n`For rule '<rule-id>', hits
were returned with invalid time range end\r\ndate '<date>' from field
'<field>' using query <query>`\r\n\r\n`For rule '<rule-id>', the hit
with date '<date>' from field '<field>'\r\nis outside the query time
range. Query: <query>. Document: <document>`\r\n\r\nEach message has one
tag on it: `query-result-out-of-time-range`\r\n\r\n## To
Verify\r\n\r\nTo test invalid dateStart/dateEnd, hack the Kibana code to
set the\r\nvalues to
NaN's:\r\n\r\n\r\nhttps://github.com/elastic/kibana/blob/d30da09707f85d84d7fd555733ba8e0cb595228b/x-pack/plugins/stack_alerts/server/rule_types/es_query/executor.ts#L263-L264\r\n\r\nFor
instance, change that to:\r\n\r\n const epochStart = new
Date('x').getTime();\r\n const epochEnd = new
Date('y').getTime();\r\n\r\nTo test the invdivual document hits, first
back out the change you made\r\nabove - when those error, the checks
we're testing below do not run.\r\nHack the Kibana code to make the time
out of
range:\r\n\r\n\r\nhttps://github.com/elastic/kibana/blob/d30da09707f85d84d7fd555733ba8e0cb595228b/x-pack/plugins/stack_alerts/server/rule_types/es_query/executor.ts#L294\r\n\r\nFor
instance, change that to:\r\n\r\n const epochDate = epochStart -
100\r\n\r\nFor both tests, create an es query rule - kql or dsl - make
the relevant\r\nchanges, and arrange for the rule to get hits each time.
The relevant\r\nmessages should be logged in the Kibana console when the
rule runs.\r\n\r\n### Checklist\r\n\r\nDelete any items that are not
applicable to this PR.\r\n\r\n- [x] [Unit or
functional\r\ntests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)\r\nwere
updated or added to match the most common
scenarios\r\n\r\n---------\r\n\r\nCo-authored-by: Elastic Machine
<elasticmachine@users.noreply.github.com>","sha":"e12e4496e0594d34509ef9235d1d7b7e3461e5d8","branchLabelMapping":{"^v8.16.0$":"main","^v(\\d+).(\\d+).\\d+$":"$1.$2"}},"sourcePullRequest":{"labels":["Feature:Alerting","release_note:skip","Team:ResponseOps","Feature:Alerting/RulesFramework","backport:prev-minor","v8.16.0"],"number":186332,"url":"https://github.com/elastic/kibana/pull/186332","mergeCommit":{"message":"[ResponseOps]
log error when ES Query rules find docs out of time range
(#186332)\n\nresolves
https://github.com/elastic/kibana/issues/175980\r\n\r\n##
Summary\r\n\r\nAdds a check with logging if an ES Query rule returns
hits which fall\r\noutside the time range it's searching. This shouldn't
ever happen, but\r\nseems to be happening on rare occaisons, so we
wanted to add some\r\ndiagnostics to try to help narrow down the
problem.\r\n\r\nNote that the ES|QL flavor rule does not use this
diagnostic, just\r\nsearch source (KQL) and query dsl.\r\n\r\nWe check 3
things:\r\n- ensure the `dateStart` sent to fetch was valid\r\n- ensure
the `dateEnd` sent to fetch was valid\r\n- ensure the relevant time
fields in hits are within the\r\ndateStart/dateEnd range\r\n\r\nThese
produce three different error messages:\r\n\r\n`For rule '<rule-id>',
hits were returned with invalid time range start\r\ndate '<date>' from
field '<field>' using query <query>`\r\n\r\n`For rule '<rule-id>', hits
were returned with invalid time range end\r\ndate '<date>' from field
'<field>' using query <query>`\r\n\r\n`For rule '<rule-id>', the hit
with date '<date>' from field '<field>'\r\nis outside the query time
range. Query: <query>. Document: <document>`\r\n\r\nEach message has one
tag on it: `query-result-out-of-time-range`\r\n\r\n## To
Verify\r\n\r\nTo test invalid dateStart/dateEnd, hack the Kibana code to
set the\r\nvalues to
NaN's:\r\n\r\n\r\nhttps://github.com/elastic/kibana/blob/d30da09707f85d84d7fd555733ba8e0cb595228b/x-pack/plugins/stack_alerts/server/rule_types/es_query/executor.ts#L263-L264\r\n\r\nFor
instance, change that to:\r\n\r\n const epochStart = new
Date('x').getTime();\r\n const epochEnd = new
Date('y').getTime();\r\n\r\nTo test the invdivual document hits, first
back out the change you made\r\nabove - when those error, the checks
we're testing below do not run.\r\nHack the Kibana code to make the time
out of
range:\r\n\r\n\r\nhttps://github.com/elastic/kibana/blob/d30da09707f85d84d7fd555733ba8e0cb595228b/x-pack/plugins/stack_alerts/server/rule_types/es_query/executor.ts#L294\r\n\r\nFor
instance, change that to:\r\n\r\n const epochDate = epochStart -
100\r\n\r\nFor both tests, create an es query rule - kql or dsl - make
the relevant\r\nchanges, and arrange for the rule to get hits each time.
The relevant\r\nmessages should be logged in the Kibana console when the
rule runs.\r\n\r\n### Checklist\r\n\r\nDelete any items that are not
applicable to this PR.\r\n\r\n- [x] [Unit or
functional\r\ntests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)\r\nwere
updated or added to match the most common
scenarios\r\n\r\n---------\r\n\r\nCo-authored-by: Elastic Machine
<elasticmachine@users.noreply.github.com>","sha":"e12e4496e0594d34509ef9235d1d7b7e3461e5d8"}},"sourceBranch":"main","suggestedTargetBranches":[],"targetPullRequestStates":[{"branch":"main","label":"v8.16.0","labelRegex":"^v8.16.0$","isSourceBranch":true,"state":"MERGED","url":"https://github.com/elastic/kibana/pull/186332","number":186332,"mergeCommit":{"message":"[ResponseOps]
log error when ES Query rules find docs out of time range
(#186332)\n\nresolves
https://github.com/elastic/kibana/issues/175980\r\n\r\n##
Summary\r\n\r\nAdds a check with logging if an ES Query rule returns
hits which fall\r\noutside the time range it's searching. This shouldn't
ever happen, but\r\nseems to be happening on rare occaisons, so we
wanted to add some\r\ndiagnostics to try to help narrow down the
problem.\r\n\r\nNote that the ES|QL flavor rule does not use this
diagnostic, just\r\nsearch source (KQL) and query dsl.\r\n\r\nWe check 3
things:\r\n- ensure the `dateStart` sent to fetch was valid\r\n- ensure
the `dateEnd` sent to fetch was valid\r\n- ensure the relevant time
fields in hits are within the\r\ndateStart/dateEnd range\r\n\r\nThese
produce three different error messages:\r\n\r\n`For rule '<rule-id>',
hits were returned with invalid time range start\r\ndate '<date>' from
field '<field>' using query <query>`\r\n\r\n`For rule '<rule-id>', hits
were returned with invalid time range end\r\ndate '<date>' from field
'<field>' using query <query>`\r\n\r\n`For rule '<rule-id>', the hit
with date '<date>' from field '<field>'\r\nis outside the query time
range. Query: <query>. Document: <document>`\r\n\r\nEach message has one
tag on it: `query-result-out-of-time-range`\r\n\r\n## To
Verify\r\n\r\nTo test invalid dateStart/dateEnd, hack the Kibana code to
set the\r\nvalues to
NaN's:\r\n\r\n\r\nhttps://github.com/elastic/kibana/blob/d30da09707f85d84d7fd555733ba8e0cb595228b/x-pack/plugins/stack_alerts/server/rule_types/es_query/executor.ts#L263-L264\r\n\r\nFor
instance, change that to:\r\n\r\n const epochStart = new
Date('x').getTime();\r\n const epochEnd = new
Date('y').getTime();\r\n\r\nTo test the invdivual document hits, first
back out the change you made\r\nabove - when those error, the checks
we're testing below do not run.\r\nHack the Kibana code to make the time
out of
range:\r\n\r\n\r\nhttps://github.com/elastic/kibana/blob/d30da09707f85d84d7fd555733ba8e0cb595228b/x-pack/plugins/stack_alerts/server/rule_types/es_query/executor.ts#L294\r\n\r\nFor
instance, change that to:\r\n\r\n const epochDate = epochStart -
100\r\n\r\nFor both tests, create an es query rule - kql or dsl - make
the relevant\r\nchanges, and arrange for the rule to get hits each time.
The relevant\r\nmessages should be logged in the Kibana console when the
rule runs.\r\n\r\n### Checklist\r\n\r\nDelete any items that are not
applicable to this PR.\r\n\r\n- [x] [Unit or
functional\r\ntests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)\r\nwere
updated or added to match the most common
scenarios\r\n\r\n---------\r\n\r\nCo-authored-by: Elastic Machine
<elasticmachine@users.noreply.github.com>","sha":"e12e4496e0594d34509ef9235d1d7b7e3461e5d8"}}]}]
BACKPORT-->
TinLe added a commit to TinLe/kibana that referenced this pull request Jul 30, 2024
* master: (3487 commits)
  `BedrockChat` & `GeminiChat` (elastic#186809)
  [ResponseOps] log error when ES Query rules find docs out of time range (elastic#186332)
  skip flaky suite (elastic#188997)
  [Security solution][Alert Details] Enable preview feature flag and cypress tests (elastic#188580)
  [EuiProviders] Warn Developer if EuiProvider is missing (elastic#184608)
  [Security Solution ] Fixes Timeline infinite loading bug (elastic#188943)
  Improve SearchSource SearchRequest type (elastic#186862)
  Deprecate Search Sessions config (elastic#188037)
  [Synthetics] Add missing monitorType and tag info in cards !! (elastic#188824)
  [Console Monaco] Resolve uncaught error from tokenizer (elastic#188746)
  [Data Forge] Add `service.logs` dataset as a  data stream (elastic#188786)
  [Console] Fix failing bulk requests (elastic#188552)
  Update dependency terser to ^5.31.2 (main) (elastic#188528)
  [APM][ECO] Telemetry (elastic#188627)
  [Fleet] Fix uninstall package validation accross space (elastic#188749)
  Update warning on `xpack.fleet.enableExperimental` (elastic#188917)
  [DOCS][Cases] Automate more screenshots for cases (elastic#188697)
  [Fleet] Fix get one agent when feature flag disabled (elastic#188953)
  chore(investigate): Add investigate-app plugin from poc (elastic#188122)
  [Monaco Editor] Add Search functionality (elastic#188337)
  ...
TinLe added a commit to TinLe/kibana that referenced this pull request Jul 30, 2024
* master: (2400 commits)
  `BedrockChat` & `GeminiChat` (elastic#186809)
  [ResponseOps] log error when ES Query rules find docs out of time range (elastic#186332)
  skip flaky suite (elastic#188997)
  [Security solution][Alert Details] Enable preview feature flag and cypress tests (elastic#188580)
  [EuiProviders] Warn Developer if EuiProvider is missing (elastic#184608)
  [Security Solution ] Fixes Timeline infinite loading bug (elastic#188943)
  Improve SearchSource SearchRequest type (elastic#186862)
  Deprecate Search Sessions config (elastic#188037)
  [Synthetics] Add missing monitorType and tag info in cards !! (elastic#188824)
  [Console Monaco] Resolve uncaught error from tokenizer (elastic#188746)
  [Data Forge] Add `service.logs` dataset as a  data stream (elastic#188786)
  [Console] Fix failing bulk requests (elastic#188552)
  Update dependency terser to ^5.31.2 (main) (elastic#188528)
  [APM][ECO] Telemetry (elastic#188627)
  [Fleet] Fix uninstall package validation accross space (elastic#188749)
  Update warning on `xpack.fleet.enableExperimental` (elastic#188917)
  [DOCS][Cases] Automate more screenshots for cases (elastic#188697)
  [Fleet] Fix get one agent when feature flag disabled (elastic#188953)
  chore(investigate): Add investigate-app plugin from poc (elastic#188122)
  [Monaco Editor] Add Search functionality (elastic#188337)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport:prev-minor Backport to (8.x) the previous minor version (i.e. one version back from main) Feature:Alerting/RulesFramework Issues related to the Alerting Rules Framework Feature:Alerting release_note:skip Skip the PR/issue when compiling release notes Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) v8.15.0 v8.16.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Response Ops][Alerting] Investigate ES query rule firing unexpectedly
5 participants