[Security Solution] [Attack discovery] Output chunking / refinement, LangGraph migration, and evaluation improvements #195669

Conversation

andrew-goldstein
Contributor

@andrew-goldstein andrew-goldstein commented Oct 9, 2024

[Security Solution] [Attack discovery] Output chunking / refinement, LangGraph migration, and evaluation improvements

Summary

This PR improves the Attack discovery user and developer experience with output chunking / refinement, migration to LangGraph, and improvements to evaluations.

The improvements were realized by transitioning from direct use of lower-level LangChain APIs to LangGraph, and by integrating more deeply with the evaluation features of LangSmith.

Output chunking

Output chunking increases the maximum and default number of alerts sent as context, working around the output token limitations of popular large language models (LLMs):

|                | Old | New |
|----------------|-----|-----|
| max alerts     | 100 | 500 |
| default alerts | 20  | 200 |

See Output chunking details below for more information.

Settings

A new settings modal makes it possible to configure the number of alerts sent as context directly from the Attack discovery page:

settings

  • Previously, users configured this value for Attack discovery via the security assistant Knowledge base settings, as documented here: https://www.elastic.co/guide/en/security/8.15/attack-discovery.html#attack-discovery-generate-discoveries
  • The new settings modal uses local storage instead of the previously-shared assistant Knowledge base setting, which is stored in Elasticsearch (see the sketch below)
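
A minimal sketch of how a per-browser setting like this might be read and written with local storage. The storage key, default, and helper names below are illustrative assumptions, not the actual identifiers used in the PR:

```typescript
// Illustrative only: the real storage key and default are defined in the Kibana code.
const MAX_ALERTS_STORAGE_KEY = 'securitySolution.attackDiscovery.maxAlerts'; // hypothetical key
const DEFAULT_MAX_ALERTS = 200;

export const getMaxAlerts = (): number => {
  const stored = localStorage.getItem(MAX_ALERTS_STORAGE_KEY);
  const parsed = stored != null ? Number(stored) : NaN;
  return Number.isFinite(parsed) && parsed > 0 ? parsed : DEFAULT_MAX_ALERTS;
};

export const setMaxAlerts = (value: number): void => {
  localStorage.setItem(MAX_ALERTS_STORAGE_KEY, String(value));
};
```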

Output refinement

Output refinement automatically combines related discoveries that were previously represented as two or more separate discoveries:

default_attack_discovery_graph

  • The refine step in the graph diagram above may, for example, combine three related discoveries from the generate step into two (a simplified wiring sketch follows below)
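
For orientation, here is a simplified sketch of how a generate → refine graph can be wired with LangGraph's StateGraph. It mirrors the diagram above in spirit only; the actual graph in this PR tracks more state (alerts, replacements, attempt counters, etc.) and uses conditional routing for retries:

```typescript
import { StateGraph, Annotation, START, END } from '@langchain/langgraph';

// Simplified state: the real graph carries alerts, replacements, attempt counters, and more.
const GraphState = Annotation.Root({
  discoveries: Annotation<string[]>({ reducer: (_prev, next) => next, default: () => [] }),
});

// Placeholder nodes: generate discoveries from alerts, then combine related ones.
const generate = async (_state: typeof GraphState.State) => ({
  discoveries: ['discovery 1', 'discovery 2', 'discovery 3'],
});
const refine = async (state: typeof GraphState.State) => ({
  discoveries: state.discoveries.slice(0, 2), // e.g. three related discoveries combined into two
});

export const graph = new StateGraph(GraphState)
  .addNode('generate', generate)
  .addNode('refine', refine)
  .addEdge(START, 'generate')
  .addEdge('generate', 'refine')
  .addEdge('refine', END)
  .compile();
```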

Hallucination detection

New hallucination detection displays an error in lieu of showing hallucinated output:

hallucination_detection

  • A new tour step was added to the Attack discovery page to share the improvements:

tour_step

Summary of improvements for developers

The following features improve the developer experience when running evaluations for Attack discovery:

Replay alerts in evaluations

This evaluation feature eliminates the need to populate a local environment with alerts to (re)run evaluations:

alerts_as_input

Alert replay skips the retrieve_anonymized_alerts step in the graph, because it uses the anonymizedAlerts and replacements provided as Input in a dataset example. See Replay alerts in evaluations details below for more information.

Override graph state

Override graph state via dataset examples to test prompt improvements and edge cases in evaluations:

override_graph_input

To use this feature, add an overrides key to the Input of a dataset example. See Override graph state details below for more information.

New custom evaluator

Prior to this PR, an evaluator had to be manually added to each dataset in LangSmith to use an LLM as the judge for correctness.

This PR introduces a custom, programmatic evaluator that handles anonymization automatically, and eliminates the need to manually create evaluators in LangSmith. To use it, simply run evaluations from the Evaluation tab in settings.
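
Conceptually, a programmatic evaluator that handles anonymization automatically needs to restore the original values before an LLM judge compares a prediction with the expected output. The sketch below is an assumption about that flow, with hypothetical helper and parameter names, not the evaluator code added in this PR:

```typescript
// Illustrative: replacements map anonymized tokens back to the original values.
type Replacements = Record<string, string>; // e.g. { 'host-uuid-1': 'web-server-01' }

// Restore original values in text before it is shown to the judge model.
const deanonymize = (text: string, replacements: Replacements): string =>
  Object.entries(replacements).reduce(
    (acc, [anonymized, original]) => acc.split(anonymized).join(original),
    text
  );

// Sketch of a correctness evaluator: de-anonymize both sides, then ask an LLM judge for a score.
export const evaluateCorrectness = async ({
  prediction,
  reference,
  replacements,
  judge, // hypothetical judge callback returning a score in [0, 1]
}: {
  prediction: string;
  reference: string;
  replacements: Replacements;
  judge: (prompt: string) => Promise<number>;
}) => ({
  key: 'correctness',
  score: await judge(
    [
      'Grade how well the prediction matches the reference.',
      `Prediction: ${deanonymize(prediction, replacements)}`,
      `Reference: ${deanonymize(reference, replacements)}`,
    ].join('\n')
  ),
});
```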

New evaluation settings

This PR introduces new settings in the Evaluation tab:

new_evaluation_settings

New evaluation settings:

  • Evaluator model (optional) - Judge the quality of predictions using a single model. (Default: use the same model as the connector)

This new setting is useful when you want to use the same model, e.g. GPT-4o, to judge the quality of all the models evaluated in an experiment.

  • Default max alerts - The default maximum number of alerts to send as context, which may be overridden by the example input

This new setting is useful when using the alerts in the local environment to run evaluations. Examples that use the Alerts replay feature will ignore this value, because the alerts in the example Input will be used instead.

Directory structure refactoring

  • The server-side directory structure was refactored to consolidate the location of Attack discovery related files

Details

This section describes some of the improvements above in detail.

Output chunking details

The new output chunking feature increases the maximum and default number of alerts that may be sent as context. It achieves this improvement by working around output token limitations.

LLMs have different limits for the number of tokens accepted as input for requests, and the number of tokens available for output when generating responses.

Today, the output token limits of most popular models are significantly smaller than the input token limits.

For example, at the time of this writing, the Gemini 1.5 Pro model's limits are (source):

  • Input token limit: 2,097,152
  • Output token limit: 8,192

As a result of this relatively smaller output token limit, previous versions of Attack discovery would simply fail when an LLM ran out of output tokens when generating a response. This often happened "mid sentence", and resulted in errors or hallucinations being displayed to users.

The new output chunking feature detects incomplete responses from the LLM in the generate step of the Graph. When an incomplete response is detected, the generate step will run again with:

  • The original prompt
  • The Alerts provided as context
  • The partially generated response
  • Instructions to "continue where you left off"

The generate step in the graph will run until one of the following conditions is met:

  • The incomplete response can be successfully parsed
  • The maximum number of generation attempts (default: 10) is reached
  • The maximum number of hallucinations detected (default: 5) is reached
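
A minimal sketch of that retry-with-continuation loop, assuming hypothetical helpers for parsing and hallucination detection; it illustrates the control flow described above rather than the PR's actual implementation:

```typescript
// Placeholder helpers, illustrative only; the real logic lives in the graph's generate node.
const tryParse = (text: string): unknown[] | null => {
  try {
    const parsed = JSON.parse(text);
    return Array.isArray(parsed) ? parsed : null;
  } catch {
    return null; // the accumulated response is still incomplete
  }
};
const looksHallucinated = (text: string): boolean => text.includes('<hallucination-marker>'); // stand-in check

export const generateWithChunking = async (
  prompt: string,
  callLlm: (fullPrompt: string) => Promise<string>,
  maxGenerationAttempts = 10,
  maxHallucinationFailures = 5
): Promise<unknown[]> => {
  let combined = ''; // partial response accumulated across attempts
  let attempts = 0;
  let hallucinations = 0;

  while (attempts < maxGenerationAttempts && hallucinations < maxHallucinationFailures) {
    attempts++;
    const fullPrompt =
      combined === ''
        ? prompt // original prompt, with the alerts provided as context
        : `${prompt}\n\nPartial response so far:\n${combined}\n\nContinue where you left off.`;
    const chunk = await callLlm(fullPrompt);

    if (looksHallucinated(chunk)) {
      hallucinations++;
      combined = ''; // discard accumulated generations and restart
      continue;
    }

    combined += chunk;
    const parsed = tryParse(combined);
    if (parsed !== null) {
      return parsed; // the previously incomplete response now parses successfully
    }
  }

  throw new Error('Reached the maximum generation attempts or hallucination limit');
};
```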

Output refinement details

The new output refinement feature automatically combines related discoveries that were previously represented as two or more separate discoveries.

The new refine step in the graph re-submits the discoveries from the generate step with a refinePrompt to combine related attack discoveries.

The refine step is subject to the model's output token limits, just like the generate step. That means a response to the refine prompt from the LLM may be cut off mid-sentence. To that end:

  • The refine step will re-run until it reaches the maxGenerationAttempts and maxHallucinationFailures limits, which are shared with the generate step
  • The maximum number of attempts (default: 10) is shared with the generate step. For example, if it took 7 tries (generationAttempts) to complete the generate step, the refine step will only run up to 3 times.

The refine step will return unrefined results from the generate step when:

  • The generate step uses all 10 generation attempts. When this happens, the refine step will be skipped, and the unrefined output of the generate step will be returned to the user
  • The refine step uses all remaining attempts, but fails to produce a refined response due to output token limitations or hallucinations in the refined response
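
To make the shared budget concrete, here is a hedged sketch of how the remaining attempts and the unrefined fallback might be computed (names are illustrative):

```typescript
// Illustrative: generate and refine draw from the same attempt budget (default: 10).
interface Budget {
  generationAttempts: number; // attempts already consumed by the generate step, e.g. 7
  maxGenerationAttempts: number; // shared limit, default 10
}

const remainingRefineAttempts = ({ generationAttempts, maxGenerationAttempts }: Budget): number =>
  Math.max(0, maxGenerationAttempts - generationAttempts); // e.g. 10 - 7 = 3

export const refineOrFallback = async <T>(
  unrefined: T,
  budget: Budget,
  refineOnce: () => Promise<T | null> // resolves to null when the refined response cannot be parsed
): Promise<T> => {
  for (let attempt = 0; attempt < remainingRefineAttempts(budget); attempt++) {
    const refined = await refineOnce();
    if (refined !== null) {
      return refined;
    }
  }
  return unrefined; // no budget left (or refinement kept failing): return the unrefined discoveries
};
```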

Hallucination detection details

Before this PR, Attack discovery directly used lower-level LangChain APIs to parse responses from the LLM. After this PR, Attack discovery uses LangGraph.

In the previous implementation, when Attack discovery received an incomplete response because the output token limits of a model were hit, the LangChain APIs automatically re-submitted the incomplete response in an attempt to "repair" it. However, the re-submitted results didn't include all of the original context (i.e. the alerts that generated them). The repair process often resulted in hallucinated results being presented to users, especially with some models, e.g. Claude 3.5 Haiku.

In this PR, the generate and refine steps detect (some) hallucinations. When hallucinations are detected:

  • The current accumulated generations or refinements are (respectively) discarded, effectively restarting the generate or refine process
  • The generate and refine steps will be retried until the maximum generation attempts (default: 10) or hallucinations detected (default: 5) limits are reached

Hitting the hallucination limit during the generate step will result in an error being displayed to the user.

Hitting the hallucination limit during the refine step will result in the unrefined discoveries being displayed to the user.
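
One plausible way to detect a class of hallucinations, offered here only as an illustrative assumption and not necessarily the check implemented in this PR, is to verify that the anonymized entity tokens mentioned in the output actually appear in the alerts that were provided as context:

```typescript
// Illustrative assumption: flag output that references anonymized UUID-style tokens
// that never appeared in the alert context supplied to the model.
const extractAnonymizedTokens = (text: string): string[] =>
  text.match(/[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}/gi) ?? [];

export const containsHallucinatedEntities = (output: string, anonymizedAlerts: string): boolean =>
  extractAnonymizedTokens(output).some(
    (token) => !anonymizedAlerts.toLowerCase().includes(token.toLowerCase())
  );
```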

Replay alerts in evaluations details

Alerts replay makes it possible to re-run evaluations, even when your local deployment has zero alerts.

This feature eliminates the chore of populating your local instance with specific alerts for each example.

Every example in a dataset may (optionally) specify a different set of alerts.

Alert replay skips the retrieve_anonymized_alerts step in the graph, because it uses the anonymizedAlerts and replacements provided as Input in a dataset example.
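
A hedged sketch of how that skip can be expressed as conditional routing in the graph; the state shape and routing function below are illustrative assumptions:

```typescript
// Illustrative: when the dataset example already supplies anonymizedAlerts (and replacements),
// the graph can route straight to generate instead of retrieve_anonymized_alerts.
interface ReplayState {
  anonymizedAlerts?: unknown[];
  replacements?: Record<string, string>;
}

export const routeAfterStart = (state: ReplayState): 'retrieve_anonymized_alerts' | 'generate' =>
  state.anonymizedAlerts !== undefined && state.anonymizedAlerts.length > 0
    ? 'generate' // alerts replayed from the example Input
    : 'retrieve_anonymized_alerts'; // otherwise query the local environment for alerts
```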

The following instructions document the process of creating a new LangSmith dataset example that uses the Alerts replay feature:

  1. In Kibana, navigate to Security > Attack discovery

  2. Click Generate to generate Attack discoveries

  3. In LangSmith, navigate to Projects > Your project

  4. In the Runs tab of the LangSmith project, click on the latest Attack discovery entry to open the trace

  5. IMPORTANT: In the trace, select the LAST ChannelWriteChannelWrite<attackDiscoveries,attackDisc... entry. The last entry will appear inside the LAST refine step in the trace, as illustrated by the screenshot below:

last_channel_write

  6. With the last ChannelWriteChannelWrite<attackDiscoveries,attackDisc... entry selected, click Add to > Add to Dataset

  7. Copy-paste the Input to the Output, because evaluation Experiments always compare the current run with the Output in an example.

  • This step is always required to create a dataset.
  • If you don't want to use the Alert replay feature, replace Input with an empty object: {}

  8. Choose an existing dataset, or create a new one

  9. Click the Submit button to add the example to the dataset.

After completing the steps above, the dataset is ready to be run in evaluations.

Override graph state details

When a dataset is run in an evaluation (to create Experiments):

  • The (optional) anonymizedAlerts and replacements provided as Input in the example will be replayed, bypassing the retrieve_anonymized_alerts step in the graph
  • The rest of the properties in Input will not be used as inputs to the graph
  • In contrast, an empty object {} in Input means the latest and riskiest alerts in the last 24 hours in the local environment will be queried

In addition to the above, you may add an optional overrides key in the Input of a dataset example to test changes or edge cases. This is useful for evaluating changes without updating the code directly.

The overrides set the initial state of the graph before it's run in an evaluation.

The example Input below overrides the prompts used in the generate and refine steps:

{
  "overrides": {
    "refinePrompt": "This overrides the refine prompt",
    "attackDiscoveryPrompt": "This overrides the attack discovery prompt"
  }
}

To use the overrides feature in evaluations to set the initial state of the graph:

  1. Create a dataset example, as documented in the Replay alerts in evaluations details section above

  2. In LangSmith, navigate to Datasets & Testing > Your Dataset

  3. In the dataset, click the Examples tab

  4. Click an example to open it in the flyout

  5. Click the Edit button to edit the example

  6. Add the overrides key shown below to the Input e.g.:

{
  "overrides": {
    "refinePrompt": "This overrides the refine prompt",
    "attackDiscoveryPrompt": "This overrides the attack discovery prompt"
  }
}
  7. Edit the overrides in the example Input above to add (or remove) entries that will determine the initial state of the graph.

All of the overrides shown in step 6 are optional. The refinePrompt and attackDiscoveryPrompt could be removed from the overrides example above, and replaced with maxGenerationAttempts to test a higher limit.

All valid graph state may be specified in overrides.
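
In practice, applying overrides amounts to spreading them over the graph's default initial state before the run. A minimal sketch, assuming illustrative field names and defaults:

```typescript
// Illustrative subset of the graph state; the real state contains additional fields.
interface AttackDiscoveryGraphState {
  attackDiscoveryPrompt: string;
  refinePrompt: string;
  maxGenerationAttempts: number;
  maxHallucinationFailures: number;
}

const DEFAULT_STATE: AttackDiscoveryGraphState = {
  attackDiscoveryPrompt: 'default attack discovery prompt',
  refinePrompt: 'default refine prompt',
  maxGenerationAttempts: 10,
  maxHallucinationFailures: 5,
};

export const getInitialGraphState = (
  overrides: Partial<AttackDiscoveryGraphState> = {}
): AttackDiscoveryGraphState => ({
  ...DEFAULT_STATE,
  ...overrides, // any valid graph state key from the example Input takes precedence
});
```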

@andrew-goldstein andrew-goldstein added release_note:enhancement v9.0.0 Team: SecuritySolution Security Solutions Team working on SIEM, Endpoint, Timeline, Resolver, etc. Team:Security Generative AI Security Generative AI v8.16.0 backport:version Backport to applied version labels labels Oct 9, 2024
@andrew-goldstein andrew-goldstein self-assigned this Oct 9, 2024
@andrew-goldstein andrew-goldstein requested review from a team as code owners October 9, 2024 18:50
@elasticmachine
Contributor

Pinging @elastic/security-solution (Team: SecuritySolution)

@andrew-goldstein andrew-goldstein force-pushed the attack_discovery_output_token_limits branch 6 times, most recently from 0945d68 to b7d98b5 Compare October 10, 2024 19:26
@andrew-goldstein andrew-goldstein added ci:cloud-deploy Create or update a Cloud deployment ci:cloud-persist-deployment Persist cloud deployment indefinitely labels Oct 10, 2024
@andrew-goldstein andrew-goldstein force-pushed the attack_discovery_output_token_limits branch 3 times, most recently from 2a75fa9 to 2366cb2 Compare October 14, 2024 17:15
@@ -78,6 +78,36 @@ export const CONNECTORS_LABEL = i18n.translate(
}
);

export const EVALUATOR_MODEL = i18n.translate(
Contributor

nit: Evals is always developer only, right? Never user facing? We probably don't need to be using i18n resources

Contributor Author

Yes, the audience of evals is developers, who must enable this UI via the assistantModelEvaluation feature flag. The original author of this file also used i18n, so the new entries are consistent with that style.

Contributor

It's a nit so no big deal, but it might be nice to remove this pattern to avoid unnecessary resource use

@andrew-goldstein andrew-goldstein force-pushed the attack_discovery_output_token_limits branch from 2366cb2 to e677f5a Compare October 15, 2024 00:28
@elasticmachine

This comment was marked as outdated.

@andrew-goldstein andrew-goldstein merged commit 2c21adb into elastic:main Oct 15, 2024
42 checks passed
@andrew-goldstein andrew-goldstein deleted the attack_discovery_output_token_limits branch October 15, 2024 14:39
@kibanamachine
Contributor

Starting backport for target branches: 8.x

https://github.com/elastic/kibana/actions/runs/11348436218

kibanamachine pushed a commit to kibanamachine/kibana that referenced this pull request Oct 15, 2024
…LangGraph migration, and evaluation improvements (elastic#195669)


(cherry picked from commit 2c21adb)
@kibanamachine
Contributor

💚 All backports created successfully

Status Branch Result
8.x

Note: Successful backport PRs will be merged automatically after passing CI.

Questions?

Please refer to the Backport tool documentation

kibanamachine added a commit that referenced this pull request Oct 15, 2024
…ment, LangGraph migration, and evaluation improvements (#195669) (#196334)

# Backport

This will backport the following commits from `main` to `8.x`:
- [[Security Solution] [Attack discovery] Output chunking / refinement, LangGraph migration, and evaluation improvements (#195669)](#195669)


### Questions?
Please refer to the [Backport tool documentation](https://github.com/sqren/backport)

<!--BACKPORT [{"author":{"name":"Andrew
Macri","email":"andrew.macri@elastic.co"},"sourceCommit":{"committedDate":"2024-10-15T14:39:48Z","message":"[Security
Solution] [Attack discovery] Output chunking / refinement, LangGraph
migration, and evaluation improvements (#195669)\n\n## [Security
Solution] [Attack discovery] Output chunking / refinement, LangGraph
migration, and evaluation improvements\r\n\r\n### Summary\r\n\r\nThis PR
improves the Attack discovery user and developer experience with output
chunking / refinement, migration to LangGraph, and improvements to
evaluations.\r\n\r\nThe improvements were realized by transitioning from
directly using lower-level LangChain apis to LangGraph in this PR, and a
deeper integration with the evaluation features of
LangSmith.\r\n\r\n#### Output chunking\r\n\r\n_Output chunking_
increases the maximum and default number of alerts sent as context,
working around the output token limitations of popular large language
models (LLMs):\r\n\r\n| | Old | New
|\r\n|----------------|-------|-------|\r\n| max alerts | `100` | `500`
|\r\n| default alerts | `20` | `200` |\r\n\r\nSee _Output chunking
details_ below for more information.\r\n\r\n#### Settings\r\n\r\nA new
settings modal makes it possible to configure the number of alerts sent
as context directly from the Attack discovery
page:\r\n\r\n![settings](https://github.com/user-attachments/assets/3f5ab4e9-5eae-4f99-8490-e392c758fa6e)\r\n\r\n-
Previously, users configured this value for Attack discovery via the
security assistant Knowledge base settings, as documented
[here](https://www.elastic.co/guide/en/security/8.15/attack-discovery.html#attack-discovery-generate-discoveries)\r\n-
The new settings modal uses local storage (instead of the
previously-shared assistant Knowledge base setting, which is stored in
Elasticsearch)\r\n\r\n#### Output refinement\r\n\r\n_Output refinement_
automatically combines related discoveries (that were previously
represented as two or more discoveries):\r\n\r\n
![default_attack_discovery_graph](https://github.com/user-attachments/assets/c092bb42-a41e-4fba-85c2-a4b2c1ef3053)\r\n\r\n-
The `refine` step in the graph diagram above may (for example), combine
three discoveries from the `generate` step into two discoveries when
they are related\r\n\r\n### Hallucination detection\r\n\r\nNew
_hallucination detection_ displays an error in lieu of showing
hallucinated
output:\r\n\r\n![hallucination_detection](https://github.com/user-attachments/assets/1d849908-3f10-4fe8-8741-c0cf418b1524)\r\n\r\n-
A new tour step was added to the Attack discovery page to share the
improvements:\r\n\r\n![tour_step](https://github.com/user-attachments/assets/0cedf770-baba-41b1-8ec6-b12b14c0c57a)\r\n\r\n###
Summary of improvements for developers\r\n\r\nThe following features
improve the developer experience when running evaluations for Attack
discovery:\r\n\r\n#### Replay alerts in evaluations\r\n\r\nThis
evaluation feature eliminates the need to populate a local environment
with alerts to (re)run evaluations:\r\n\r\n
![alerts_as_input](https://github.com/user-attachments/assets/b29dc847-3d53-4b17-8757-ed59852c1623)\r\n\r\nAlert
replay skips the `retrieve_anonymized_alerts` step in the graph, because
it uses the `anonymizedAlerts` and `replacements` provided as `Input` in
a dataset example. See _Replay alerts in evaluations details_ below for
more information.\r\n\r\n#### Override graph state\r\n\r\nOverride graph
state via datatset examples to test prompt improvements and edge cases
via evaluations:\r\n\r\n
![override_graph_input](https://github.com/user-attachments/assets/a685177b-1e07-4f49-9b8d-c0b652975237)\r\n\r\nTo
use this feature, add an `overrides` key to the `Input` of a dataset
example. See _Override graph state details_ below for more
information.\r\n\r\n#### New custom evaluator\r\n\r\nPrior to this PR,
an evaluator had to be manually added to each dataset in LangSmith to
use an LLM as the judge for correctness.\r\n\r\nThis PR introduces a
custom, programmatic evaluator that handles anonymization automatically,
and eliminates the need to manually create evaluators in LangSmith. To
use it, simply run evaluations from the `Evaluation` tab in
settings.\r\n\r\n#### New evaluation settings\r\n\r\nThis PR introduces
new settings in the `Evaluation`
tab:\r\n\r\n![new_evaluation_settings](https://github.com/user-attachments/assets/ca72aa2a-b0dc-4bec-9409-386d77d6a2f4)\r\n\r\nNew
evaluation settings:\r\n\r\n- `Evaluator model (optional)` - Judge the
quality of predictions using a single model. (Default: use the same
model as the connector)\r\n\r\nThis new setting is useful when you want
to use the same model, e.g. `GPT-4o` to judge the quality of all the
models evaluated in an experiment.\r\n\r\n- `Default max alerts` - The
default maximum number of alerts to send as context, which may be
overridden by the example input\r\n\r\nThis new setting is useful when
using the alerts in the local environment to run evaluations. Examples
that use the Alerts replay feature will ignore this value, because the
alerts in the example `Input` will be used instead.\r\n\r\n####
Directory structure refactoring\r\n\r\n- The server-side directory
structure was refactored to consolidate the location of Attack discovery
related files\r\n\r\n### Details\r\n\r\nThis section describes some of
the improvements above in detail.\r\n\r\n#### Output chunking
details\r\n\r\nThe new output chunking feature increases the maximum and
default number of alerts that may be sent as context. It achieves this
improvement by working around output token limitations.\r\n\r\nLLMs have
different limits for the number of tokens accepted as _input_ for
requests, and the number of tokens available for _output_ when
generating responses.\r\n\r\nToday, the output token limits of most
popular models are significantly smaller than the input token
limits.\r\n\r\nFor example, at the time of this writing, the Gemini 1.5
Pro model's limits are
([source](https://ai.google.dev/gemini-api/docs/models/gemini)):\r\n\r\n-
Input token limit: `2,097,152`\r\n- Output token limit:
`8,192`\r\n\r\nAs a result of this relatively smaller output token
limit, previous versions of Attack discovery would simply fail when an
LLM ran out of output tokens when generating a response. This often
happened \"mid sentence\", and resulted in errors or hallucinations
being displayed to users.\r\n\r\nThe new output chunking feature detects
incomplete responses from the LLM in the `generate` step of the Graph.
When an incomplete response is detected, the `generate` step will run
again with:\r\n\r\n- The original prompt\r\n- The Alerts provided as
context\r\n- The partially generated response\r\n- Instructions to
\"continue where you left off\"\r\n\r\nThe `generate` step in the graph
will run until one of the following conditions is met:\r\n\r\n- The
incomplete response can be successfully parsed\r\n- The maximum number
of generation attempts (default: `10`) is reached\r\n- The maximum
number of hallucinations detected (default: `5`) is reached\r\n\r\n####
Output refinement details\r\n\r\nThe new output refinement feature
automatically combines related discoveries (that were previously
represented as two or more discoveries).\r\n\r\nThe new `refine` step in
the graph re-submits the discoveries from the `generate` step with a
`refinePrompt` to combine related attack discoveries.\r\n\r\nThe
`refine` step is subject to the model's output token limits, just like
the `generate` step. That means a response to the refine prompt from the
LLM may be cut off \"mid\" sentence. To that end:\r\n\r\n- The refine
step will re-run until the (same, shared) `maxGenerationAttempts` and
`maxHallucinationFailures` limits as the `generate` step are
reached\r\n- The maximum number of attempts (default: `10`) is _shared_
with the `generate` step. For example, if it took `7` tries
(`generationAttempts`) to complete the `generate` step, the refine
`step` will only run up to `3` times.\r\n\r\nThe `refine` step will
return _unrefined_ results from the `generate` step when:\r\n\r\n- The
`generate` step uses all `10` generation attempts. When this happens,
the `refine` step will be skipped, and the unrefined output of the
`generate` step will be returned to the user\r\n- If the `refine` step
uses all remaining attempts, but fails to produce a refined response,
due to output token limitations, or hallucinations in the refined
response\r\n\r\n#### Hallucination detection details\r\n\r\nBefore this
PR, Attack discovery directly used lower level LangChain APIs to parse
responses from the LLM. After this PR, Attack discovery uses
LangGraph.\r\n\r\nIn the previous implementation, when Attack discovery
received an incomplete response because the output token limits of a
model were hit, the LangChain APIs automatically re-submitted the
incomplete response in an attempt to \"repair\" it. However, the
re-submitted results didn't include all of the original context (i.e.
alerts that generated them). The repair process often resulted in
hallucinated results being presented to users, especially with some
models i.e. `Claude 3.5 Haiku`.\r\n\r\nIn this PR, the `generate` and
`refine` steps detect (some) hallucinations. When hallucinations are
detected:\r\n\r\n- The current accumulated `generations` or
`refinements` are (respectively) discarded, effectively restarting the
`generate` or `refine` process\r\n- The `generate` and `refine` steps
will be retried until the maximum generation attempts (default: `10`) or
hallucinations detected (default: `5`) limits are reached\r\n\r\nHitting
the hallucination limit during the `generate` step will result in an
error being displayed to the user.\r\n\r\nHitting the hallucination
limit during the `refine` step will result in the unrefined discoveries
being displayed to the user.\r\n\r\n#### Replay alerts in evaluations
details\r\n\r\nAlerts replay makes it possible to re-run evaluations,
even when your local deployment has zero alerts.\r\n\r\nThis feature
eliminates the chore of populating your local instance with specific
alerts for each example.\r\n\r\nEvery example in a dataset may
(optionally) specify a different set of alerts.\r\n\r\nAlert replay
skips the `retrieve_anonymized_alerts` step in the graph, because it
uses the `anonymizedAlerts` and `replacements` provided as `Input` in a
dataset example.\r\n\r\nThe following instructions document the process
of creating a new LangSmith dataset example that uses the Alerts replay
feature:\r\n\r\n1) In Kibana, navigate to Security > Attack
discovery\r\n\r\n2) Click `Generate` to generate Attack
discoveries\r\n\r\n3) In LangSmith, navigate to Projects > _Your
project_\r\n\r\n4) In the `Runs` tab of the LangSmith project, click on
the latest `Attack discovery` entry to open the trace\r\n\r\n5)
**IMPORTANT**: In the trace, select the **LAST**
`ChannelWriteChannelWrite<attackDiscoveries,attackDisc...` entry. The
last entry will appear inside the **LAST** `refine` step in the trace,
as illustrated by the screenshot
below:\r\n\r\n![last_channel_write](https://github.com/user-attachments/assets/c57fc803-3bbb-4603-b99f-d2b130428201)\r\n\r\n6)
With the last `ChannelWriteChannelWrite<attackDiscoveries,attackDisc...`
entry selected, click `Add to` > `Add to Dataset`\r\n\r\n7) Copy-paste
the `Input` to the `Output`, because evaluation Experiments always
compare the current run with the `Output` in an example.\r\n\r\n- This
step is _always_ required to create a dataset.\r\n- If you don't want to
use the Alert replay feature, replace `Input` with an empty
object:\r\n\r\n```json\r\n{}\r\n```\r\n\r\n8) Choose an existing
dataset, or create a new one\r\n\r\n9) Click the `Submit` button to add
the example to the dataset.\r\n\r\nAfter completing the steps above, the
dataset is ready to be run in evaluations.\r\n\r\n#### Override graph
state details\r\n\r\nWhen a dataset is run in an evaluation (to create
Experiments):\r\n\r\n- The (optional) `anonymizedAlerts` and
`replacements` provided as `Input` in the example will be replayed,
bypassing the `retrieve_anonymized_alerts` step in the graph\r\n- The
rest of the properties in `Input` will not be used as inputs to the
graph\r\n- In contrast, an empty object `{}` in `Input` means the latest
and riskiest alerts in the last 24 hours in the local environment will
be queried\r\n\r\nIn addition to the above, you may add an optional
`overrides` key in the `Input` of a dataset example to test changes or
edge cases. This is useful for evaluating changes without updating the
code directly.\r\n\r\nThe `overrides` set the initial state of the graph
before it's run in an evaluation.\r\n\r\nThe example `Input` below
overrides the prompts used in the `generate` and `refine`
steps:\r\n\r\n```json\r\n{\r\n \"overrides\": {\r\n \"refinePrompt\":
\"This overrides the refine prompt\",\r\n \"attackDiscoveryPrompt\":
\"This overrides the attack discovery prompt\"\r\n
}\r\n}\r\n```\r\n\r\nTo use the `overrides` feature in evaluations to
set the initial state of the graph:\r\n\r\n1) Create a dataset example,
as documented in the _Replay alerts in evaluations details_ section
above\r\n\r\n2) In LangSmith, navigate to Datasets & Testing > _Your
Dataset_\r\n\r\n3) In the dataset, click the Examples tab\r\n\r\n4)
Click an example to open it in the flyout\r\n\r\n5) Click the `Edit`
button to edit the example\r\n\r\n6) Add the `overrides` key shown below
to the `Input` e.g.:\r\n\r\n```json\r\n{\r\n \"overrides\": {\r\n
\"refinePrompt\": \"This overrides the refine prompt\",\r\n
\"attackDiscoveryPrompt\": \"This overrides the attack discovery
prompt\"\r\n }\r\n}\r\n```\r\n\r\n7) Edit the `overrides` in the example
`Input` above to add (or remove) entries that will determine the initial
state of the graph.\r\n\r\nAll of the `overides` shown in step 6 are
optional. The `refinePrompt` and `attackDiscoveryPrompt` could be
removed from the `overrides` example above, and replaced with
`maxGenerationAttempts` to test a higher limit.\r\n\r\nAll valid graph
state may be specified in
`overrides`.","sha":"2c21adb8faafc0016ad7a6591837118f6bdf0907","branchLabelMapping":{"^v9.0.0$":"main","^v8.16.0$":"8.x","^v(\\d+).(\\d+).\\d+$":"$1.$2"}},"sourcePullRequest":{"labels":["release_note:enhancement","v9.0.0","Team:
SecuritySolution","ci:cloud-deploy","ci:cloud-persist-deployment","Team:Security
Generative AI","v8.16.0","backport:version"],"title":"[Security
Solution] [Attack discovery] Output chunking / refinement, LangGraph
migration, and evaluation
improvements","number":195669,"url":"https://github.com/elastic/kibana/pull/195669","mergeCommit":{"message":"[Security
Solution] [Attack discovery] Output chunking / refinement, LangGraph
migration, and evaluation improvements (#195669)\n\n## [Security
Solution] [Attack discovery] Output chunking / refinement, LangGraph
migration, and evaluation improvements\r\n\r\n### Summary\r\n\r\nThis PR
improves the Attack discovery user and developer experience with output
chunking / refinement, migration to LangGraph, and improvements to
evaluations.\r\n\r\nThe improvements were realized by transitioning from
directly using lower-level LangChain apis to LangGraph in this PR, and a
deeper integration with the evaluation features of
LangSmith.\r\n\r\n#### Output chunking\r\n\r\n_Output chunking_
increases the maximum and default number of alerts sent as context,
working around the output token limitations of popular large language
models (LLMs):\r\n\r\n| | Old | New
|\r\n|----------------|-------|-------|\r\n| max alerts | `100` | `500`
|\r\n| default alerts | `20` | `200` |\r\n\r\nSee _Output chunking
details_ below for more information.\r\n\r\n#### Settings\r\n\r\nA new
settings modal makes it possible to configure the number of alerts sent
as context directly from the Attack discovery
page:\r\n\r\n![settings](https://github.com/user-attachments/assets/3f5ab4e9-5eae-4f99-8490-e392c758fa6e)\r\n\r\n-
Previously, users configured this value for Attack discovery via the
security assistant Knowledge base settings, as documented
[here](https://www.elastic.co/guide/en/security/8.15/attack-discovery.html#attack-discovery-generate-discoveries)\r\n-
The new settings modal uses local storage (instead of the
previously-shared assistant Knowledge base setting, which is stored in
Elasticsearch)\r\n\r\n#### Output refinement\r\n\r\n_Output refinement_
automatically combines related discoveries (that were previously
represented as two or more discoveries):\r\n\r\n
![default_attack_discovery_graph](https://github.com/user-attachments/assets/c092bb42-a41e-4fba-85c2-a4b2c1ef3053)\r\n\r\n-
The `refine` step in the graph diagram above may (for example), combine
three discoveries from the `generate` step into two discoveries when
they are related\r\n\r\n### Hallucination detection\r\n\r\nNew
_hallucination detection_ displays an error in lieu of showing
hallucinated
output:\r\n\r\n![hallucination_detection](https://github.com/user-attachments/assets/1d849908-3f10-4fe8-8741-c0cf418b1524)\r\n\r\n-
A new tour step was added to the Attack discovery page to share the
improvements:\r\n\r\n![tour_step](https://github.com/user-attachments/assets/0cedf770-baba-41b1-8ec6-b12b14c0c57a)\r\n\r\n###
Summary of improvements for developers\r\n\r\nThe following features
improve the developer experience when running evaluations for Attack
discovery:\r\n\r\n#### Replay alerts in evaluations\r\n\r\nThis
evaluation feature eliminates the need to populate a local environment
with alerts to (re)run evaluations:\r\n\r\n
![alerts_as_input](https://github.com/user-attachments/assets/b29dc847-3d53-4b17-8757-ed59852c1623)\r\n\r\nAlert
replay skips the `retrieve_anonymized_alerts` step in the graph, because
it uses the `anonymizedAlerts` and `replacements` provided as `Input` in
a dataset example. See _Replay alerts in evaluations details_ below for
more information.\r\n\r\n#### Override graph state\r\n\r\nOverride graph
state via datatset examples to test prompt improvements and edge cases
via evaluations:\r\n\r\n
![override_graph_input](https://github.com/user-attachments/assets/a685177b-1e07-4f49-9b8d-c0b652975237)\r\n\r\nTo
use this feature, add an `overrides` key to the `Input` of a dataset
example. See _Override graph state details_ below for more
information.\r\n\r\n#### New custom evaluator\r\n\r\nPrior to this PR,
an evaluator had to be manually added to each dataset in LangSmith to
use an LLM as the judge for correctness.\r\n\r\nThis PR introduces a
custom, programmatic evaluator that handles anonymization automatically,
and eliminates the need to manually create evaluators in LangSmith. To
use it, simply run evaluations from the `Evaluation` tab in
settings.\r\n\r\n#### New evaluation settings\r\n\r\nThis PR introduces
new settings in the `Evaluation`
tab:\r\n\r\n![new_evaluation_settings](https://github.com/user-attachments/assets/ca72aa2a-b0dc-4bec-9409-386d77d6a2f4)\r\n\r\nNew
evaluation settings:\r\n\r\n- `Evaluator model (optional)` - Judge the
quality of predictions using a single model. (Default: use the same
model as the connector)\r\n\r\nThis new setting is useful when you want
to use the same model, e.g. `GPT-4o` to judge the quality of all the
models evaluated in an experiment.\r\n\r\n- `Default max alerts` - The
default maximum number of alerts to send as context, which may be
overridden by the example input\r\n\r\nThis new setting is useful when
using the alerts in the local environment to run evaluations. Examples
that use the Alerts replay feature will ignore this value, because the
alerts in the example `Input` will be used instead.\r\n\r\n####
Directory structure refactoring\r\n\r\n- The server-side directory
structure was refactored to consolidate the location of Attack discovery
related files\r\n\r\n### Details\r\n\r\nThis section describes some of
the improvements above in detail.\r\n\r\n#### Output chunking
details\r\n\r\nThe new output chunking feature increases the maximum and
default number of alerts that may be sent as context. It achieves this
improvement by working around output token limitations.\r\n\r\nLLMs have
different limits for the number of tokens accepted as _input_ for
requests, and the number of tokens available for _output_ when
generating responses.\r\n\r\nToday, the output token limits of most
popular models are significantly smaller than the input token
limits.\r\n\r\nFor example, at the time of this writing, the Gemini 1.5
Pro model's limits are
([source](https://ai.google.dev/gemini-api/docs/models/gemini)):\r\n\r\n-
Input token limit: `2,097,152`\r\n- Output token limit:
`8,192`\r\n\r\nAs a result of this relatively smaller output token
limit, previous versions of Attack discovery would simply fail when an
LLM ran out of output tokens when generating a response. This often
happened \"mid sentence\", and resulted in errors or hallucinations
being displayed to users.\r\n\r\nThe new output chunking feature detects
incomplete responses from the LLM in the `generate` step of the Graph.
When an incomplete response is detected, the `generate` step will run
again with:\r\n\r\n- The original prompt\r\n- The Alerts provided as
context\r\n- The partially generated response\r\n- Instructions to
\"continue where you left off\"\r\n\r\nThe `generate` step in the graph
will run until one of the following conditions is met:\r\n\r\n- The
incomplete response can be successfully parsed\r\n- The maximum number
of generation attempts (default: `10`) is reached\r\n- The maximum
number of hallucinations detected (default: `5`) is reached\r\n\r\n####
Output refinement details\r\n\r\nThe new output refinement feature
automatically combines related discoveries (that were previously
represented as two or more discoveries).\r\n\r\nThe new `refine` step in
the graph re-submits the discoveries from the `generate` step with a
`refinePrompt` to combine related attack discoveries.\r\n\r\nThe
`refine` step is subject to the model's output token limits, just like
the `generate` step. That means a response to the refine prompt from the
LLM may be cut off \"mid\" sentence. To that end:\r\n\r\n- The refine
step will re-run until the (same, shared) `maxGenerationAttempts` and
`maxHallucinationFailures` limits as the `generate` step are
reached\r\n- The maximum number of attempts (default: `10`) is _shared_
with the `generate` step. For example, if it took `7` tries
(`generationAttempts`) to complete the `generate` step, the refine
`step` will only run up to `3` times.\r\n\r\nThe `refine` step will
return _unrefined_ results from the `generate` step when:\r\n\r\n- The
`generate` step uses all `10` generation attempts. When this happens,
the `refine` step will be skipped, and the unrefined output of the
`generate` step will be returned to the user\r\n- If the `refine` step
uses all remaining attempts, but fails to produce a refined response,
due to output token limitations, or hallucinations in the refined
response\r\n\r\n#### Hallucination detection details\r\n\r\nBefore this
PR, Attack discovery directly used lower level LangChain APIs to parse
responses from the LLM. After this PR, Attack discovery uses
LangGraph.\r\n\r\nIn the previous implementation, when Attack discovery
received an incomplete response because the output token limits of a
model were hit, the LangChain APIs automatically re-submitted the
incomplete response in an attempt to \"repair\" it. However, the
re-submitted results didn't include all of the original context (i.e.
alerts that generated them). The repair process often resulted in
hallucinated results being presented to users, especially with some
models i.e. `Claude 3.5 Haiku`.\r\n\r\nIn this PR, the `generate` and
`refine` steps detect (some) hallucinations. When hallucinations are
detected:\r\n\r\n- The current accumulated `generations` or
`refinements` are (respectively) discarded, effectively restarting the
`generate` or `refine` process\r\n- The `generate` and `refine` steps
will be retried until the maximum generation attempts (default: `10`) or
hallucinations detected (default: `5`) limits are reached\r\n\r\nHitting
the hallucination limit during the `generate` step will result in an
error being displayed to the user.\r\n\r\nHitting the hallucination
limit during the `refine` step will result in the unrefined discoveries
being displayed to the user.\r\n\r\n#### Replay alerts in evaluations
details\r\n\r\nAlerts replay makes it possible to re-run evaluations,
even when your local deployment has zero alerts.\r\n\r\nThis feature
eliminates the chore of populating your local instance with specific
alerts for each example.\r\n\r\nEvery example in a dataset may
(optionally) specify a different set of alerts.\r\n\r\nAlert replay
skips the `retrieve_anonymized_alerts` step in the graph, because it
uses the `anonymizedAlerts` and `replacements` provided as `Input` in a
dataset example.\r\n\r\nThe following instructions document the process
of creating a new LangSmith dataset example that uses the Alerts replay
feature:\r\n\r\n1) In Kibana, navigate to Security > Attack
discovery\r\n\r\n2) Click `Generate` to generate Attack
discoveries\r\n\r\n3) In LangSmith, navigate to Projects > _Your
project_\r\n\r\n4) In the `Runs` tab of the LangSmith project, click on
the latest `Attack discovery` entry to open the trace\r\n\r\n5)
**IMPORTANT**: In the trace, select the **LAST**
`ChannelWriteChannelWrite<attackDiscoveries,attackDisc...` entry. The
last entry will appear inside the **LAST** `refine` step in the trace,
as illustrated by the screenshot
below:\r\n\r\n![last_channel_write](https://github.com/user-attachments/assets/c57fc803-3bbb-4603-b99f-d2b130428201)\r\n\r\n6)
With the last `ChannelWriteChannelWrite<attackDiscoveries,attackDisc...`
entry selected, click `Add to` > `Add to Dataset`\r\n\r\n7) Copy-paste
the `Input` to the `Output`, because evaluation Experiments always
compare the current run with the `Output` in an example.\r\n\r\n- This
step is _always_ required to create a dataset.\r\n- If you don't want to
use the Alert replay feature, replace `Input` with an empty
object:\r\n\r\n```json\r\n{}\r\n```\r\n\r\n8) Choose an existing
dataset, or create a new one\r\n\r\n9) Click the `Submit` button to add
the example to the dataset.\r\n\r\nAfter completing the steps above, the
dataset is ready to be run in evaluations.\r\n\r\n#### Override graph
state details\r\n\r\nWhen a dataset is run in an evaluation (to create
Experiments):\r\n\r\n- The (optional) `anonymizedAlerts` and
`replacements` provided as `Input` in the example will be replayed,
bypassing the `retrieve_anonymized_alerts` step in the graph\r\n- The
rest of the properties in `Input` will not be used as inputs to the
graph\r\n- In contrast, an empty object `{}` in `Input` means the latest
and riskiest alerts in the last 24 hours in the local environment will
be queried\r\n\r\nIn addition to the above, you may add an optional
`overrides` key in the `Input` of a dataset example to test changes or
edge cases. This is useful for evaluating changes without updating the
code directly.\r\n\r\nThe `overrides` set the initial state of the graph
before it's run in an evaluation.\r\n\r\nThe example `Input` below
overrides the prompts used in the `generate` and `refine`
steps:\r\n\r\n```json\r\n{\r\n \"overrides\": {\r\n \"refinePrompt\":
\"This overrides the refine prompt\",\r\n \"attackDiscoveryPrompt\":
\"This overrides the attack discovery prompt\"\r\n
}\r\n}\r\n```\r\n\r\nTo use the `overrides` feature in evaluations to
set the initial state of the graph:\r\n\r\n1) Create a dataset example,
as documented in the _Replay alerts in evaluations details_ section
above\r\n\r\n2) In LangSmith, navigate to Datasets & Testing > _Your
Dataset_\r\n\r\n3) In the dataset, click the Examples tab\r\n\r\n4)
Click an example to open it in the flyout\r\n\r\n5) Click the `Edit`
button to edit the example\r\n\r\n6) Add the `overrides` key shown below
to the `Input` e.g.:\r\n\r\n```json\r\n{\r\n \"overrides\": {\r\n
\"refinePrompt\": \"This overrides the refine prompt\",\r\n
\"attackDiscoveryPrompt\": \"This overrides the attack discovery
prompt\"\r\n }\r\n}\r\n```\r\n\r\n7) Edit the `overrides` in the example
`Input` above to add (or remove) entries that will determine the initial
state of the graph.\r\n\r\nAll of the `overides` shown in step 6 are
optional. The `refinePrompt` and `attackDiscoveryPrompt` could be
removed from the `overrides` example above, and replaced with
`maxGenerationAttempts` to test a higher limit.\r\n\r\nAll valid graph
state may be specified in
`overrides`.","sha":"2c21adb8faafc0016ad7a6591837118f6bdf0907"}},"sourceBranch":"main","suggestedTargetBranches":["8.x"],"targetPullRequestStates":[{"branch":"main","label":"v9.0.0","branchLabelMappingKey":"^v9.0.0$","isSourceBranch":true,"state":"MERGED","url":"https://github.com/elastic/kibana/pull/195669","number":195669,"mergeCommit":{"message":"[Security
Solution] [Attack discovery] Output chunking / refinement, LangGraph
migration, and evaluation improvements (#195669)\n\n## [Security
Solution] [Attack discovery] Output chunking / refinement, LangGraph
migration, and evaluation improvements\r\n\r\n### Summary\r\n\r\nThis PR
improves the Attack discovery user and developer experience with output
chunking / refinement, migration to LangGraph, and improvements to
evaluations.\r\n\r\nThe improvements were realized by transitioning from
directly using lower-level LangChain apis to LangGraph in this PR, and a
deeper integration with the evaluation features of
LangSmith.\r\n\r\n#### Output chunking\r\n\r\n_Output chunking_
increases the maximum and default number of alerts sent as context,
working around the output token limitations of popular large language
models (LLMs):\r\n\r\n| | Old | New
|\r\n|----------------|-------|-------|\r\n| max alerts | `100` | `500`
|\r\n| default alerts | `20` | `200` |\r\n\r\nSee _Output chunking
details_ below for more information.\r\n\r\n#### Settings\r\n\r\nA new
settings modal makes it possible to configure the number of alerts sent
as context directly from the Attack discovery
page:\r\n\r\n![settings](https://github.com/user-attachments/assets/3f5ab4e9-5eae-4f99-8490-e392c758fa6e)\r\n\r\n-
Previously, users configured this value for Attack discovery via the
security assistant Knowledge base settings, as documented
[here](https://www.elastic.co/guide/en/security/8.15/attack-discovery.html#attack-discovery-generate-discoveries)\r\n-
The new settings modal uses local storage (instead of the
previously-shared assistant Knowledge base setting, which is stored in
Elasticsearch)\r\n\r\n#### Output refinement\r\n\r\n_Output refinement_
automatically combines related discoveries (that were previously
represented as two or more discoveries):\r\n\r\n
![default_attack_discovery_graph](https://github.com/user-attachments/assets/c092bb42-a41e-4fba-85c2-a4b2c1ef3053)\r\n\r\n-
The `refine` step in the graph diagram above may (for example), combine
three discoveries from the `generate` step into two discoveries when
they are related\r\n\r\n### Hallucination detection\r\n\r\nNew
_hallucination detection_ displays an error in lieu of showing
hallucinated
output:\r\n\r\n![hallucination_detection](https://github.com/user-attachments/assets/1d849908-3f10-4fe8-8741-c0cf418b1524)\r\n\r\n-
A new tour step was added to the Attack discovery page to share the
improvements:\r\n\r\n![tour_step](https://github.com/user-attachments/assets/0cedf770-baba-41b1-8ec6-b12b14c0c57a)\r\n\r\n###
Summary of improvements for developers\r\n\r\nThe following features
improve the developer experience when running evaluations for Attack
discovery:\r\n\r\n#### Replay alerts in evaluations\r\n\r\nThis
evaluation feature eliminates the need to populate a local environment
with alerts to (re)run evaluations:\r\n\r\n
![alerts_as_input](https://github.com/user-attachments/assets/b29dc847-3d53-4b17-8757-ed59852c1623)\r\n\r\nAlert
replay skips the `retrieve_anonymized_alerts` step in the graph, because
it uses the `anonymizedAlerts` and `replacements` provided as `Input` in
a dataset example. See _Replay alerts in evaluations details_ below for
more information.\r\n\r\n#### Override graph state\r\n\r\nOverride graph
state via datatset examples to test prompt improvements and edge cases
via evaluations:\r\n\r\n
![override_graph_input](https://github.com/user-attachments/assets/a685177b-1e07-4f49-9b8d-c0b652975237)\r\n\r\nTo
use this feature, add an `overrides` key to the `Input` of a dataset
example. See _Override graph state details_ below for more
information.\r\n\r\n#### New custom evaluator\r\n\r\nPrior to this PR,
an evaluator had to be manually added to each dataset in LangSmith to
use an LLM as the judge for correctness.\r\n\r\nThis PR introduces a
custom, programmatic evaluator that handles anonymization automatically,
and eliminates the need to manually create evaluators in LangSmith. To
use it, simply run evaluations from the `Evaluation` tab in
settings.\r\n\r\n#### New evaluation settings\r\n\r\nThis PR introduces
new settings in the `Evaluation`
tab:\r\n\r\n![new_evaluation_settings](https://github.com/user-attachments/assets/ca72aa2a-b0dc-4bec-9409-386d77d6a2f4)\r\n\r\nNew
evaluation settings:\r\n\r\n- `Evaluator model (optional)` - Judge the
quality of predictions using a single model. (Default: use the same
model as the connector)\r\n\r\nThis new setting is useful when you want
to use the same model, e.g. `GPT-4o` to judge the quality of all the
models evaluated in an experiment.\r\n\r\n- `Default max alerts` - The
default maximum number of alerts to send as context, which may be
overridden by the example input\r\n\r\nThis new setting is useful when
using the alerts in the local environment to run evaluations. Examples
that use the Alerts replay feature will ignore this value, because the
alerts in the example `Input` will be used instead.\r\n\r\n####
Directory structure refactoring\r\n\r\n- The server-side directory
structure was refactored to consolidate the location of Attack discovery
related files\r\n\r\n### Details\r\n\r\nThis section describes some of
the improvements above in detail.\r\n\r\n#### Output chunking
details\r\n\r\nThe new output chunking feature increases the maximum and
default number of alerts that may be sent as context. It achieves this
improvement by working around output token limitations.\r\n\r\nLLMs have
different limits for the number of tokens accepted as _input_ for
requests, and the number of tokens available for _output_ when
generating responses.\r\n\r\nToday, the output token limits of most
popular models are significantly smaller than the input token
limits.\r\n\r\nFor example, at the time of this writing, the Gemini 1.5
Pro model's limits are
([source](https://ai.google.dev/gemini-api/docs/models/gemini)):\r\n\r\n-
Input token limit: `2,097,152`\r\n- Output token limit:
`8,192`\r\n\r\nAs a result of this relatively smaller output token
limit, previous versions of Attack discovery would simply fail when an
LLM ran out of output tokens when generating a response. This often
happened \"mid sentence\", and resulted in errors or hallucinations
being displayed to users.\r\n\r\nThe new output chunking feature detects
incomplete responses from the LLM in the `generate` step of the Graph.
When an incomplete response is detected, the `generate` step will run
again with:\r\n\r\n- The original prompt\r\n- The Alerts provided as
context\r\n- The partially generated response\r\n- Instructions to
\"continue where you left off\"\r\n\r\nThe `generate` step in the graph
will run until one of the following conditions is met:\r\n\r\n- The
incomplete response can be successfully parsed\r\n- The maximum number
of generation attempts (default: `10`) is reached\r\n- The maximum
number of hallucinations detected (default: `5`) is reached\r\n\r\n####
Output refinement details\r\n\r\nThe new output refinement feature
automatically combines related discoveries (that were previously
represented as two or more discoveries).\r\n\r\nThe new `refine` step in
the graph re-submits the discoveries from the `generate` step with a
`refinePrompt` to combine related attack discoveries.\r\n\r\nThe
`refine` step is subject to the model's output token limits, just like
the `generate` step. That means a response to the refine prompt from the
LLM may be cut off \"mid\" sentence. To that end:\r\n\r\n- The refine
step will re-run until the (same, shared) `maxGenerationAttempts` and
`maxHallucinationFailures` limits as the `generate` step are
reached\r\n- The maximum number of attempts (default: `10`) is _shared_
with the `generate` step. For example, if it took `7` tries
(`generationAttempts`) to complete the `generate` step, the refine
`step` will only run up to `3` times.\r\n\r\nThe `refine` step will
return _unrefined_ results from the `generate` step when:\r\n\r\n- The
`generate` step uses all `10` generation attempts. When this happens,
the `refine` step will be skipped, and the unrefined output of the
`generate` step will be returned to the user\r\n- If the `refine` step
uses all remaining attempts, but fails to produce a refined response,
due to output token limitations, or hallucinations in the refined
response\r\n\r\n#### Hallucination detection details\r\n\r\nBefore this
PR, Attack discovery directly used lower level LangChain APIs to parse
responses from the LLM. After this PR, Attack discovery uses
LangGraph.\r\n\r\nIn the previous implementation, when Attack discovery
received an incomplete response because the output token limits of a
model were hit, the LangChain APIs automatically re-submitted the
incomplete response in an attempt to \"repair\" it. However, the
re-submitted results didn't include all of the original context (i.e.
alerts that generated them). The repair process often resulted in
hallucinated results being presented to users, especially with some
models i.e. `Claude 3.5 Haiku`.\r\n\r\nIn this PR, the `generate` and
`refine` steps detect (some) hallucinations. When hallucinations are
detected:\r\n\r\n- The current accumulated `generations` or
`refinements` are (respectively) discarded, effectively restarting the
`generate` or `refine` process\r\n- The `generate` and `refine` steps
will be retried until the maximum generation attempts (default: `10`) or
hallucinations detected (default: `5`) limits are reached\r\n\r\nHitting
the hallucination limit during the `generate` step will result in an
error being displayed to the user.\r\n\r\nHitting the hallucination
limit during the `refine` step will result in the unrefined discoveries
being displayed to the user.\r\n\r\n#### Replay alerts in evaluations
details\r\n\r\nAlerts replay makes it possible to re-run evaluations,
even when your local deployment has zero alerts.\r\n\r\nThis feature
eliminates the chore of populating your local instance with specific
alerts for each example.\r\n\r\nEvery example in a dataset may
(optionally) specify a different set of alerts.\r\n\r\nAlert replay
skips the `retrieve_anonymized_alerts` step in the graph, because it
uses the `anonymizedAlerts` and `replacements` provided as `Input` in a
dataset example.\r\n\r\nThe following instructions document the process
of creating a new LangSmith dataset example that uses the Alerts replay
feature:\r\n\r\n1) In Kibana, navigate to Security > Attack
discovery\r\n\r\n2) Click `Generate` to generate Attack
discoveries\r\n\r\n3) In LangSmith, navigate to Projects > _Your
project_\r\n\r\n4) In the `Runs` tab of the LangSmith project, click on
the latest `Attack discovery` entry to open the trace\r\n\r\n5)
**IMPORTANT**: In the trace, select the **LAST**
`ChannelWriteChannelWrite<attackDiscoveries,attackDisc...` entry. The
last entry will appear inside the **LAST** `refine` step in the trace,
as illustrated by the screenshot
below:\r\n\r\n![last_channel_write](https://github.com/user-attachments/assets/c57fc803-3bbb-4603-b99f-d2b130428201)\r\n\r\n6)
With the last `ChannelWriteChannelWrite<attackDiscoveries,attackDisc...`
entry selected, click `Add to` > `Add to Dataset`\r\n\r\n7) Copy-paste
the `Input` to the `Output`, because evaluation Experiments always
compare the current run with the `Output` in an example.\r\n\r\n- This
step is _always_ required to create a dataset.\r\n- If you don't want to
use the Alert replay feature, replace `Input` with an empty
object:\r\n\r\n```json\r\n{}\r\n```\r\n\r\n8) Choose an existing
dataset, or create a new one\r\n\r\n9) Click the `Submit` button to add
the example to the dataset.\r\n\r\nAfter completing the steps above, the
dataset is ready to be run in evaluations.\r\n\r\n#### Override graph
state details\r\n\r\nWhen a dataset is run in an evaluation (to create
Experiments):\r\n\r\n- The (optional) `anonymizedAlerts` and
`replacements` provided as `Input` in the example will be replayed,
bypassing the `retrieve_anonymized_alerts` step in the graph\r\n- The
rest of the properties in `Input` will not be used as inputs to the
graph\r\n- In contrast, an empty object `{}` in `Input` means the latest
and riskiest alerts in the last 24 hours in the local environment will
be queried\r\n\r\nIn addition to the above, you may add an optional
`overrides` key in the `Input` of a dataset example to test changes or
edge cases. This is useful for evaluating changes without updating the
code directly.\r\n\r\nThe `overrides` set the initial state of the graph
before it's run in an evaluation.\r\n\r\nThe example `Input` below
overrides the prompts used in the `generate` and `refine`
steps:\r\n\r\n```json\r\n{\r\n \"overrides\": {\r\n \"refinePrompt\":
\"This overrides the refine prompt\",\r\n \"attackDiscoveryPrompt\":
\"This overrides the attack discovery prompt\"\r\n
}\r\n}\r\n```\r\n\r\nTo use the `overrides` feature in evaluations to
set the initial state of the graph:\r\n\r\n1) Create a dataset example,
as documented in the _Replay alerts in evaluations details_ section
above\r\n\r\n2) In LangSmith, navigate to Datasets & Testing > _Your
Dataset_\r\n\r\n3) In the dataset, click the Examples tab\r\n\r\n4)
Click an example to open it in the flyout\r\n\r\n5) Click the `Edit`
button to edit the example\r\n\r\n6) Add the `overrides` key shown below
to the `Input` e.g.:\r\n\r\n```json\r\n{\r\n \"overrides\": {\r\n
\"refinePrompt\": \"This overrides the refine prompt\",\r\n
\"attackDiscoveryPrompt\": \"This overrides the attack discovery
prompt\"\r\n }\r\n}\r\n```\r\n\r\n7) Edit the `overrides` in the example
`Input` above to add (or remove) entries that will determine the initial
state of the graph.\r\n\r\nAll of the `overides` shown in step 6 are
optional. The `refinePrompt` and `attackDiscoveryPrompt` could be
removed from the `overrides` example above, and replaced with
`maxGenerationAttempts` to test a higher limit.\r\n\r\nAll valid graph
state may be specified in
`overrides`.","sha":"2c21adb8faafc0016ad7a6591837118f6bdf0907"}},{"branch":"8.x","label":"v8.16.0","branchLabelMappingKey":"^v8.16.0$","isSourceBranch":false,"state":"NOT_CREATED"}]}]
BACKPORT-->

Co-authored-by: Andrew Macri <andrew.macri@elastic.co>
jbudz added a commit that referenced this pull request Oct 15, 2024
…nement, LangGraph migration, and evaluation improvements (#195669)"

This reverts commit 2c21adb.
@jbudz jbudz added the reverted label Oct 15, 2024
@jbudz
Member

jbudz commented Oct 15, 2024

9.0 was reverted with dbe6d82.
https://buildkite.com/elastic/kibana-on-merge/builds/52349

I'm monitoring 8.x, will only revert if required.

@andrew-goldstein
Contributor Author

Before merging, I asked the ops team in this thread if this PR was safe to merge, because the build state was out of sync with the build comments.

Out-of-sync build comments are a known build issue, so I got an explicit OK ✅ in that thread that this PR was safe to merge. However, another PR that added a new lint rule merged first:
#195456

The new lint rule requires all imports from react-use to change from:

import { useLocalStorage } from 'react-use';

to:

import useLocalStorage from 'react-use/lib/useLocalStorage';

As a result, this PR started failing CI, but only after it merged to main.
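
For context, import restrictions like this are often enforced with ESLint's built-in `no-restricted-imports` rule. The snippet below is only a hypothetical sketch of such a configuration; the actual rule added in #195456 may be implemented differently (for example, as a custom Kibana ESLint rule).

```js
// .eslintrc.js — hypothetical sketch only; the real rule from #195456 may differ.
module.exports = {
  rules: {
    'no-restricted-imports': [
      'error',
      {
        paths: [
          {
            // Disallow top-level imports from 'react-use' so that consumers
            // import individual hooks, e.g. 'react-use/lib/useLocalStorage'.
            name: 'react-use',
            message:
              'Import hooks from react-use/lib/<hook> instead of the package root.',
          },
        ],
      },
    ],
  },
};
```

The usual motivation for deep imports like `react-use/lib/useLocalStorage` is bundle size: only the referenced hook is pulled in rather than the whole package.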

jbudz added a commit to jbudz/kibana that referenced this pull request Oct 15, 2024
…inement, LangGraph migration, and evaluation improvements (elastic#195669)"

This reverts commit dbe6d82.
jbudz added a commit that referenced this pull request Oct 16, 2024
…inement, LangGraph migration, and evaluation improvements (#195669)" (#196440)

#195669 + #196381

This reverts commit dbe6d82.

---------

Co-authored-by: Alex Szabo <alex.szabo@elastic.co>
@jbudz
Member

jbudz commented Oct 16, 2024

The change has been re-applied in #196440 / ad2ac71. Dropping the reverted label.

@jbudz jbudz removed the reverted label Oct 16, 2024
@andrew-goldstein
Contributor Author

Thank you @jbudz!

andrew-goldstein added a commit to andrew-goldstein/kibana that referenced this pull request Oct 18, 2024
…covery max alerts for users still using legacy models

In consideration of users still using legacy models (e.g. GPT-4 instead of GPT-4o), this PR updates `DEFAULT_ATTACK_DISCOVERY_MAX_ALERTS` from its previous value `200` in <elastic#195669> to `100`.

This PR also includes additional tests.

## Desk testing

1) Navigate to Security > Attack discovery

2) Click the settings gear

3) Select any value above or below `100` in the Alerts range slider

4) Click `Reset`

**Expected result**

- The range slider resets to `100`
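
As a rough sketch of the change described in the commit message above (the module path and comments are assumptions; only the constant name and the `200` → `100` values come from the text):

```ts
// Hypothetical module path — the real constant lives somewhere in the
// Security Solution / Elastic AI Assistant packages.
// #195669 introduced DEFAULT_ATTACK_DISCOVERY_MAX_ALERTS = 200; #196939
// lowers it to 100 so the default also works with legacy models that have
// smaller output token limits.
export const DEFAULT_ATTACK_DISCOVERY_MAX_ALERTS = 100;
```

The `Reset` button in the Attack discovery settings modal returns the Alerts range slider to this default, which is why the desk-testing steps above expect the slider to reset to `100`.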
andrew-goldstein added a commit that referenced this pull request Oct 18, 2024
…ry max alerts for users still using legacy models (#196939)

kibanamachine pushed a commit to kibanamachine/kibana that referenced this pull request Oct 18, 2024
…ry max alerts for users still using legacy models (elastic#196939)

(cherry picked from commit 96585a5)
kibanamachine added a commit that referenced this pull request Oct 18, 2024
…discovery max alerts for users still using legacy models (#196939) (#196959)

# Backport

This will backport the following commits from `main` to `8.16`:
- [[Security Solution] [Attack discovery] Updates default Attack
discovery max alerts for users still using legacy models
(#196939)](#196939)


### Questions ?
Please refer to the [Backport tool
documentation](https://github.com/sqren/backport)


Co-authored-by: Andrew Macri <andrew.macri@elastic.co>
Labels
backport:version Backport to applied version labels ci:cloud-deploy Create or update a Cloud deployment ci:cloud-persist-deployment Persist cloud deployment indefinitely release_note:enhancement Team:Security Generative AI Security Generative AI Team: SecuritySolution Security Solutions Team working on SIEM, Endpoint, Timeline, Resolver, etc. v8.16.0 v9.0.0