Evaluation Mode

RedFlag can be run in evaluation mode to measure the performance of the AI model against your own custom dataset. This mode is useful for understanding how the model and prompts perform on your codebase and aids in evaluating security risk.

Workflow

(Diagram: evaluation workflow)

Run an Example Dataset

A sample dataset is included in this repository as docs/example_evaluation_dataset.json.

To run an evaluation against this dataset, first configure your credentials and then use the eval command:

redflag eval --dataset docs/example_evaluation_dataset.json

RedFlag will then output each result as it is evaluated.

(Screenshot: detailed output for each evaluated test case)

After working through all test cases, RedFlag will provide two summary tables.

(Screenshot: evaluation summary tables)

Building a Custom Dataset

The evaluation dataset is a JSON file containing a list of objects that match the following schema:

Key | Type | Description
--- | ---- | -----------
repository | string | The GitHub repository to retrieve the commit from.
commit | string | The commit SHA to evaluate.
should_review | boolean | Whether or not the commit should be flagged by RedFlag.
reference | string | Reasoning passed to the LLM that grades RedFlag's response.

Here's what this schema looks like in practice:

[
    {
        "repository": "jquery/jquery",
        "commit": "0293d3e30dd68bfe92be1d6d29f9b9200d1ae917",
        "should_review": true,
        "reference": "This PR introduces a new workflow that is making use of environment variables and secrets."
    },
    {
        "repository": "jquery/jquery",
        "commit": "bde53edcf4bd6c975d068eed4eb16c5ba09c1cff",
        "should_review": false,
        "reference": "It updates only a single test, no functional code."
    }
]
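
Once your dataset grows beyond a handful of entries, it can help to sanity-check the file before starting a run. The script below is a minimal standalone sketch, not part of RedFlag (the file name validate_dataset.py is just an example), that checks each entry for the four required keys and their expected types using only the Python standard library:

import json
import sys

# Required keys and their expected types for each dataset entry.
EXPECTED_TYPES = {
    "repository": str,
    "commit": str,
    "should_review": bool,
    "reference": str,
}

def validate_dataset(path):
    """Exit with an error message if the dataset does not match the schema."""
    with open(path) as f:
        entries = json.load(f)

    if not isinstance(entries, list):
        sys.exit("Dataset must be a JSON list of objects.")

    for i, entry in enumerate(entries):
        for key, expected_type in EXPECTED_TYPES.items():
            if key not in entry:
                sys.exit(f"Entry {i} is missing required key '{key}'.")
            if not isinstance(entry[key], expected_type):
                sys.exit(f"Entry {i}: '{key}' should be of type {expected_type.__name__}.")

    print(f"{len(entries)} entries look valid.")

if __name__ == "__main__":
    validate_dataset(sys.argv[1])

Run it with the path to your dataset, for example: python validate_dataset.py my_dataset.json (the dataset file name here is a placeholder).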

Once you've created a dataset tailored to your repository, you can fine-tune the prompts to more accurately reflect the specific code risks and evaluation criteria unique to your organization.
