Evaluation Mode

RedFlag can be run in evaluation mode to measure the performance of the AI model against your own custom dataset. This mode is useful for understanding how the model and prompts perform on your codebase and aids in evaluating security risk.

Workflow

(Diagram: evaluation workflow)

Run an Example Dataset

A sample dataset is included in this repository as docs/example_evaluation_dataset.json.

To run an evaluation against this dataset, first configure your credentials and then use the eval command:

redflag eval --dataset docs/example_evaluation_dataset.json

RedFlag will then output each result as it is evaluated.

(Screenshot: detailed output for each evaluated test case)

After working through all test cases, RedFlag will provide two summary tables.

(Screenshot: evaluation summary tables)

Building a Custom Dataset

The evaluation dataset is a JSON file containing a list of objects that match the following schema:

Key | Type | Description
--- | ---- | -----------
repository | string | The GitHub repository to retrieve the commit from.
commit | string | The commit SHA to evaluate.
should_review | boolean | Whether or not the commit should be flagged by RedFlag.
reference | string | Reasoning passed to the LLM that grades RedFlag's response.

Here's what this schema looks like in practice:

[
    {
        "repository": "jquery/jquery",
        "commit": "0293d3e30dd68bfe92be1d6d29f9b9200d1ae917",
        "should_review": true,
        "reference": "This PR introduces a new workflow that is making use of environment variables and secrets."
    },
    {
        "repository": "jquery/jquery",
        "commit": "bde53edcf4bd6c975d068eed4eb16c5ba09c1cff",
        "should_review": false,
        "reference": "It updates only a single test, no functional code."
    }
]
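
Once your dataset grows beyond a handful of entries, it can help to sanity-check the file before starting a run. The script below is a minimal standalone sketch, not part of RedFlag (the file name validate_dataset.py is just an example), that checks each entry for the four required keys and their expected types using only the Python standard library:

import json
import sys

# Required keys and their expected types for each dataset entry.
EXPECTED_TYPES = {
    "repository": str,
    "commit": str,
    "should_review": bool,
    "reference": str,
}

def validate_dataset(path):
    """Exit with an error message if the dataset does not match the schema."""
    with open(path) as f:
        entries = json.load(f)

    if not isinstance(entries, list):
        sys.exit("Dataset must be a JSON list of objects.")

    for i, entry in enumerate(entries):
        for key, expected_type in EXPECTED_TYPES.items():
            if key not in entry:
                sys.exit(f"Entry {i} is missing required key '{key}'.")
            if not isinstance(entry[key], expected_type):
                sys.exit(f"Entry {i}: '{key}' should be of type {expected_type.__name__}.")

    print(f"{len(entries)} entries look valid.")

if __name__ == "__main__":
    validate_dataset(sys.argv[1])

Run it with the path to your dataset, for example: python validate_dataset.py my_dataset.json (the dataset file name here is a placeholder).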

Once you've created a dataset tailored to your repository, you can fine-tune the prompts to more accurately reflect the specific code risks and evaluation criteria unique to your organization.
