Evaluation Mode
RedFlag can be run in evaluation mode to evaluate the performance of the AI model using your own custom dataset. This mode is useful for understanding how the model and prompts perform on your codebase and aids in security risk evaluation.
A sample dataset has been included in this repository as example_evaluation_dataset.json.
To run this dataset, first configure your credentials and then use the eval command:

redflag eval --dataset docs/example_evaluation_dataset.json
RedFlag will output each result as it is evaluated, and after working through all test cases it will print two summary tables.
The evaluation dataset is a file containing a list of JSON-formatted objects that match the following schema:
| Key | Type | Description |
|---|---|---|
| `repository` | string | The GitHub repository to retrieve the commit from. |
| `commit` | string | The commit SHA to evaluate. |
| `should_review` | boolean | Whether or not the commit should be flagged by RedFlag. |
| `reference` | string | Reasoning passed to the LLM that grades RedFlag's response. |
Here's what this schema looks like in practice:
[
{
"repository": "jquery/jquery",
"commit": "0293d3e30dd68bfe92be1d6d29f9b9200d1ae917",
"should_review": true,
"reference": "This PR introduces a new workflow that is making use of environment variables and secrets."
},
{
"repository": "jquery/jquery",
"commit": "bde53edcf4bd6c975d068eed4eb16c5ba09c1cff",
"should_review": false,
"reference": "It updates only a single test, no functional code."
}
]
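Because a malformed entry will only surface once the evaluation is already running, it can be worth checking the dataset file against the schema above first. The sketch below is not part of RedFlag; it simply mirrors the four keys and types from the table in this document:

```python
import json

# Expected keys and types, taken from the schema table above.
SCHEMA = {
    "repository": str,
    "commit": str,
    "should_review": bool,
    "reference": str,
}

def validate_dataset(path):
    """Return a list of (entry_index, problem) tuples; empty means valid."""
    with open(path) as f:
        entries = json.load(f)
    if not isinstance(entries, list):
        return [(0, "top-level value must be a JSON list")]
    problems = []
    for i, entry in enumerate(entries):
        for key, expected in SCHEMA.items():
            if key not in entry:
                problems.append((i, f"missing key '{key}'"))
            elif not isinstance(entry[key], expected):
                problems.append((i, f"'{key}' should be {expected.__name__}"))
    return problems
```

Running this over your dataset before invoking `redflag eval` catches missing keys and wrong types (for example, `"should_review": "yes"` instead of a boolean) up front.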
Once you've created a dataset tailored to your repository, you can fine-tune the prompts to more accurately reflect the specific code risks and evaluation criteria unique to your organization.
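One way to bootstrap such a dataset is to list recent commits from a local clone and fill in the verdicts by hand. The helper below is hypothetical (it is not a RedFlag feature), and the `should_review` and `reference` values it emits are placeholders you must edit before use:

```python
import subprocess

def dataset_skeleton(repo_path, repo_name, n=20):
    """Build unfilled dataset entries from the last n commits of a local clone."""
    shas = subprocess.run(
        ["git", "-C", repo_path, "log", f"-{n}", "--format=%H"],
        capture_output=True, text=True, check=True,
    ).stdout.split()
    return [
        {
            "repository": repo_name,   # e.g. "jquery/jquery"
            "commit": sha,
            "should_review": False,    # placeholder: set per commit
            "reference": "",           # placeholder: explain the expected verdict
        }
        for sha in shas
    ]
```

Dump the result with `json.dump(entries, f, indent=2)` and then review each entry, setting `should_review` and writing a `reference` that reflects why the commit does or does not warrant review.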