Refactor context benchmark #249

granawkins · 2023-11-06T07:24:17Z

I've tweaked the prompt somewhat and set up this workflow:

1. (not new) Generate transcripts with pytest -s tests/benchmarks/git_log_to_transcripts.py --benchmark. This adds edited_features (taken directly from the diff) and selected_features (the output of our LLMSelector). The selected_features don't reliably include all the edits, so the two are kept separate.
2. Evaluate the LLMSelector on all benchmarks with pytest -s tests/benchmarks/context_benchmark.py --benchmark. This cycles through all the benchmarks found and calculates recall and precision for 3 cases:
   - UseExpected=False, LLM=False: just the embeddings-based preselector
   - UseExpected=False, LLM=True: the LLM selector with the prompt and no expected edits
   - UseExpected=True, LLM=True: the LLM selector, given the known expected_edits from the benchmark
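
For reference, recall and precision here can be read as set overlap between the features the selector picked and the features the diff actually edited. A minimal sketch, assuming features are compared as plain strings; the function name `score_selection` and the example paths are hypothetical and not taken from the benchmark code:

```python
def score_selection(selected: set[str], edited: set[str]) -> tuple[float, float]:
    """Return (recall, precision) of selected features against edited features."""
    hits = len(selected & edited)
    # Recall: share of the edited features that the selector picked up.
    recall = hits / len(edited) if edited else 0.0
    # Precision: share of the selected features that were actually edited.
    precision = hits / len(selected) if selected else 0.0
    return recall, precision


# Hypothetical feature sets, for illustration only.
edited = {"mentat/code_context.py", "mentat/example_a.py"}
selected = {"mentat/code_context.py", "README.md", "setup.py", "tests/example_b.py"}
recall, precision = score_selection(selected, edited)
print(f"recall={recall:.2f} precision={precision:.2f}")
```

Under this reading, a selector that returns nothing scores 0 on both, and one that returns the whole repo scores recall=1.0 with very low precision, which is why both numbers are worth reporting together.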

Results are hit or miss - roughly 1/3 of the time it's perfect (recall=1.0), 1/3 it's useless (recall=0), and the rest in between.

I also added another script, evolve_llm_feature_selector.py, which I played around with. It hasn't been super useful yet, but I think it's worth hanging on to.

jakethekoenig

The prompt mutator looks like a lot of fun. Hopefully we get use out of it later. In my own testing it looks like using an LLM improves recall a fair bit.

jakethekoenig · 2023-11-06T21:35:37Z

mentat/code_context.py

@@ -48,6 +49,7 @@ def _get_all_features(
            abs_path.is_dir()
            or not is_file_text_encoded(abs_path)
            or abs_path in ignore_files
+            or os.path.getsize(abs_path) > max_chars


Curious why you added this. Was there some file in the repo you were trying to exclude?

granawkins · Nov 7, 2023

Ya, there's a massive .ipynb in one of the repos I was working with, and a huge .json file in another.
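
To illustrate the effect of that check, here is a minimal sketch of a size filter. The helper name `small_enough` and the limit of 1,000 are assumptions for the example, not the values used in mentat. Note that os.path.getsize returns a size in bytes, so comparing it to a character limit is only an approximation for non-ASCII text.

```python
import os
import tempfile
from pathlib import Path


def small_enough(path: Path, max_chars: int) -> bool:
    """Return True if the file's size in bytes does not exceed max_chars."""
    # os.path.getsize reports bytes, which matches the character count only
    # for single-byte encodings such as ASCII.
    return os.path.getsize(path) <= max_chars


with tempfile.TemporaryDirectory() as tmp:
    small = Path(tmp) / "small.py"
    big = Path(tmp) / "big.json"
    small.write_text("x = 1\n")
    big.write_text("0" * 10_000)
    # Only files under the (hypothetical) limit are kept as features.
    kept = [p.name for p in (small, big) if small_enough(p, max_chars=1_000)]
    print(kept)
```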

jakethekoenig · 2023-11-06T21:59:04Z

scripts/evolve_llm_feature_selector.py

+          2. If an 'Expected Edits' list is provided to the code-selection LLM, it *must* include the lines which are expected to be edited. This is reflected in the scores below as 'Recall'. \
+          3. To also identify relevant context to the query, such as the type-definitions of variables which will be edited, or functions which would be directly affected by the edits. \
+          4. To NOT select irrelevant files or lines of code. \
+          5. It's critical respond to this with a JSON-parsable list of strings (one for each prompt). \
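
Point 5 matters if the reply is parsed programmatically. A minimal sketch of the kind of validation that implies, assuming a hypothetical helper `parse_prompt_list`; this is not taken from evolve_llm_feature_selector.py:

```python
import json


def parse_prompt_list(response: str, expected: int) -> list[str]:
    """Parse an LLM reply as a JSON list of exactly `expected` strings."""
    # json.loads raises json.JSONDecodeError (a ValueError subclass) on bad JSON.
    data = json.loads(response)
    if not isinstance(data, list) or not all(isinstance(item, str) for item in data):
        raise ValueError("expected a JSON list of strings")
    if len(data) != expected:
        raise ValueError(f"expected {expected} prompts, got {len(data)}")
    return data


prompts = parse_prompt_list('["prompt A", "prompt B"]', expected=2)
print(prompts)
```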


granawkins added 6 commits November 6, 2023 08:22

refactor benchmark context selection

8d524dc

add evolve_feature_selector script

20d65d6

remove unused files

9db081c

benchmark 3 different configurations

f3d4063

add evolve_script back in

b44c174

ruff fixes

ea91e80

jakethekoenig approved these changes Nov 6, 2023


granawkins merged commit 1587431 into main Nov 7, 2023
16 checks passed
