BAMBOO: A Comprehensive Benchmark for Evaluating Long Text Modeling Capacities

This repository contains the evaluation code, prompts, and datasets for the paper BAMBOO: A Comprehensive Benchmark for Evaluating Long Text Modeling Capacities.

BAMBOO is a comprehensive benchmark for analyzing the long text modeling capacities of LLMs. It comprises 10 datasets spanning 5 tasks: question answering, hallucination detection, language modeling, code completion, and text sorting. The benchmark is constructed according to the following principles:

  • Comprehensive Capacity Evaluation
  • Avoidance of Data Contamination
  • Accurate Automatic Evaluation
  • Different Length Levels

Repository Structure

  • datasets: This directory contains the data files of the benchmark. There are 10 datasets, and each dataset comes in two files of different lengths (4k and 16k).

  • evaluate.py: This Python script is used to evaluate the outputs of your long text models.

  • prompt.json: This JSON file contains the prompts used to evaluate your long context model.

  • requirements.txt: The Python packages required to run the evaluation.

  • private_eval: This directory contains the evaluation code for the private_eval dataset, adapted from PyCodeGPT.

Evaluation

Once you have obtained outputs for each dataset, you can create a Python environment named BAMBOO and evaluate them as follows:

  1. Create and activate the conda environment named BAMBOO:

    conda create -n BAMBOO python=3.10  # the Python version is an example; any recent 3.x should work
    conda activate BAMBOO
  2. Install the required Python packages by running:

    pip install -r requirements.txt
  3. Run the evaluation script on your model's outputs, passing one of the task names listed under Task Selection below:

    python evaluate.py --input_path your_file.jsonl --task <task>
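
For instance, to score predictions for the meeting QA dataset (the file name outputs/meetingqa_4k.jsonl is illustrative, not produced by this repository):

    python evaluate.py --input_path outputs/meetingqa_4k.jsonl --task meetingqa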

Output Format

Each data point (line) in your JSONL file should contain at least two keys:

  • pred: the model's prediction.
  • answer: the gold answer.
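
For example, a single line of the JSONL file might look like this (the field values are illustrative):

    {"pred": "The meeting focused on the Q3 budget.", "answer": "The meeting covered the Q3 budget plan."}

If you generate predictions with your own inference script, a minimal Python sketch for producing this file could look like the following (the record contents and the file name are placeholders):

    import json

    # Pairs of model predictions and gold answers; values are placeholders.
    records = [
        {"pred": "The meeting focused on the Q3 budget.",
         "answer": "The meeting covered the Q3 budget plan."},
    ]

    # Write one JSON object per line, as evaluate.py expects.
    with open("your_file.jsonl", "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")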

Task Selection

The task argument should be chosen from the list ['meetingqa', 'paperqa', 'altqa', 'senhallu', 'abshallu', 'meetingpred', 'showspred', 'reportsumsort', 'showssort', 'private_eval'].
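
If you keep one output file per task, a shell loop can score them all in one pass (the outputs/<task>.jsonl naming is an assumption of this example, not a requirement of the script; note that the private_eval directory also ships its own dedicated evaluation code, as described under Repository Structure):

    for task in meetingqa paperqa altqa senhallu abshallu meetingpred showspred reportsumsort showssort private_eval; do
        python evaluate.py --input_path outputs/${task}.jsonl --task ${task}
    done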

License

This repository is released under the MIT License.

Citation

If you use this benchmark or code in your research, please consider citing the original paper:

@article{dong2023bamboo,
  title={BAMBOO: A Comprehensive Benchmark for Evaluating Long Text Modeling Capacities of Large Language Models},
  author={Dong, Zican and Tang, Tianyi and Li, Junyi and Zhao, Wayne Xin and Wen, Ji-Rong},
  journal={arXiv preprint arXiv:2309.13345},
  year={2023}
}
