BAMBOO: A Comprehensive Benchmark for Evaluating Long Text Modeling Capacities

This repository contains the evaluation code, prompts, and datasets for the paper BAMBOO: A Comprehensive Benchmark for Evaluating Long Text Modeling Capacities.

BAMBOO is a comprehensive benchmark for analyzing the long text modeling capacities of LLMs. It comprises 10 datasets spanning 5 tasks: question answering, hallucination detection, language modeling, code completion, and text sorting. The benchmark is constructed according to the following principles:

  • Comprehensive Capacity Evaluation
  • Avoidance of Data Contamination
  • Accurate Automatic Evaluation
  • Different Length Levels

Repository Structure

  • datasets: This directory contains the data files of the benchmark. There are 10 datasets, and each dataset comes in two files of different lengths (4k and 16k).

  • evaluate.py: This Python script is used to evaluate the outputs of your long text models.

  • prompt.json: This JSON file contains the prompts used to evaluate your long context model.

  • requirements.txt: The Python packages required to run the evaluation.

  • private_eval: This directory contains the evaluation code for the private_eval dataset, adapted from PyCodeGPT.

Evaluation

Once you have obtained outputs for each dataset, you can create a Python environment named BAMBOO and evaluate them as follows:

  1. Create and activate the conda environment named BAMBOO:

    conda create -n BAMBOO python=3.10  # the Python version is an example; any recent 3.x should work
    conda activate BAMBOO
  2. Install the required Python packages by running:

    pip install -r requirements.txt
  3. Run the evaluation script on your model's outputs, passing one of the task names listed under Task Selection below:

    python evaluate.py --input_path your_file.jsonl --task <task>
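
For instance, to score predictions for the meeting QA dataset (the file name outputs/meetingqa_4k.jsonl is illustrative, not produced by this repository):

    python evaluate.py --input_path outputs/meetingqa_4k.jsonl --task meetingqa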

Output Format

Each data point (line) in your JSONL file should contain at least two keys:

  • pred: the model's prediction.
  • answer: the gold answer.
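
For example, a single line of the JSONL file might look like this (the field values are illustrative):

    {"pred": "The meeting focused on the Q3 budget.", "answer": "The meeting covered the Q3 budget plan."}

If you generate predictions with your own inference script, a minimal Python sketch for producing this file could look like the following (the record contents and the file name are placeholders):

    import json

    # Pairs of model predictions and gold answers; values are placeholders.
    records = [
        {"pred": "The meeting focused on the Q3 budget.",
         "answer": "The meeting covered the Q3 budget plan."},
    ]

    # Write one JSON object per line, as evaluate.py expects.
    with open("your_file.jsonl", "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")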

Task Selection

The task argument should be chosen from the list ['meetingqa', 'paperqa', 'altqa', 'senhallu', 'abshallu', 'meetingpred', 'showspred', 'reportsumsort', 'showssort', 'private_eval'].
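
If you keep one output file per task, a shell loop can score them all in one pass (the outputs/<task>.jsonl naming is an assumption of this example, not a requirement of the script; note that the private_eval directory also ships its own dedicated evaluation code, as described under Repository Structure):

    for task in meetingqa paperqa altqa senhallu abshallu meetingpred showspred reportsumsort showssort private_eval; do
        python evaluate.py --input_path outputs/${task}.jsonl --task ${task}
    done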

License

This repository is released under the MIT License.

Citation

If you use this benchmark or code in your research, please consider citing the original paper:

@article{dong2023bamboo,
  title={BAMBOO: A Comprehensive Benchmark for Evaluating Long Text Modeling Capacities of Large Language Models},
  author={Dong, Zican and Tang, Tianyi and Li, Junyi and Zhao, Wayne Xin and Wen, Ji-Rong},
  journal={arXiv preprint arXiv:2309.13345},
  year={2023}
}
