Stars
Learn how to build solutions with Large Language Models.
A resume-writing guide for developers, with formats and a variety of examples.
Awesome list of Korean Large Language Models.
A multi-purpose LLM framework for RAG and data creation.
Practical course about Large Language Models.
Code and Data for "Evaluating Correctness and Faithfulness of Instruction-Following Models for Question Answering"
🤗 Evaluate: A library for easily evaluating machine learning models and datasets.
Download market data from Yahoo! Finance's API
Implementation of the Paper "Goal-Driven Explainable Clustering via Language Descriptions"
A framework for few-shot evaluation of language models.
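Few-shot evaluation frameworks like the one above work by prepending a handful of solved examples to each test question before querying the model. A minimal sketch of that prompt assembly (function and variable names are illustrative, not taken from any of these repos):

```python
def build_fewshot_prompt(examples, question, k=2):
    """Assemble a k-shot prompt: k solved (question, answer) pairs
    followed by the unanswered test question."""
    shots = examples[:k]
    parts = [f"Q: {q}\nA: {a}" for q, a in shots]
    parts.append(f"Q: {question}\nA:")  # model completes after the final "A:"
    return "\n\n".join(parts)
```

A harness would send the returned string to the model and score the completion against the gold answer.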
A Survey of Attributions for Large Language Models
[ Text Analytics ] Development of a Korean-language LLM specialized for the legal domain.
A Native-PyTorch Library for LLM Fine-tuning
Uncertainty quantification with PyTorch
Instruct-tune LLaMA on consumer hardware
Source code of the paper: RetrievalQA: Assessing Adaptive Retrieval-Augmented Generation for Short-form Open-Domain Question Answering [Findings of ACL 2024]
Faithfulness and factuality annotations of XSum summaries from our paper "On Faithfulness and Factuality in Abstractive Summarization" (https://www.aclweb.org/anthology/2020.acl-main.173.pdf).
Forked repo from https://github.com/EleutherAI/lm-evaluation-harness/commit/1f66adc
The Universe of Evaluation. All about the evaluation for LLMs.
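Many of the QA-evaluation repos in this list score models with a normalized exact-match metric. A minimal sketch of the usual SQuAD-style normalization (lowercase, strip punctuation and English articles, collapse whitespace) — an illustrative re-implementation, not code from any repo above:

```python
import re
import string

def normalize_answer(s):
    """Lowercase, remove punctuation and articles, collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(prediction, reference):
    """1 if the normalized strings are identical, else 0."""
    return int(normalize_answer(prediction) == normalize_answer(reference))
```

Corpus-level scores are then just the mean of `exact_match` over all examples.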
Awesome LLM for NLG Evaluation Papers
Awesome-LLM-Eval: a curated list of tools, datasets/benchmarks, demos, leaderboards, papers, docs, and models, mainly for evaluation of LLMs, aiming to probe the technical boundaries of generative AI.
The papers are organized according to our survey: Evaluating Large Language Models: A Comprehensive Survey.
Awesome LLM Benchmarks to evaluate the LLMs across text, code, image, audio, video and more.
Unofficial re-implementation of "Trusting Your Evidence: Hallucinate Less with Context-aware Decoding"
Instructional learning for Aspect Based Sentiment Analysis [NAACL-2024]
Code and data for "KoDialogBench: Evaluating Conversational Understanding of Language Models with Korean Dialogue Benchmark" (LREC-COLING 2024)