AutoData

An Automated Framework to Construct Datasets for Assessing Knowledge Editing or Multi-Hop Reasoning Capability of Language Models.

Table of Contents

About The Project
Getting Started
- Prerequisites
- Pip Installation
Overview
Contributing

About The Project

The knowledge stored in language models (LMs) quickly becomes outdated, and retraining these models from scratch is often infeasible. Recently, various knowledge editing methods (e.g., SERAC, IKE, MEND, KE, ROME, MEMIT, FT-L) have been developed to inject new knowledge into LMs.

Current methods mostly perform well when editing single atomic facts, but they fail catastrophically when tested on the ripple effects of the edited knowledge. For example, if we edit a model so that the current President of the USA is Trump, then the answer to "Who is married to the current President of the USA?" should change accordingly. While many datasets for evaluating knowledge editing in LMs exist, they predominantly focus on facts from Wikidata, primarily relating to people and events.
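The ripple effect described above can be pictured with a toy fact store. This is only an illustrative sketch: the dictionary layout, the `two_hop` helper, and the placeholder names are assumptions made for the example and are not part of AutoData or of any editing method listed above.

```python
# Toy fact store: (subject, relation) -> object.
# All entries are placeholders chosen for illustration only.
facts = {
    ("USA", "president"): "Alice",
    ("Alice", "spouse"): "Bob",
    ("Carol", "spouse"): "Dave",
}


def two_hop(store, subject, rel1, rel2):
    """Answer a two-hop question by following rel1, then rel2."""
    bridge = store.get((subject, rel1))
    if bridge is None:
        return None
    return store.get((bridge, rel2))


# "Who is married to the president of the USA?" before the edit.
before = two_hop(facts, "USA", "president", "spouse")

# A single-fact edit: only the first hop is changed.
facts[("USA", "president")] = "Carol"

# The same two-hop question must now resolve through the new bridge entity.
after = two_hop(facts, "USA", "president", "spouse")

print(before, "->", after)
```

In a symbolic store the second answer updates for free. An edited LM has no such guarantee, which is why two-hop questions are a useful probe of whether an edit actually propagated.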

In other words, the data in these datasets is homogeneous. Moreover, this kind of dataset construction pipeline typically relies on manual annotation and crowdsourcing, which costs significant time and money. Therefore, I implemented AutoData, a framework that automatically constructs datasets containing various types of data based on specific needs.

Getting Started

Prerequisites

You need at least one API key for a large language model service, preferably OpenAI.
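A quick way to confirm that a key is available before running anything is sketched below. It assumes the key is supplied through the `OPENAI_API_KEY` environment variable, which is what the official OpenAI Python client reads by default; this README does not state how AutoData itself expects the key to be provided, and `find_api_key` is a hypothetical helper written for this example.

```python
import os


def find_api_key(env=os.environ, var_name="OPENAI_API_KEY"):
    """Return the key stored in `var_name`, or None if it is unset or blank."""
    key = env.get(var_name, "").strip()
    return key or None


if __name__ == "__main__":
    if find_api_key() is None:
        print("OPENAI_API_KEY is not set; export it before running the framework.")
    else:
        print("API key found.")
```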

Pip Installation

git clone https://github.com/Leo-Lsc/AutoData.git
conda create -n AutoData python=3.11.8
conda activate AutoData
cd AutoData
pip install -r requirements.txt

Overview

AutoData is a framework that uses the LangChain library and OpenAI's API to automatically construct customized datasets. AutoData consists of five modules: SubjectGenerator, QA_Generator, TripleExtractor, Interrupter and TwoHopQuestionGenerator.
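This README does not describe the internals of the modules, so the following is only a conceptual sketch of how extracted triples can be chained into a two-hop question. The `Triple` type and `compose_two_hop` function are hypothetical names introduced for illustration; they are not AutoData's API, and no LangChain or OpenAI calls are involved.

```python
from typing import NamedTuple, Optional, Tuple


class Triple(NamedTuple):
    """A (subject, relation, object) fact."""
    subject: str
    relation: str
    obj: str


def compose_two_hop(first: Triple, second: Triple) -> Optional[Tuple[str, str]]:
    """Chain two triples into a (question, answer) pair.

    The triples chain only if the object of the first is the subject
    of the second; otherwise None is returned.
    """
    if first.obj != second.subject:
        return None
    question = (
        f"What is the {second.relation} of the "
        f"{first.relation} of {first.subject}?"
    )
    return question, second.obj


t1 = Triple("Hamlet", "author", "William Shakespeare")
t2 = Triple("William Shakespeare", "birthplace", "Stratford-upon-Avon")

result = compose_two_hop(t1, t2)
print(result)
```

The question never names the bridge entity (here, the author), so answering it requires the model to resolve both hops itself.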

Contributing

If you have a suggestion that would make this better, please fork the repo and create a pull request. Any contributions you make are greatly appreciated. Don't forget to give the project a star! Thanks!

Contributors

Leo-Lsc

Citation

Please use the following citation if you intend to use AutoData:

@misc{AutoDataFramework,
  title={AutoData},
  author={Sicheng Lai},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/Leo-Lsc/AutoData}},
}
