A curated list of security-related papers, articles, and resources focused on Large Language Models (LLMs). This repository aims to provide researchers, practitioners, and enthusiasts with insights into the security implications, challenges, and advancements surrounding these powerful models.

deltaaruna/Awesome-LLM-Safety

 
 


🛡️Awesome LLM-Safety🛡️

English | 中文

🤗Introduction

Welcome to our Awesome-llm-safety repository! 🥰🥰🥰

🧑‍💻 Our Work

We've curated a collection of the latest 😋, most comprehensive 😎, and most valuable 🤩 resources on large language model safety (llm-safety). Beyond papers, we also include relevant talks, tutorials, conferences, news, and articles. The repository is updated regularly so that you have current information at your fingertips.

If a resource is relevant to multiple subcategories, we place it under each applicable section. For instance, the "Awesome-LLM-Safety" repository is listed under every subcategory to which it pertains 🤩!

✔️ Perfect for the Majority

  • For beginners curious about llm-safety, our repository serves as a compass for grasping the big picture and diving into the details. Classic or influential papers retained in the README provide beginner-friendly navigation through interesting directions in the field;
  • For seasoned researchers, this repository is a tool to keep you informed and fill any gaps in your knowledge. Within each subtopic, we are diligently updating all the latest content and continuously backfilling with previous work. Our thorough compilation and careful selection are time-savers for you.

🧭 How to Use this Guide

  • Quick Start: In the README, users can find a curated selection of resources sorted by date, along with links to related material.
  • In-Depth Exploration: If you have a special interest in a particular subtopic, delve into the "subtopic" folder for more. Each item, be it an article or piece of news, comes with a brief introduction, allowing researchers to swiftly zero in on relevant content.

Let’s start the LLM Safety tutorial!


🚀Table of Contents


🔐Security

📑Papers

| Date | Institute | Publication | Paper |
| --- | --- | --- | --- |
| 20.10 | Facebook AI Research | arXiv | Recipes for Safety in Open-domain Chatbots |
| 22.03 | OpenAI | NeurIPS2022 | Training language models to follow instructions with human feedback |
| 23.07 | UC Berkeley | NeurIPS2023 | Jailbroken: How Does LLM Safety Training Fail? |
| 23.12 | OpenAI | OpenAI | Practices for Governing Agentic AI Systems |

📖Tutorials, Articles, Presentations and Talks

| Date | Type | Title | URL |
| --- | --- | --- | --- |
| 22.02 | Toxicity Detection API | Perspective API | link, paper |
| 23.07 | Repository | Awesome LLM Security | link |
| 23.10 | Tutorials | Awesome-LLM-Safety | link |

Other

👉Latest&Comprehensive Security Paper


🔏Privacy

📑Papers

| Date | Institute | Publication | Paper |
| --- | --- | --- | --- |
| 19.12 | Microsoft | CCS2020 | Analyzing Information Leakage of Updates to Natural Language Models |
| 21.07 | Google Research | ACL2022 | Deduplicating Training Data Makes Language Models Better |
| 21.10 | Stanford | ICLR2022 | Large language models can be strong differentially private learners |
| 22.02 | Google Research | ICLR2023 | Quantifying Memorization Across Neural Language Models |
| 22.02 | UNC Chapel Hill | ICML2022 | Deduplicating Training Data Mitigates Privacy Risks in Language Models |

📖Tutorials, Articles, Presentations and Talks

| Date | Type | Title | URL |
| --- | --- | --- | --- |
| 23.10 | Tutorials | Awesome-LLM-Safety | link |

Other

👉Latest&Comprehensive Privacy Paper


📰Truthfulness & Misinformation

📑Papers

| Date | Institute | Publication | Paper |
| --- | --- | --- | --- |
| 21.09 | University of Oxford | ACL2022 | TruthfulQA: Measuring How Models Mimic Human Falsehoods |
| 23.11 | Harbin Institute of Technology | arXiv | A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions |
| 23.11 | Arizona State University | arXiv | Can Knowledge Graphs Reduce Hallucinations in LLMs? : A Survey |

📖Tutorials, Articles, Presentations and Talks

| Date | Type | Title | URL |
| --- | --- | --- | --- |
| 23.07 | Repository | llm-hallucination-survey | link |
| 23.10 | Repository | LLM-Factuality-Survey | link |
| 23.10 | Tutorials | Awesome-LLM-Safety | link |

Other

👉Latest&Comprehensive Truthfulness&Misinformation Paper


😈JailBreak & Attacks

📑Papers

| Date | Institute | Publication | Paper |
| --- | --- | --- | --- |
| 20.12 | Google | USENIX Security 2021 | Extracting Training Data from Large Language Models |
| 22.11 | AE Studio | NeurIPS2022 (ML Safety Workshop) | Ignore Previous Prompt: Attack Techniques For Language Models |
| 23.06 | Google | arXiv | Are aligned neural networks adversarially aligned? |
| 23.07 | CMU | arXiv | Universal and Transferable Adversarial Attacks on Aligned Language Models |
| 23.10 | University of Pennsylvania | arXiv | Jailbreaking Black Box Large Language Models in Twenty Queries |

📖Tutorials, Articles, Presentations and Talks

| Date | Type | Title | URL |
| --- | --- | --- | --- |
| 23.01 | Community | Reddit/ChatGPTJailbreak | link |
| 23.02 | Resource&Tutorials | Jailbreak Chat | link |
| 23.10 | Tutorials | Awesome-LLM-Safety | link |
| 23.10 | Article | Adversarial Attacks on LLMs (Author: Lilian Weng) | link |
| 23.11 | Video | [1hr Talk] Intro to Large Language Models, from 45:45 (Author: Andrej Karpathy) | link |
| 24.06 | Article | How Anyone can Hack ChatGPT -GPT4o!! (Author: Aruna Withanage) | link |

Other

👉Latest&Comprehensive JailBreak & Attacks Paper


🛡️Defenses

📑Papers

| Date | Institute | Publication | Paper |
| --- | --- | --- | --- |
| 21.07 | Google Research | ACL2022 | Deduplicating Training Data Makes Language Models Better |
| 22.04 | Anthropic | arXiv | Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback |

📖Tutorials, Articles, Presentations and Talks

| Date | Type | Title | URL |
| --- | --- | --- | --- |
| 23.10 | Tutorials | Awesome-LLM-Safety | link |

Other

👉Latest&Comprehensive Defenses Paper


💯Datasets & Benchmark

📑Papers

| Date | Institute | Publication | Paper |
| --- | --- | --- | --- |
| 20.09 | University of Washington | EMNLP2020 (Findings) | RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models |
| 21.09 | University of Oxford | ACL2022 | TruthfulQA: Measuring How Models Mimic Human Falsehoods |
| 22.03 | MIT | ACL2022 | ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection |

📖Tutorials, Articles, Presentations and Talks

| Date | Type | Title | URL |
| --- | --- | --- | --- |
| 23.10 | Tutorials | Awesome-LLM-Safety | link |

📚Resource📚

Other

👉Latest&Comprehensive Datasets & Benchmark Paper


🧑‍🏫 Scholars 👩‍🏫

In this section, we list some of the scholars we consider to be experts in the field of LLM Safety!

| Scholar | Homepage & Google Scholar | Keywords or Interests |
| --- | --- | --- |
| Nicholas Carlini | Homepage, Google Scholar | the intersection of machine learning and computer security&neural networks from an adversarial perspective |
| Daphne Ippolito | Google Scholar | Natural Language Processing |
| Chiyuan Zhang | Homepage, Google Scholar | Especially interested in understanding the generalization and memorization in machine and human learning, as well as implications in related areas like privacy |
| Katherine Lee | Google Scholar | natural language processing&translation&machine learning&computational neuroscience&attention |
| Florian Tramèr | Homepage, Google Scholar | Computer Security&Machine Learning&Cryptography&the worst-case behavior of Deep Learning systems from an adversarial perspective, to understand and mitigate long-term threats to the safety and privacy of users |
| Jindong Wang | Homepage, Google Scholar | Large Language Models (LLMs) evaluation and robustness enhancement |
| Chaowei Xiao | Homepage, Google Scholar | interested in exploring trustworthiness problems in (MultiModal) Large Language Models and studying the role of LLMs in different application domains |
| Andy Zou | Homepage, Google Scholar | ML Safety&AI Safety |

🧑‍🎓Author

🤗If you have any questions, please contact our authors!🤗

✉️: ydyjya ➡️ zhouzhenhong@bupt.edu.cn

💬: LLM Safety Discussion


⬆ Back to ToC
