Security

Different from the main README🕵️

  • Within this subtopic, we keep adding the latest articles so that researchers in this area can quickly grasp recent trends.
  • In addition to the most recent updates, we attach keywords to each entry to help you find content of interest more quickly.
  • Within each subtopic, we also maintain profiles of scholars we admire and endorse in the field; their work is often of high quality and forward-looking!

📑Papers

| Date (YY.MM) | Institute | Publication | Paper | Keywords |
| --- | --- | --- | --- | --- |
| 20.10 | Facebook AI Research | arxiv | Recipes for Safety in Open-domain Chatbots | Toxic Behavior&Open-domain |
| 22.02 | DeepMind | EMNLP2022 | Red Teaming Language Models with Language Models | Red Teaming&Harm Test |
| 22.03 | OpenAI | NIPS2022 | Training language models to follow instructions with human feedback | InstructGPT&RLHF&Harmless |
| 22.04 | Anthropic | arxiv | Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback | Helpful&Harmless |
| 22.05 | UCSD | EMNLP2022 | An Empirical Analysis of Memorization in Fine-tuned Autoregressive Language Models | Privacy Risks&Memorization |
| 22.09 | Anthropic | arxiv | Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned | Red Teaming&Harmless&Helpful |
| 22.12 | Anthropic | arxiv | Constitutional AI: Harmlessness from AI Feedback | Harmless&Self-improvement&RLAIF |
| 23.07 | UC Berkeley | NIPS2023 | Jailbroken: How Does LLM Safety Training Fail? | Jailbreak&Competing Objectives&Mismatched Generalization |
| 23.08 | The Chinese University of Hong Kong (Shenzhen), Tencent AI Lab, The Chinese University of Hong Kong | arxiv | GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs Via Cipher | Safety Alignment&Adversarial Attack |
| 23.08 | University College London, Tilburg University | arxiv | Use of LLMs for Illicit Purposes: Threats, Prevention Measures, and Vulnerabilities | Security&AI Alignment |
| 23.09 | Peking University | arxiv | RAIN: Your Language Models Can Align Themselves without Finetuning | Self-boosting&Rewind Mechanisms |
| 23.10 | Princeton University, Virginia Tech, IBM Research, Stanford University | arxiv | Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To! | Fine-tuning&Safety Risks&Adversarial Training |
| 23.10 | UC Riverside | arxiv | Survey of Vulnerabilities in Large Language Models Revealed by Adversarial Attacks | Adversarial Attacks&Vulnerabilities&Model Security |
| 23.11 | KAIST AI | arxiv | HARE: Explainable Hate Speech Detection with Step-by-Step Reasoning | Hate Speech&Detection |
| 23.11 | CMU | AACL2023 (ART of Safety workshop) | Measuring Adversarial Datasets | Adversarial Robustness&AI Safety&Adversarial Datasets |
| 23.11 | UIUC | arxiv | Removing RLHF Protections in GPT-4 via Fine-Tuning | Remove Protection&Fine-Tuning |
| 23.11 | IT University of Copenhagen, University of Washington | arxiv | Summon a Demon and Bind it: A Grounded Theory of LLM Red Teaming in the Wild | Red Teaming |
| 23.11 | Fudan University, Shanghai AI Lab | arxiv | Fake Alignment: Are LLMs Really Aligned Well? | Alignment Failure&Safety Evaluation |
| 23.11 | University of Southern California | arxiv | SAFER-INSTRUCT: Aligning Language Models with Automated Preference Data | RLHF&Safety |
| 23.11 | Google Research | arxiv | AART: AI-Assisted Red-Teaming with Diverse Data Generation for New LLM-powered Applications | Adversarial Testing&AI-Assisted Red Teaming&Application Safety |
| 23.11 | Tencent AI Lab | arxiv | Adversarial Preference Optimization | Human Preference Alignment&Adversarial Preference Optimization&Annotation Reduction |
| 23.11 | Docta.ai | arxiv | Unmasking and Improving Data Credibility: A Study with Datasets for Training Harmless Language Models | Data Credibility&Safety Alignment |
| 23.11 | CIIRC CTU in Prague | arxiv | A Security Risk Taxonomy for Large Language Models | Security Risks&Taxonomy&Prompt-based Attacks |
| 23.11 | Meta, University of Illinois Urbana-Champaign | NAACL2024 | MART: Improving LLM Safety with Multi-round Automatic Red-Teaming | Automatic Red-Teaming&LLM Safety&Adversarial Prompt Writing |
| 23.11 | The Ohio State University, University of California Davis | NAACL2024 | How Trustworthy are Open-Source LLMs? An Assessment under Malicious Demonstrations Shows their Vulnerabilities | Open-Source LLMs&Malicious Demonstrations&Trustworthiness |
| 23.12 | Drexel University | arxiv | A Survey on Large Language Model (LLM) Security and Privacy: The Good, the Bad, and the Ugly | Security&Privacy&Attacks |
| 23.12 | Tenyx | arxiv | Characterizing Large Language Model Geometry Solves Toxicity Detection and Generation | Geometric Interpretation&Intrinsic Dimension&Toxicity Detection |
| 23.12 | Independent (now at Google DeepMind) | arxiv | Scaling Laws for Adversarial Attacks on Language Model Activations | Adversarial Attacks&Language Model Activations&Scaling Laws |
| 23.12 | University of Liechtenstein, University of Duesseldorf | arxiv | Negotiating with LLMs: Prompt Hacks, Skill Gaps, and Reasoning Deficits | Negotiation&Reasoning&Prompt Hacking |
| 23.12 | University of Wisconsin-Madison, University of Michigan Ann Arbor, ASU, Washington University | arxiv | Exploring the Limits of ChatGPT in Software Security Applications | Software Security |
| 23.12 | GenAI at Meta | arxiv | Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations | Human-AI Conversation&Safety Risk Taxonomy |
| 23.12 | University of California Riverside, Microsoft | arxiv | Safety Alignment in NLP Tasks: Weakly Aligned Summarization as an In-Context Attack | Safety Alignment&Summarization&Vulnerability |
| 23.12 | MIT, Harvard | NIPS2023 (Workshop) | Forbidden Facts: An Investigation of Competing Objectives in Llama-2 | Competing Objectives&Forbidden Fact Task&Model Decomposition |
| 23.12 | University of Science and Technology of China | arxiv | Silent Guardian: Protecting Text from Malicious Exploitation by Large Language Models | Text Protection&Silent Guardian |
| 23.12 | OpenAI | OpenAI | Practices for Governing Agentic AI Systems | Agentic AI Systems&LM-based Agent |
| 23.12 | University of Massachusetts Amherst, Columbia University, Google, Stanford University, New York University | arxiv | Learning and Forgetting Unsafe Examples in Large Language Models | Safety Issues&ForgetFilter Algorithm&Unsafe Content |
| 23.12 | Tencent AI Lab, The Chinese University of Hong Kong | arxiv | Aligning Language Models with Judgments | Judgment Alignment&Contrastive Unlikelihood Training |
| 24.01 | Delft University of Technology | arxiv | Red Teaming for Large Language Models At Scale: Tackling Hallucinations on Mathematics Tasks | Red Teaming&Hallucinations&Mathematics Tasks |
| 24.01 | Apart Research, University of Edinburgh, Imperial College London, University of Oxford | arxiv | Large Language Models Relearn Removed Concepts | Neuroplasticity&Concept Redistribution |
| 24.01 | Tsinghua University, Xiaomi AI Lab, Huawei, Shenzhen Heytap Technology, vivo AI Lab, Viomi Technology, Li Auto, Beijing University of Posts and Telecommunications, Soochow University | arxiv | Personal LLM Agents: Insights and Survey about the Capability, Efficiency and Security | Intelligent Personal Assistant&LLM Agent&Security and Privacy |
| 24.01 | Zhongguancun Laboratory, Tsinghua University, Institute of Information Engineering Chinese Academy of Sciences, Ant Group | arxiv | Risk Taxonomy, Mitigation, and Assessment Benchmarks of Large Language Model Systems | Safety&Risk Taxonomy&Mitigation Strategies |
| 24.01 | Google Research | arxiv | Patchscopes: A Unifying Framework for Inspecting Hidden Representations of Language Models | Interpretability |
| 24.01 | Ben-Gurion University of the Negev, Israel | arxiv | GPT in Sheep’s Clothing: The Risk of Customized GPTs | GPTs&Cybersecurity&ChatGPT |
| 24.01 | Shanghai Jiao Tong University | arxiv | R-Judge: Benchmarking Safety Risk Awareness for LLM Agents | LLM Agents&Safety Risk Awareness&Benchmark |
| 24.01 | Ant Group | arxiv | A Fast, Performant, Secure Distributed Training Framework for LLM | Distributed LLM&Security |
| 24.01 | Shanghai Artificial Intelligence Laboratory, Dalian University of Technology, University of Science and Technology of China | arxiv | PsySafe: A Comprehensive Framework for Psychological-based Attack, Defense, and Evaluation of Multi-agent System Safety | Multi-agent Systems&Agent Psychology&Safety |
| 24.01 | Rochester Institute of Technology | arxiv | Mitigating Security Threats in LLMs | Security Threats&Prompt Injection&Jailbreaking |
| 24.01 | Johns Hopkins University, University of Pennsylvania, Ohio State University | arxiv | The Language Barrier: Dissecting Safety Challenges of LLMs in Multilingual Contexts | Multilingualism&Safety&Resource Disparity |
| 24.01 | University of Florida | arxiv | Adaptive Text Watermark for Large Language Models | Text Watermarking&Robustness&Security |
| 24.01 | The Hebrew University | arxiv | Tradeoffs Between Alignment and Helpfulness in Language Models | Language Model Alignment&AI Safety&Representation Engineering |
| 24.01 | Google Research, Anthropic | arxiv | Gradient-Based Language Model Red Teaming | Red Teaming&Safety&Prompt Learning |
| 24.01 | National University of Singapore, Pennsylvania State University | arxiv | Provably Robust Multi-bit Watermarking for AI-generated Text via Error Correction Code | Watermarking&Error Correction Code&AI Ethics |
| 24.01 | Tsinghua University, University of California Los Angeles, WeChat AI (Tencent Inc.) | arxiv | Prompt-Driven LLM Safeguarding via Directed Representation Optimization | Safety Prompts&Representation Optimization |
| 24.02 | Rensselaer Polytechnic Institute, IBM T.J. Watson Research Center, IBM Research | arxiv | Adaptive Primal-Dual Method for Safe Reinforcement Learning | Safe Reinforcement Learning&Adaptive Primal-Dual&Adaptive Learning Rates |
| 24.02 | Jagiellonian University, University of Modena and Reggio Emilia, Alma Mater Studiorum University of Bologna, European University Institute | arxiv | No More Trade-Offs: GPT and Fully Informative Privacy Policies | ChatGPT&Privacy Policies&Legal Requirements |
| 24.02 | Florida International University | arxiv | Security and Privacy Challenges of Large Language Models: A Survey | Security&Privacy Challenges&Survey |
| 24.02 | Rutgers University, University of California Santa Barbara, NEC Labs America | arxiv | TrustAgent: Towards Safe and Trustworthy LLM-based Agents through Agent Constitution | LLM-based Agents&Safety&Trustworthiness |
| 24.02 | University of Maryland College Park, JPMorgan AI Research, University of Waterloo, Salesforce Research | arxiv | Shadowcast: Stealthy Data Poisoning Attacks against VLMs | Vision-Language Models&Data Poisoning&Security |
| 24.02 | Shanghai Artificial Intelligence Laboratory, Harbin Institute of Technology, Beijing Institute of Technology, Chinese University of Hong Kong | arxiv | SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models | Safety Benchmark&Safety Evaluation&Hierarchical Taxonomy |
| 24.02 | Fudan University | arxiv | ToolSword: Unveiling Safety Issues of Large Language Models in Tool Learning Across Three Stages | Tool Learning&Large Language Models (LLMs)&Safety Issues&ToolSword |
| 24.02 | Paul G. Allen School of Computer Science & Engineering, University of Washington | arxiv | SPML: A DSL for Defending Language Models Against Prompt Attacks | Domain-Specific Language (DSL)&Chatbot Definitions&System Prompt Meta Language (SPML) |
| 24.02 | Tsinghua University | arxiv | ShieldLM: Empowering LLMs as Aligned, Customizable and Explainable Safety Detectors | Safety Detectors&Customizable&Explainable |
| 24.02 | Dalhousie University | arxiv | Immunization Against Harmful Fine-tuning Attacks | Fine-tuning Attacks&Immunization |
| 24.02 | Chinese Academy of Sciences, University of Chinese Academy of Sciences, Alibaba Group | arxiv | SoFA: Shielded On-the-fly Alignment via Priority Rule Following | Priority Rule Following&Alignment |
| 24.02 | Universidade Federal de Santa Catarina | arxiv | A Survey of Large Language Models in Cybersecurity | Cybersecurity&Vulnerability Assessment |
| 24.02 | Zhejiang University | arxiv | PRSA: Prompt Reverse Stealing Attacks against Large Language Models | Prompt Reverse Stealing Attacks&Security |
| 24.02 | Shanghai Artificial Intelligence Laboratory | NAACL2024 | Attacks, Defenses and Evaluations for LLM Conversation Safety: A Survey | Large Language Models&Conversation Safety&Survey |
| 24.03 | Tulane University | arxiv | Enhancing LLM Safety via Constrained Direct Preference Optimization | Reinforcement Learning&Human Feedback&Safety Constraints |
| 24.03 | University of Illinois Urbana-Champaign | arxiv | InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents | Tool Integration&Security&Indirect Prompt Injection |
| 24.03 | Harvard University | arxiv | Towards Safe and Aligned Large Language Models for Medicine | Medical Safety&Alignment&Ethical Principles |
| 24.03 | Rensselaer Polytechnic Institute, University of Michigan, IBM Research, MIT-IBM Watson AI Lab | arxiv | Aligners: Decoupling LLMs and Alignment | Alignment&Synthetic Data |
| 24.03 | MIT, Princeton University, Stanford University, Georgetown University, AI Risk and Vulnerability Alliance, Eleuther AI, Brown University, Carnegie Mellon University, Virginia Tech, Northeastern University, UCSB, University of Pennsylvania, UIUC | arxiv | A Safe Harbor for AI Evaluation and Red Teaming | AI Evaluation&Red Teaming&Safe Harbor |
| 24.03 | University of Southern California | arxiv | Logits of API-Protected LLMs Leak Proprietary Information | API-Protected LLMs&Softmax Bottleneck&Embedding Size Detection |
| 24.03 | University of Bristol | arxiv | Helpful or Harmful? Exploring the Efficacy of Large Language Models for Online Grooming Prevention | Safety&Prompt Engineering |
| 24.03 | Xiamen University, Yanshan University, IDEA Research, Inner Mongolia University, Microsoft, Microsoft Research Asia | arxiv | Ensuring Safe and High-Quality Outputs: A Guideline Library Approach for Language Models | Safety&Guidelines&Alignment |
| 24.03 | Tianjin University, Zhengzhou University, China Academy of Information and Communications Technology | arxiv | OpenEval: Benchmarking Chinese LLMs across Capability, Alignment, and Safety | Chinese LLMs&Benchmarking&Safety |
| 24.03 | Center for Cybersecurity Systems and Networks, AIShield (Bosch Global Software Technologies, Bengaluru, India) | arxiv | Mapping LLM Security Landscapes: A Comprehensive Stakeholder Risk Assessment Proposal | LLM Security&Threat Modeling&Risk Assessment |
| 24.03 | Queen’s University Belfast | arxiv | AI Safety: Necessary but insufficient and possibly problematic | AI Safety&Transparency&Structural Harm |
| 24.04 | Provable Responsible AI and Data Analytics (PRADA) Lab, King Abdullah University of Science and Technology | arxiv | Dialectical Alignment: Resolving the Tension of 3H and Security Threats of LLMs | Dialectical Alignment&3H Principle&Security Threats |
| 24.04 | LibrAI, Tsinghua University, Harbin Institute of Technology, Monash University, The University of Melbourne, MBZUAI | arxiv | Against The Achilles’ Heel: A Survey on Red Teaming for Generative Models | Red Teaming&Safety |
| 24.04 | University of California Santa Barbara, Meta AI | arxiv | Towards Safety and Helpfulness Balanced Responses via Controllable Large Language Models | Safety&Helpfulness&Controllability |
| 24.04 | School of Information and Software Engineering, University of Electronic Science and Technology of China | arxiv | Exploring Backdoor Vulnerabilities of Chat Models | Backdoor Attacks&Chat Models&Security |
| 24.04 | Enkrypt AI | arxiv | Increased LLM Vulnerabilities from Fine-tuning and Quantization | Fine-tuning&Quantization&LLM Vulnerabilities |
| 24.04 | Tongji University, Tsinghua University, Beijing University of Technology, Nanyang Technological University, Peng Cheng Laboratory | arxiv | Unbridled Icarus: A Survey of the Potential Perils of Image Inputs in Multimodal Large Language Model Security | Multimodal Large Language Models&Security Vulnerabilities&Image Inputs |
| 24.04 | University of Washington, Carnegie Mellon University, University of British Columbia, Vector Institute for AI | arxiv | CulturalTeaming: AI-Assisted Interactive Red-Teaming for Challenging LLMs’ (Lack of) Multicultural Knowledge | AI-Assisted Red-Teaming&Multicultural Knowledge |
| 24.04 | Nanjing University | DLSP 2024 | Subtoxic Questions: Dive Into Attitude Change of LLM’s Response in Jailbreak Attempts | Jailbreak&Subtoxic Questions&GAC Model |
| 24.04 | Innodata | arxiv | Benchmarking Llama2, Mistral, Gemma, and GPT for Factuality, Toxicity, Bias, and Propensity for Hallucinations | Evaluation&Safety |
| 24.04 | University of Cambridge, New York University, ETH Zurich | arxiv | Foundational Challenges in Assuring Alignment and Safety of Large Language Models | Alignment&Safety |
| 24.04 | Zhejiang University | arxiv | TransLinkGuard: Safeguarding Transformer Models Against Model Stealing in Edge Deployment | Intellectual Property Protection&Edge-deployed Transformer Model |
| 24.04 | Harvard University | arxiv | More RLHF, More Trust? On The Impact of Human Preference Alignment On Language Model Trustworthiness | Reinforcement Learning from Human Feedback&Trustworthiness |
| 24.05 | University of Maryland | arxiv | Constrained Decoding for Secure Code Generation | Code Generation&Code LLM&Secure Code Generation&AI Safety |
| 24.05 | Huazhong University of Science and Technology | arxiv | Large Language Models for Cyber Security: A Systematic Literature Review | Cybersecurity&Systematic Review |
| 24.04 | CSIRO’s Data61 | ACM International Conference on AI-powered Software | An AI System Evaluation Framework for Advancing AI Safety: Terminology, Taxonomy, Lifecycle Mapping | AI Safety&Evaluation Framework&AI Lifecycle Mapping |
| 24.05 | CSAIL and CBMM, MIT | arxiv | SecureLLM: Using Compositionality to Build Provably Secure Language Models for Private, Sensitive, and Secret Data | SecureLLM&Compositionality |
| 24.05 | Carnegie Mellon University | arxiv | Human–AI Safety: A Descendant of Generative AI and Control Systems Safety | Human–AI Safety&Generative AI |
| 24.05 | GSAI, POSTECH | arxiv | Adversarial DPO: Harnessing Harmful Data for Reducing Toxicity with Minimal Impact on Coherence and Evasiveness in Dialogue Agents | Adversarial DPO&Reducing Toxicity&Dialogue Agents |

💻Presentations & Talks

📖Tutorials & Workshops

| Date (YY.MM) | Type | Title | URL |
| --- | --- | --- | --- |
| 23.10 | Tutorials | Awesome-LLM-Safety | link |

📰News & Articles

| Date (YY.MM) | Type | Title | URL |
| --- | --- | --- | --- |
| 23.01 | Video | ChatGPT and InstructGPT: Aligning Language Models to Human Intention | link |
| 23.06 | Report | “Dual-use dilemma” for GenAI Workshop Summarization | link |
| 23.10 | News | Joint Statement on AI Safety and Openness | link |

🧑‍🏫Scholars