Security

Different from the main README🕵️

  • Within this subtopic, we keep adding the latest articles so that researchers in this area can quickly grasp recent trends.
  • In addition to the most recent updates, we attach keywords to each entry to help you find content of interest more quickly.
  • Within each subtopic, we also maintain profiles of scholars we admire and endorse in the field; their work is often of high quality and forward-looking!

📑Papers

| Date (YY.MM) | Institute | Publication | Paper | Keywords |
| --- | --- | --- | --- | --- |
| 20.10 | Facebook AI Research | arxiv | Recipes for Safety in Open-domain Chatbots | Toxic Behavior&Open-domain |
| 22.02 | DeepMind | EMNLP2022 | Red Teaming Language Models with Language Models | Red Teaming&Harm Test |
| 22.03 | OpenAI | NIPS2022 | Training language models to follow instructions with human feedback | InstructGPT&RLHF&Harmless |
| 22.04 | Anthropic | arxiv | Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback | Helpful&Harmless |
| 22.05 | UCSD | EMNLP2022 | An Empirical Analysis of Memorization in Fine-tuned Autoregressive Language Models | Privacy Risks&Memorization |
| 22.09 | Anthropic | arxiv | Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned | Red Teaming&Harmless&Helpful |
| 22.12 | Anthropic | arxiv | Constitutional AI: Harmlessness from AI Feedback | Harmless&Self-improvement&RLAIF |
| 23.07 | UC Berkeley | NIPS2023 | Jailbroken: How Does LLM Safety Training Fail? | Jailbreak&Competing Objectives&Mismatched Generalization |
| 23.08 | The Chinese University of Hong Kong (Shenzhen), Tencent AI Lab, The Chinese University of Hong Kong | arxiv | GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs Via Cipher | Safety Alignment&Adversarial Attack |
| 23.08 | University College London, Tilburg University | arxiv | Use of LLMs for Illicit Purposes: Threats, Prevention Measures, and Vulnerabilities | Security&AI Alignment |
| 23.09 | Peking University | arxiv | RAIN: Your Language Models Can Align Themselves without Finetuning | Self-boosting&Rewind Mechanisms |
| 23.10 | Princeton University, Virginia Tech, IBM Research, Stanford University | arxiv | Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To! | Fine-tuning&Safety Risks&Adversarial Training |
| 23.10 | UC Riverside | arxiv | Survey of Vulnerabilities in Large Language Models Revealed by Adversarial Attacks | Adversarial Attacks&Vulnerabilities&Model Security |
| 23.11 | KAIST AI | arxiv | HARE: Explainable Hate Speech Detection with Step-by-Step Reasoning | Hate Speech&Detection |
| 23.11 | CMU | AACL2023 (ART of Safety workshop) | Measuring Adversarial Datasets | Adversarial Robustness&AI Safety&Adversarial Datasets |
| 23.11 | UIUC | arxiv | Removing RLHF Protections in GPT-4 via Fine-Tuning | Remove Protection&Fine-Tuning |
| 23.11 | IT University of Copenhagen, University of Washington | arxiv | Summon a Demon and Bind it: A Grounded Theory of LLM Red Teaming in the Wild | Red Teaming |
| 23.11 | Fudan University, Shanghai AI Lab | arxiv | Fake Alignment: Are LLMs Really Aligned Well? | Alignment Failure&Safety Evaluation |
| 23.11 | University of Southern California | arxiv | SAFER-INSTRUCT: Aligning Language Models with Automated Preference Data | RLHF&Safety |
| 23.11 | Google Research | arxiv | AART: AI-Assisted Red-Teaming with Diverse Data Generation for New LLM-powered Applications | Adversarial Testing&AI-Assisted Red Teaming&Application Safety |
| 23.11 | Tencent AI Lab | arxiv | Adversarial Preference Optimization | Human Preference Alignment&Adversarial Preference Optimization&Annotation Reduction |
| 23.11 | Docta.ai | arxiv | Unmasking and Improving Data Credibility: A Study with Datasets for Training Harmless Language Models | Data Credibility&Safety Alignment |
| 23.11 | CIIRC CTU in Prague | arxiv | A Security Risk Taxonomy for Large Language Models | Security Risks&Taxonomy&Prompt-based Attacks |
| 23.11 | Meta, University of Illinois Urbana-Champaign | NAACL2024 | MART: Improving LLM Safety with Multi-round Automatic Red-Teaming | Automatic Red-Teaming&LLM Safety&Adversarial Prompt Writing |
| 23.11 | The Ohio State University, University of California Davis | NAACL2024 | How Trustworthy are Open-Source LLMs? An Assessment under Malicious Demonstrations Shows their Vulnerabilities | Open-Source LLMs&Malicious Demonstrations&Trustworthiness |
| 23.12 | Drexel University | arxiv | A Survey on Large Language Model (LLM) Security and Privacy: The Good, the Bad, and the Ugly | Security&Privacy&Attacks |
| 23.12 | Tenyx | arxiv | Characterizing Large Language Model Geometry Solves Toxicity Detection and Generation | Geometric Interpretation&Intrinsic Dimension&Toxicity Detection |
| 23.12 | Independent (now at Google DeepMind) | arxiv | Scaling Laws for Adversarial Attacks on Language Model Activations | Adversarial Attacks&Language Model Activations&Scaling Laws |
| 23.12 | University of Liechtenstein, University of Duesseldorf | arxiv | Negotiating with LLMs: Prompt Hacks, Skill Gaps, and Reasoning Deficits | Negotiation&Reasoning&Prompt Hacking |
| 23.12 | University of Wisconsin-Madison, University of Michigan Ann Arbor, ASU, Washington University | arxiv | Exploring the Limits of ChatGPT in Software Security Applications | Software Security |
| 23.12 | GenAI at Meta | arxiv | Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations | Human-AI Conversation&Safety Risk Taxonomy |
| 23.12 | University of California Riverside, Microsoft | arxiv | Safety Alignment in NLP Tasks: Weakly Aligned Summarization as an In-Context Attack | Safety Alignment&Summarization&Vulnerability |
| 23.12 | MIT, Harvard | NIPS2023 (Workshop) | Forbidden Facts: An Investigation of Competing Objectives in Llama-2 | Competing Objectives&Forbidden Fact Task&Model Decomposition |
| 23.12 | University of Science and Technology of China | arxiv | Silent Guardian: Protecting Text from Malicious Exploitation by Large Language Models | Text Protection&Silent Guardian |
| 23.12 | OpenAI | OpenAI | Practices for Governing Agentic AI Systems | Agentic AI Systems&LM-based Agent |
| 23.12 | University of Massachusetts Amherst, Columbia University, Google, Stanford University, New York University | arxiv | Learning and Forgetting Unsafe Examples in Large Language Models | Safety Issues&ForgetFilter Algorithm&Unsafe Content |
| 23.12 | Tencent AI Lab, The Chinese University of Hong Kong | arxiv | Aligning Language Models with Judgments | Judgment Alignment&Contrastive Unlikelihood Training |
| 24.01 | Delft University of Technology | arxiv | Red Teaming for Large Language Models At Scale: Tackling Hallucinations on Mathematics Tasks | Red Teaming&Hallucinations&Mathematics Tasks |
| 24.01 | Apart Research, University of Edinburgh, Imperial College London, University of Oxford | arxiv | Large Language Models Relearn Removed Concepts | Neuroplasticity&Concept Redistribution |
| 24.01 | Tsinghua University, Xiaomi AI Lab, Huawei, Shenzhen Heytap Technology, vivo AI Lab, Viomi Technology, Li Auto, Beijing University of Posts and Telecommunications, Soochow University | arxiv | Personal LLM Agents: Insights and Survey about the Capability, Efficiency and Security | Intelligent Personal Assistant&LLM Agent&Security and Privacy |
| 24.01 | Zhongguancun Laboratory, Tsinghua University, Institute of Information Engineering Chinese Academy of Sciences, Ant Group | arxiv | Risk Taxonomy, Mitigation, and Assessment Benchmarks of Large Language Model Systems | Safety&Risk Taxonomy&Mitigation Strategies |
| 24.01 | Google Research | arxiv | Patchscopes: A Unifying Framework for Inspecting Hidden Representations of Language Models | Interpretability |
| 24.01 | Ben-Gurion University of the Negev, Israel | arxiv | GPT in Sheep’s Clothing: The Risk of Customized GPTs | GPTs&Cybersecurity&ChatGPT |
| 24.01 | Shanghai Jiao Tong University | arxiv | R-Judge: Benchmarking Safety Risk Awareness for LLM Agents | LLM Agents&Safety Risk Awareness&Benchmark |
| 24.01 | Ant Group | arxiv | A Fast, Performant, Secure Distributed Training Framework for LLM | Distributed LLM&Security |
| 24.01 | Shanghai Artificial Intelligence Laboratory, Dalian University of Technology, University of Science and Technology of China | arxiv | PsySafe: A Comprehensive Framework for Psychological-based Attack, Defense, and Evaluation of Multi-agent System Safety | Multi-agent Systems&Agent Psychology&Safety |
| 24.01 | Rochester Institute of Technology | arxiv | Mitigating Security Threats in LLMs | Security Threats&Prompt Injection&Jailbreaking |
| 24.01 | Johns Hopkins University, University of Pennsylvania, Ohio State University | arxiv | The Language Barrier: Dissecting Safety Challenges of LLMs in Multilingual Contexts | Multilingualism&Safety&Resource Disparity |
| 24.01 | University of Florida | arxiv | Adaptive Text Watermark for Large Language Models | Text Watermarking&Robustness&Security |
| 24.01 | The Hebrew University | arxiv | Tradeoffs Between Alignment and Helpfulness in Language Models | Language Model Alignment&AI Safety&Representation Engineering |
| 24.01 | Google Research, Anthropic | arxiv | Gradient-Based Language Model Red Teaming | Red Teaming&Safety&Prompt Learning |
| 24.01 | National University of Singapore, Pennsylvania State University | arxiv | Provably Robust Multi-bit Watermarking for AI-generated Text via Error Correction Code | Watermarking&Error Correction Code&AI Ethics |
| 24.01 | Tsinghua University, University of California Los Angeles, WeChat AI (Tencent Inc.) | arxiv | Prompt-Driven LLM Safeguarding via Directed Representation Optimization | Safety Prompts&Representation Optimization |
| 24.02 | Rensselaer Polytechnic Institute, IBM T.J. Watson Research Center, IBM Research | arxiv | Adaptive Primal-Dual Method for Safe Reinforcement Learning | Safe Reinforcement Learning&Adaptive Primal-Dual&Adaptive Learning Rates |
| 24.02 | Jagiellonian University, University of Modena and Reggio Emilia, Alma Mater Studiorum University of Bologna, European University Institute | arxiv | No More Trade-Offs: GPT and Fully Informative Privacy Policies | ChatGPT&Privacy Policies&Legal Requirements |
| 24.02 | Florida International University | arxiv | Security and Privacy Challenges of Large Language Models: A Survey | Security&Privacy Challenges&Survey |
| 24.02 | Rutgers University, University of California Santa Barbara, NEC Labs America | arxiv | TrustAgent: Towards Safe and Trustworthy LLM-based Agents through Agent Constitution | LLM-based Agents&Safety&Trustworthiness |
| 24.02 | University of Maryland College Park, JPMorgan AI Research, University of Waterloo, Salesforce Research | arxiv | Shadowcast: Stealthy Data Poisoning Attacks against VLMs | Vision-Language Models&Data Poisoning&Security |
| 24.02 | Shanghai Artificial Intelligence Laboratory, Harbin Institute of Technology, Beijing Institute of Technology, Chinese University of Hong Kong | arxiv | SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models | Safety Benchmark&Safety Evaluation&Hierarchical Taxonomy |
| 24.02 | Fudan University | arxiv | ToolSword: Unveiling Safety Issues of Large Language Models in Tool Learning Across Three Stages | Tool Learning&Large Language Models (LLMs)&Safety Issues&ToolSword |
| 24.02 | Paul G. Allen School of Computer Science & Engineering, University of Washington | arxiv | SPML: A DSL for Defending Language Models Against Prompt Attacks | Domain-Specific Language (DSL)&Chatbot Definitions&System Prompt Meta Language (SPML) |
| 24.02 | Tsinghua University | arxiv | ShieldLM: Empowering LLMs as Aligned, Customizable and Explainable Safety Detectors | Safety Detectors&Customizable&Explainable |
| 24.02 | Dalhousie University | arxiv | Immunization Against Harmful Fine-tuning Attacks | Fine-tuning Attacks&Immunization |
| 24.02 | Chinese Academy of Sciences, University of Chinese Academy of Sciences, Alibaba Group | arxiv | SoFA: Shielded On-the-fly Alignment via Priority Rule Following | Priority Rule Following&Alignment |
| 24.02 | Universidade Federal de Santa Catarina | arxiv | A Survey of Large Language Models in Cybersecurity | Cybersecurity&Vulnerability Assessment |
| 24.02 | Zhejiang University | arxiv | PRSA: Prompt Reverse Stealing Attacks against Large Language Models | Prompt Reverse Stealing Attacks&Security |
| 24.02 | Shanghai Artificial Intelligence Laboratory | NAACL2024 | Attacks, Defenses and Evaluations for LLM Conversation Safety: A Survey | Large Language Models&Conversation Safety&Survey |
| 24.03 | Tulane University | arxiv | Enhancing LLM Safety via Constrained Direct Preference Optimization | Reinforcement Learning&Human Feedback&Safety Constraints |
| 24.03 | University of Illinois Urbana-Champaign | arxiv | InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents | Tool Integration&Security&Indirect Prompt Injection |
| 24.03 | Harvard University | arxiv | Towards Safe and Aligned Large Language Models for Medicine | Medical Safety&Alignment&Ethical Principles |
| 24.03 | Rensselaer Polytechnic Institute, University of Michigan, IBM Research, MIT-IBM Watson AI Lab | arxiv | Aligners: Decoupling LLMs and Alignment | Alignment&Synthetic Data |
| 24.03 | MIT, Princeton University, Stanford University, Georgetown University, AI Risk and Vulnerability Alliance, Eleuther AI, Brown University, Carnegie Mellon University, Virginia Tech, Northeastern University, UCSB, University of Pennsylvania, UIUC | arxiv | A Safe Harbor for AI Evaluation and Red Teaming | AI Evaluation&Red Teaming&Safe Harbor |
| 24.03 | University of Southern California | arxiv | Logits of API-Protected LLMs Leak Proprietary Information | API-Protected LLMs&Softmax Bottleneck&Embedding Size Detection |
| 24.03 | University of Bristol | arxiv | Helpful or Harmful? Exploring the Efficacy of Large Language Models for Online Grooming Prevention | Safety&Prompt Engineering |
| 24.03 | Xiamen University, Yanshan University, IDEA Research, Inner Mongolia University, Microsoft, Microsoft Research Asia | arxiv | Ensuring Safe and High-Quality Outputs: A Guideline Library Approach for Language Models | Safety&Guidelines&Alignment |
| 24.03 | Tianjin University, Zhengzhou University, China Academy of Information and Communications Technology | arxiv | OpenEval: Benchmarking Chinese LLMs across Capability, Alignment, and Safety | Chinese LLMs&Benchmarking&Safety |
| 24.03 | Center for Cybersecurity Systems and Networks, AIShield (Bosch Global Software Technologies, Bengaluru, India) | arxiv | Mapping LLM Security Landscapes: A Comprehensive Stakeholder Risk Assessment Proposal | LLM Security&Threat Modeling&Risk Assessment |
| 24.03 | Queen’s University Belfast | arxiv | AI Safety: Necessary but insufficient and possibly problematic | AI Safety&Transparency&Structural Harm |
| 24.04 | Provable Responsible AI and Data Analytics (PRADA) Lab, King Abdullah University of Science and Technology | arxiv | Dialectical Alignment: Resolving the Tension of 3H and Security Threats of LLMs | Dialectical Alignment&3H Principle&Security Threats |
| 24.04 | LibrAI, Tsinghua University, Harbin Institute of Technology, Monash University, The University of Melbourne, MBZUAI | arxiv | Against The Achilles’ Heel: A Survey on Red Teaming for Generative Models | Red Teaming&Safety |
| 24.04 | University of California Santa Barbara, Meta AI | arxiv | Towards Safety and Helpfulness Balanced Responses via Controllable Large Language Models | Safety&Helpfulness&Controllability |
| 24.04 | School of Information and Software Engineering, University of Electronic Science and Technology of China | arxiv | Exploring Backdoor Vulnerabilities of Chat Models | Backdoor Attacks&Chat Models&Security |
| 24.04 | Enkrypt AI | arxiv | Increased LLM Vulnerabilities from Fine-tuning and Quantization | Fine-tuning&Quantization&LLM Vulnerabilities |
| 24.04 | Tongji University, Tsinghua University, Beijing University of Technology, Nanyang Technological University, Peng Cheng Laboratory | arxiv | Unbridled Icarus: A Survey of the Potential Perils of Image Inputs in Multimodal Large Language Model Security | Multimodal Large Language Models&Security Vulnerabilities&Image Inputs |
| 24.04 | University of Washington, Carnegie Mellon University, University of British Columbia, Vector Institute for AI | arxiv | CulturalTeaming: AI-Assisted Interactive Red-Teaming for Challenging LLMs’ (Lack of) Multicultural Knowledge | AI-Assisted Red-Teaming&Multicultural Knowledge |
| 24.04 | Nanjing University | DLSP 2024 | Subtoxic Questions: Dive Into Attitude Change of LLM’s Response in Jailbreak Attempts | Jailbreak&Subtoxic Questions&GAC Model |
| 24.04 | Innodata | arxiv | Benchmarking Llama2, Mistral, Gemma, and GPT for Factuality, Toxicity, Bias, and Propensity for Hallucinations | Evaluation&Safety |
| 24.04 | University of Cambridge, New York University, ETH Zurich | arxiv | Foundational Challenges in Assuring Alignment and Safety of Large Language Models | Alignment&Safety |
| 24.04 | Zhejiang University | arxiv | TransLinkGuard: Safeguarding Transformer Models Against Model Stealing in Edge Deployment | Intellectual Property Protection&Edge-deployed Transformer Model |
| 24.04 | Harvard University | arxiv | More RLHF, More Trust? On The Impact of Human Preference Alignment On Language Model Trustworthiness | Reinforcement Learning from Human Feedback&Trustworthiness |
| 24.05 | University of Maryland | arxiv | Constrained Decoding for Secure Code Generation | Code Generation&Code LLM&Secure Code Generation&AI Safety |
| 24.05 | Huazhong University of Science and Technology | arxiv | Large Language Models for Cyber Security: A Systematic Literature Review | Cybersecurity&Systematic Review |
| 24.04 | CSIRO’s Data61 | ACM International Conference on AI-powered Software | An AI System Evaluation Framework for Advancing AI Safety: Terminology, Taxonomy, Lifecycle Mapping | AI Safety&Evaluation Framework&AI Lifecycle Mapping |
| 24.05 | CSAIL and CBMM, MIT | arxiv | SecureLLM: Using Compositionality to Build Provably Secure Language Models for Private, Sensitive, and Secret Data | SecureLLM&Compositionality |
| 24.05 | Carnegie Mellon University | arxiv | Human–AI Safety: A Descendant of Generative AI and Control Systems Safety | Human–AI Safety&Generative AI |
| 24.05 | GSAI, POSTECH | arxiv | Adversarial DPO: Harnessing Harmful Data for Reducing Toxicity with Minimal Impact on Coherence and Evasiveness in Dialogue Agents | Adversarial DPO&Reducing Toxicity&Dialogue Agents |

💻Presentations & Talks

📖Tutorials & Workshops

| Date (YY.MM) | Type | Title | URL |
| --- | --- | --- | --- |
| 23.10 | Tutorials | Awesome-LLM-Safety | link |

📰News & Articles

| Date (YY.MM) | Type | Title | URL |
| --- | --- | --- | --- |
| 23.01 | Video | ChatGPT and InstructGPT: Aligning Language Models to Human Intention | link |
| 23.06 | Report | “Dual-use dilemma” for GenAI Workshop Summarization | link |
| 23.10 | News | Joint Statement on AI Safety and Openness | link |

🧑‍🏫Scholars