Stars
Official implementation of AdvPrompter (https://arxiv.org/abs/2404.16873)
Official Code for "Baseline Defenses for Adversarial Attacks Against Aligned Language Models"
JailbreakBench: An Open Robustness Benchmark for Jailbreaking Language Models [NeurIPS 2024 Datasets and Benchmarks Track]
Trained models & code to predict toxic comments on all 3 Jigsaw Toxic Comment Challenges. Built using ⚡ Pytorch Lightning and 🤗 Transformers. For access to our API, please email us at contact@unita…
An easy-to-use Python framework to generate adversarial jailbreak prompts.
A fast + lightweight implementation of the GCG algorithm in PyTorch
rotaryhammer / code-autodan
Forked from llm-attacks/llm-attacks. An unofficial implementation of the AutoDAN attack on LLMs (arXiv:2310.15140)
Advanced AI Explainability for computer vision. Support for CNNs, Vision Transformers, Classification, Object detection, Segmentation, Image similarity and more.
Implementing the Chain Of Density text summarisation technique from recent NLP research by researchers at Salesforce, MIT, Columbia, etc. Takes a long text input and iteratively generates increasin…
[NeurIPS 2024] Large Language Model Unlearning via Embedding-Corrupted Prompts
A resource repository for machine unlearning in large language models
[NeurIPS 2023] LLM-Pruner: On the Structural Pruning of Large Language Models. Support Llama-3/3.1, Llama-2, LLaMA, BLOOM, Vicuna, Baichuan, TinyLlama, etc.
Official implementation for the paper "DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models"
TAP: An automated jailbreaking method for black-box LLMs
Code and results accompanying the paper "Refusal in Language Models Is Mediated by a Single Direction".
Code & Data for our Paper "Alleviating Hallucinations of Large Language Models through Induced Hallucinations"
Code for the paper "Spectral Editing of Activations for Large Language Model Alignments"
Weak-to-Strong Jailbreaking on Large Language Models
[ACL'24, Outstanding Paper] Emulated Disalignment: Safety Alignment for Large Language Models May Backfire!
Improving Alignment and Robustness with Circuit Breakers
Adapting LLaMA Decoder to Vision Transformer
Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks [arXiv, Apr 2024]
Package to optimize Adversarial Attacks against (Large) Language Models with Varied Objectives
PAL: Proxy-Guided Black-Box Attack on Large Language Models
Code for paper "Universal Jailbreak Backdoors from Poisoned Human Feedback"
Finding trojans in aligned LLMs. Official repository for the competition hosted at SaTML 2024.
A collection of awesome-prompt-datasets and awesome-instruction-dataset resources for training ChatLLMs such as ChatGPT. Collects a wide variety of instruction datasets for training ChatLLM models.