Stars
Official implementation of AdvPrompter (https://arxiv.org/abs/2404.16873)
Official Code for "Baseline Defenses for Adversarial Attacks Against Aligned Language Models"
JailbreakBench: An Open Robustness Benchmark for Jailbreaking Language Models [NeurIPS 2024 Datasets and Benchmarks Track]
Trained models & code to predict toxic comments on all 3 Jigsaw Toxic Comment Challenges. Built using ⚡ Pytorch Lightning and 🤗 Transformers. For access to our API, please email us at contact@unita…
An easy-to-use Python framework to generate adversarial jailbreak prompts.
A fast + lightweight implementation of the GCG algorithm in PyTorch
rotaryhammer / code-autodan
Forked from llm-attacks/llm-attacks. An unofficial implementation of the AutoDAN attack on LLMs (arXiv:2310.15140)
Advanced AI Explainability for computer vision. Support for CNNs, Vision Transformers, Classification, Object detection, Segmentation, Image similarity and more.
Implementing the Chain Of Density text summarisation technique from recent NLP research by researchers at Salesforce, MIT, Columbia, etc. Takes a long text input and iteratively generates increasin…
[NeurIPS 2024] Large Language Model Unlearning via Embedding-Corrupted Prompts
A resource repository for machine unlearning in large language models
[NeurIPS 2023] LLM-Pruner: On the Structural Pruning of Large Language Models. Support Llama-3/3.1, Llama-2, LLaMA, BLOOM, Vicuna, Baichuan, TinyLlama, etc.
Official implementation for the paper "DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models"
TAP: An automated jailbreaking method for black-box LLMs
Code and results accompanying the paper "Refusal in Language Models Is Mediated by a Single Direction".
Code & Data for our Paper "Alleviating Hallucinations of Large Language Models through Induced Hallucinations"
Code for the paper "Spectral Editing of Activations for Large Language Model Alignments"
Weak-to-Strong Jailbreaking on Large Language Models
[ACL'24, Outstanding Paper] Emulated Disalignment: Safety Alignment for Large Language Models May Backfire!
Improving Alignment and Robustness with Circuit Breakers
Adapting LLaMA Decoder to Vision Transformer
Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks [arXiv, Apr 2024]
Package to optimize Adversarial Attacks against (Large) Language Models with Varied Objectives
PAL: Proxy-Guided Black-Box Attack on Large Language Models
Code for paper "Universal Jailbreak Backdoors from Poisoned Human Feedback"
Finding trojans in aligned LLMs. Official repository for the competition hosted at SaTML 2024.
A collection of awesome-prompt-datasets and awesome-instruction-dataset resources for training ChatLLMs such as ChatGPT. Collects a wide variety of instruction datasets for training ChatLLM models.