Official implementation of AdvPrompter (https://arxiv.org/abs/2404.16873)

Python 114 11 Updated May 6, 2024

Official Code for "Baseline Defenses for Adversarial Attacks Against Aligned Language Models"

Python 18 Updated Oct 26, 2023

JailbreakBench: An Open Robustness Benchmark for Jailbreaking Language Models [NeurIPS 2024 Datasets and Benchmarks Track]

Python 195 20 Updated Sep 26, 2024

Trained models & code to predict toxic comments on all 3 Jigsaw Toxic Comment Challenges. Built using ⚡ Pytorch Lightning and 🤗 Transformers. For access to our API, please email us at contact@unita…

Python 936 114 Updated Sep 19, 2024

An easy-to-use Python framework to generate adversarial jailbreak prompts.

Python 422 37 Updated Sep 2, 2024

A fast + lightweight implementation of the GCG algorithm in PyTorch

Python 91 23 Updated Sep 20, 2024

An unofficial implementation of AutoDAN attack on LLMs (arXiv:2310.15140)

Python 27 7 Updated Feb 8, 2024

Advanced AI Explainability for computer vision. Support for CNNs, Vision Transformers, Classification, Object detection, Segmentation, Image similarity and more.

Python 10,331 1,548 Updated Oct 7, 2024

Jailbreak artifacts for JailbreakBench

35 7 Updated Sep 4, 2024

Implementing the Chain Of Density text summarisation technique from recent NLP research by researchers at Salesforce, MIT, Columbia, etc. Takes a long text input and iteratively generates increasin…

Python 66 7 Updated Oct 8, 2023

[NeurIPS 2024] Large Language Model Unlearning via Embedding-Corrupted Prompts

Python 8 Updated Sep 26, 2024

A resource repository for machine unlearning in large language models

166 7 Updated Oct 3, 2024

[NeurIPS 2023] LLM-Pruner: On the Structural Pruning of Large Language Models. Support Llama-3/3.1, Llama-2, LLaMA, BLOOM, Vicuna, Baichuan, TinyLlama, etc.

Python 839 98 Updated Oct 7, 2024

Official implementation for the paper "DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models"

Python 415 48 Updated Apr 24, 2024
Python 11 Updated Sep 28, 2024

TAP: An automated jailbreaking method for black-box LLMs

Python 109 17 Updated Mar 8, 2024

Code and results accompanying the paper "Refusal in Language Models Is Mediated by a Single Direction".

Python 88 22 Updated Oct 1, 2024

Code & Data for our Paper "Alleviating Hallucinations of Large Language Models through Induced Hallucinations"

Python 57 6 Updated Feb 27, 2024

Code for the paper "Spectral Editing of Activations for Large Language Model Alignments"

Python 13 Updated Jul 12, 2024

Weak-to-Strong Jailbreaking on Large Language Models

Python 62 8 Updated Feb 21, 2024

[ACL'24, Outstanding Paper] Emulated Disalignment: Safety Alignment for Large Language Models May Backfire!

Python 27 Updated Aug 2, 2024

Improving Alignment and Robustness with Circuit Breakers

Jupyter Notebook 135 16 Updated Sep 24, 2024

Adapting LLaMA Decoder to Vision Transformer

Python 25 1 Updated May 20, 2024

Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks [arXiv, Apr 2024]

Shell 190 20 Updated Sep 20, 2024

Package to optimize Adversarial Attacks against (Large) Language Models with Varied Objectives

Python 59 5 Updated Feb 22, 2024

PAL: Proxy-Guided Black-Box Attack on Large Language Models

Python 45 4 Updated Aug 17, 2024

Code for paper "Universal Jailbreak Backdoors from Poisoned Human Feedback"

Python 39 7 Updated Apr 24, 2024

Finding trojans in aligned LLMs. Official repository for the competition hosted at SaTML 2024.

Python 102 9 Updated Jun 13, 2024

A collection of awesome-prompt-datasets and awesome-instruction-dataset for training ChatLLM models such as ChatGPT. Collects a wide variety of instruction datasets for training ChatLLM models.

506 25 Updated Apr 7, 2024