About me

I am a PhD student in the Department of Electrical and Computer Engineering at Princeton University, advised by Prof. Peter Henderson. Before coming to Princeton, I completed my undergraduate degree at the University of Science and Technology of China (USTC).

I am broadly interested in alignment and other safety-related topics. Feel free to reach out if you are interested in collaborating on research or discussing these topics.

Selected Publications and Manuscripts

[1] Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications [Paper], [Website], [Code]

Boyi Wei*, Kaixuan Huang*, Yangsibo Huang*, Tinghao Xie, Xiangyu Qi, Mengzhou Xia, Prateek Mittal, Mengdi Wang, Peter Henderson


Highlights

  1. We develop methods to identify critical regions that are vital for safety guardrails and that are disentangled from utility-relevant regions, at both the neuron and rank levels.
  2. We find that the isolated regions are sparse, comprising about 3% of parameters and 2.5% of ranks. Removing these regions compromises safety without significantly impacting utility, corroborating the inherent brittleness of the model's safety mechanisms.
  3. We show that the model remains vulnerable to low-cost fine-tuning attacks even when modifications to the safety-critical regions are restricted. These findings underscore the urgent need for more robust safety strategies in LLMs.

[2] Evaluating Copyright Takedown Methods for Language Models

Boyi Wei*, Weijia Shi*, Yangsibo Huang*, Noah A. Smith, Chiyuan Zhang, Luke Zettlemoyer, Kai Li, Peter Henderson


Highlights

  1. We propose an evaluation suite for assessing the feasibility and side effects of copyright takedown methods for language models.
  2. We propose a taxonomy of the causes of undesirable regurgitation and of takedown methods.
  3. We conduct a comprehensive evaluation of 8 off-the-shelf takedown methods and find that none of them excels across all metrics, showing significant room for research in this unique problem setting and indicating potential unresolved challenges for live policy proposals.


News & Talks

  1. [08/2024] πŸŽ™οΈ Gave a talk on assessing the brittleness of safety alignment @ Techbeat (in Chinese).
  2. [07/2024] πŸŽ™οΈ Gave a talk on assessing the brittleness of safety alignment and CoTaEval @ Google.
  3. [05/2024] πŸŽ‰ Our paper "Evaluating Copyright Takedown Methods for Language Models" has been accepted to NeurIPS 2024 Datasets and Benchmarks! See you in Vancouver πŸ‡¨πŸ‡¦.
  4. [05/2024] πŸŽ‰ Our paper "Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications" has been accepted to ICML 2024! See you in Vienna πŸ‡¦πŸ‡Ή.
  5. [03/2024] πŸŽ‰ Our paper "Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications" was selected as the Best Paper of SeT LLM @ ICLR 2024!