About me
I am a PhD student in the Department of Electrical and Computer Engineering at Princeton University, advised by Prof. Peter Henderson. Before coming to Princeton, I completed my undergraduate degree at the University of Science and Technology of China (USTC).
I am broadly interested in alignment and other safety-related topics. Feel free to reach out if you are interested in collaborating on research or discussing these topics.
Selected Publications and Manuscripts
[1] Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications [Paper], [Website], [Code]
Boyi Wei*, Kaixuan Huang*, Yangsibo Huang*, Tinghao Xie, Xiangyu Qi, Mengzhou Xia, Prateek Mittal, Mengdi Wang, Peter Henderson
Highlights
- We develop methods to identify critical regions that are vital for safety guardrails, and that are disentangled from utility-relevant regions at both the neuron and rank levels.
- We find the isolated regions are sparse, comprising about 3% at the parameter level and 2.5% at the rank level. Removing these regions compromises safety without significantly impacting utility, corroborating the inherent brittleness of the model's safety mechanisms.
- We show that the model remains vulnerable to low-cost fine-tuning attacks even when modifications to the safety-critical regions are restricted. These findings underscore the urgent need for more robust safety strategies in LLMs.
[2] Evaluating Copyright Takedown Methods for Language Models [Paper], [Website], [Code], [Dataset], [Leaderboard]
Boyi Wei*, Weijia Shi*, Yangsibo Huang*, Noah A. Smith, Chiyuan Zhang, Luke Zettlemoyer, Kai Li, Peter Henderson
Highlights
- We propose an evaluation suite to assess the feasibility and side effects of copyright takedown methods for language models.
- We propose a taxonomy of causes of undesirable regurgitation and takedown methods.
- We conduct a comprehensive evaluation of 8 off-the-shelf takedown methods and find that none of them excels across all metrics. This shows significant room for research in this unique problem setting and indicates potential unresolved challenges for live policy proposals.
News & Talks
- [05/2024] Our Paper: Evaluating Copyright Takedown Methods for Language Models has been accepted to NeurIPS 2024 Datasets and Benchmarks! See you in Vancouver 🇨🇦.
- [08/2024] Gave a talk about assessing the brittleness of safety alignment @ Techbeat (in Chinese).
- [07/2024] Gave a talk about assessing the brittleness of safety alignment and CoTaEval @ Google.
- [05/2024] Our Paper: Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications has been accepted to ICML 2024! See you in Vienna 🇦🇹.
- [03/2024] Our Paper: Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications has been selected as the Best Paper of SeT LLM @ ICLR 2024!