Next events

Wednesday, December 17, 2025, 15:00 CET
Click here to register!
A mechanistic view of refusal in language models
Speaker: Andy Arditi (Northeastern University)

Andy is a PhD student at Northeastern University, advised by David Bau, where he researches mechanistic interpretability and the internal structure of large language models. His recent work focuses on applying mechanistic interpretability to practical problems, such as understanding model refusals, detecting hallucinations, and uncovering linear representations of persona in chat assistants. Before his PhD, he earned bachelor’s and master’s degrees from Columbia University and worked as a software engineer at Microsoft.

Refusal is currently the primary mechanism by which LLM developers prevent misuse: models are trained to decline harmful or inappropriate requests. We study this critical behavior through a mechanistic lens, asking how refusal is implemented internally, and we show that it is largely mediated by a single direction in activation space. This simple observation enables several applications, including “weight orthogonalization,” a cheap and effective method for disabling refusal guardrails in open-source models. It has also proven useful for implementing more efficient red-teaming and adversarial-training pipelines. In this talk, I will present these findings, discuss follow-up work by other groups, and outline the broader implications for open-source model development.
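For the curious, here is a minimal sketch of the weight-orthogonalization idea described in the abstract, assuming a refusal direction has already been extracted (for example, as a difference of mean activations between harmful and harmless prompts). The function name, tensor shapes, and usage are illustrative assumptions, not the speaker's implementation:

```python
import torch

def orthogonalize_weights(W: torch.Tensor, r: torch.Tensor) -> torch.Tensor:
    """Ablate a single direction from a weight matrix that writes into
    the residual stream, so the modified matrix can no longer output
    any component along that direction.

    W: weight matrix of shape (d_model, d_in), whose outputs live in
       residual-stream space (hypothetical shapes).
    r: candidate refusal direction of shape (d_model,).
    """
    r_hat = r / r.norm()  # unit-norm refusal direction
    # W' = W - r_hat (r_hat^T W): for any input x, W' @ x is orthogonal
    # to r_hat, so this layer can never write along the ablated direction.
    return W - torch.outer(r_hat, r_hat @ W)


# Illustrative usage on a random matrix:
d_model, d_in = 512, 2048
W = torch.randn(d_model, d_in)
r = torch.randn(d_model)
W_ablated = orthogonalize_weights(W, r)
# The ablated matrix writes nothing along r: this projection is ~0.
print((r / r.norm()) @ W_ablated)
```

In practice one would apply such a projection to every matrix that writes into the residual stream, which is why the edit is cheap: it changes weights once, with no fine-tuning.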