Alex Makelov

I currently work on mechanistic interpretability in industry.

I got my PhD in computer science from MIT, where I was advised by Prof. Aleksander Madry. Before MIT, I did Part III of the Mathematical Tripos at the University of Cambridge, with coursework in combinatorics and algebra. Before that, I earned a BA with a joint concentration in math and computer science at Harvard College, where I worked with Prof. Salil Vadhan.

I'm also broadly interested in the research, design, and implementation of tools that make work easier for scientists and practitioners in computational fields. As part of this, I used to work on mandala, a Python library to simplify scientific data management.

Google Scholar | Semantic Scholar | Twitter

Publications

Persona Features Control Emergent Misalignment
M. Wang*, T. Dupré la Tour*, O. Watkins*, A. Makelov*, R. Chi*, S. Miserendino, J. Wang, A. Rajaram, J. Heidecke, T. Patwardhan, D. Mossing*
arXiv preprint

Sparse Autoencoders Match Supervised Features for Model Steering on the IOI Task
A. Makelov
Spotlight, ICML 2024 Workshop on Mechanistic Interpretability
See also: AI Alignment Forum post

Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control
A. Makelov*, G. Lange*, N. Nanda
ICLR 2025

mandala: Compositional Memoization for Simple & Powerful Scientific Data Management
A. Makelov
SciPy 2024 Proceedings

Is This the Subspace You Are Looking for? An Interpretability Illusion for Subspace Activation Patching
A. Makelov*, G. Lange*, N. Nanda
ICLR 2024

Backdoor or Feature? A New Perspective on Data Poisoning
A. Khaddaj*, G. Leclerc*, A. Makelov*, K. Georgiev, A. Ilyas, H. Salman, A. Madry
ICML 2023

Towards Deep Learning Models Resistant to Adversarial Attacks
A. Madry, A. Makelov, L. Schmidt, D. Tsipras, A. Vladu
ICLR 2018

Expansion in Lifts of Graphs
A. Makelov
Undergraduate Thesis, Harvard College 2015

Blog

Practical dependency tracking for Python function calls (June '23)

Mandala: Python programs that save, query and version themselves (April '23)