I currently work on mechanistic interpretability in industry.
I received my PhD in computer science from MIT, where I was advised by Prof. Aleksander Madry. Before MIT, I completed Part III of the Mathematical Tripos at the University of Cambridge, with coursework in combinatorics and algebra. Before that, I earned a BA with a joint concentration in math and computer science at Harvard College, where I worked with Prof. Salil Vadhan.
I'm also broadly interested in researching, designing, and implementing tools that make work easier for scientists and practitioners in computational fields. As part of this, I used to work on mandala, a Python library that simplifies scientific data management.
Google Scholar | Semantic Scholar | Twitter
Persona Features Control Emergent Misalignment
M. Wang*, T. Dupré la Tour*, O. Watkins*, A. Makelov*, R. Chi*, S. Miserendino, J. Wang, A. Rajaram, J. Heidecke, T. Patwardhan, D. Mossing*
arXiv preprint
Sparse Autoencoders Match Supervised Features for Model Steering on the IOI Task
A. Makelov
Spotlight, ICML 2024 Workshop on Mechanistic Interpretability
See also: AI Alignment Forum post
Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control
A. Makelov*, G. Lange*, N. Nanda
ICLR 2025
mandala: Compositional Memoization for Simple & Powerful Scientific Data Management
A. Makelov
SciPy 2024 Proceedings
Is This the Subspace You Are Looking for? An Interpretability Illusion for Subspace Activation Patching
A. Makelov*, G. Lange*, N. Nanda
ICLR 2024
Backdoor or Feature? A New Perspective on Data Poisoning
A. Khaddaj*, G. Leclerc*, A. Makelov*, K. Georgiev, A. Ilyas, H. Salman, A. Madry
ICML 2023
Towards Deep Learning Models Resistant to Adversarial Attacks
A. Madry, A. Makelov, L. Schmidt, D. Tsipras, A. Vladu
ICLR 2018
Expansion in Lifts of Graphs
A. Makelov
Undergraduate Thesis, Harvard College 2015
Practical dependency tracking for Python function calls (June '23)
Mandala: Python programs that save, query and version themselves (April '23)