Instruction following inside CoT for better Model Organisms
Here you will find posts related to AI Safety and Mechanistic Interpretability.
Investigating Local circuit in the feature basis
I’m excited to share our recent paper on AI Control through Mechanistic Interpretability approaches.
This is the second post in our series dedicated to exploring methods around dictionary learning, and the possibility of a probabilistic take on the disentang...
This is the first in a series of posts where I lay out some intuitions and best practices on how to perform probabilistic dictionary learning on Transformer...
Word Embeddings: A Comprehensive Guide Part 2
Balanced Sentence in GPT-2 Part 1
In the second post on my blog, I will present a brief introduction to the main goals and methods of Mechanistic Interpretability (MI), as well as its history.
Welcome to my blog