AI Safety

Here you will find posts related to AI Safety and Mechanistic Interpretability.

2026

2026-01-14 — Few-Shot Awareness in Reasoning Models

2025

2025-09-02 — Instruction following inside CoT for better Model Organisms
2025-08-25 — Towards Mitigating Information Leakage When Evaluating Safety Monitors
2025-02-15 — AISC: Probing for Deception Detection

2024

2024-11-22 — Interpretability Hackathon
2024-10-27 — AI Policy Hackathon
2024-10-03 — Mechanistic Exploration Gemma 2 List Generation
2024-09-29 — AI Safety Fundamental Final Project
2024-03-19 — AI Control with Mechanistic Interpretability
2024-03-03 — Towards a Probabilistic Disentanglement of Transformer Activations Part 2
2024-02-28 — Towards a Probabilistic Disentanglement of Transformer Activations Part 1

2023

2023-10-15 — Word Embeddings: A Comprehensive Guide Part 2
2023-09-22 — Word Embeddings: A Comprehensive Guide Part 1
2023-08-23 — Balanced Sentence Part 1
2023-05-19 — Introduction to Mechanistic Interpretability
2023-05-19 — My first post