


Transformer-specific Interpretability

Transformers have become the dominant architecture in many scientific fields, especially NLP. However, their inner workings, like those of many other neural networks, remain opaque. Model-agnostic interpretability techniques, such as gradient-based and occlusion-based feature attribution, are widely used, but their shortcomings for interpreting Transformers are becoming increasingly apparent. In this tutorial, we present Transformer-specific interpretability methods: an emerging family of approaches that exploit specific features of the Transformer architecture and appear more promising for understanding Transformer-based models. We start by discussing the pitfalls and misleading results that model-agnostic approaches can produce when applied to Transformers. Next, we discuss Transformer-specific methods, including those designed to quantify context-mixing interactions among all input pairs (the fundamental property of the Transformer architecture) and those that combine causal methods with low-level Transformer analysis to identify particular subnetworks within a model that are responsible for specific tasks. By the end of the tutorial, we hope participants will understand the advantages, as well as the current limitations, of Transformer-specific interpretability methods and how to apply them in their own research.


  1. 14:00-14:30: lecture on model-agnostic interpretability:
    • Introduction
    • Model-agnostic approaches: probing, feature attributions, behavioral studies
    • How are model-agnostic approaches adapted to Transformers? What are their limitations?
  2. 14:30-15:00: lecture on the analysis of context mixing in Transformers:
    • An overview of mathematics in Transformers
    • Attention analysis and its limitations
    • Measures of context mixing: expanding the scope of the analysis beyond attention
  3. 15:00-15:30: interactive notebook session on interpreting context mixing
  4. 15:30-16:00: coffee break
  5. 16:00-16:30: lecture on mechanistic and causality-based interpretability:
    • Basics of mechanistic interpretability: the residual stream and computational graph views of models, and the circuits framework
    • Finding circuit structure using causal interventions
    • Assigning semantics to circuit components using the logit lens
  6. 16:30-17:00: interactive notebook session on mechanistic interpretability
  7. 17:00-17:30: open discussion, reflection, and future outlook: what are open questions in interpretability, what’s next, and what’s lacking?
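As a concrete reference point for the model-agnostic part of the schedule, here is a minimal sketch of occlusion-based feature attribution. The scoring function below is a made-up stand-in for a real model, and the token weights are hypothetical:

```python
# Toy illustration of occlusion-based attribution (model-agnostic):
# the importance of a token is the score drop when that token is masked out.

def model_score(tokens):
    # Hypothetical sentiment scorer standing in for a real model:
    # each known token contributes a fixed weight to the score.
    weights = {"great": 2.0, "movie": 0.5, "not": -1.5, "boring": -2.0}
    return sum(weights.get(t, 0.0) for t in tokens)

def occlusion_attributions(tokens, mask_token="[MASK]"):
    """Attribution of token i = baseline score minus score with token i occluded."""
    baseline = model_score(tokens)
    attributions = []
    for i in range(len(tokens)):
        occluded = tokens[:i] + [mask_token] + tokens[i + 1:]
        attributions.append(baseline - model_score(occluded))
    return attributions

print(occlusion_attributions(["a", "great", "movie"]))  # -> [0.0, 2.0, 0.5]
```

With a real Transformer, `model_score` would be replaced by, e.g., the probability of a target class, which is exactly where the limitations discussed in Part 1 arise.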
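For the context-mixing part, one simple way to look beyond single-layer attention maps is attention rollout, which multiplies attention matrices across layers while accounting for residual connections. A minimal sketch in plain Python, with hand-picked toy attention matrices:

```python
# Minimal attention rollout: approximate token-to-token information flow by
# composing per-layer attention maps, mixing in an identity term for the
# residual connection at each layer.

def normalize_rows(m):
    return [[v / sum(row) for v in row] for row in m]

def matmul(a, b):
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def attention_rollout(layers):
    """layers: list of attention matrices (rows = queries, cols = keys)."""
    n = len(layers[0])
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    rollout = identity
    for attn in layers:
        # Residual connection: average attention with identity, renormalize.
        mixed = normalize_rows([[0.5 * attn[i][j] + 0.5 * identity[i][j]
                                 for j in range(n)] for i in range(n)])
        rollout = matmul(mixed, rollout)
    return rollout

layer1 = [[0.9, 0.1], [0.4, 0.6]]   # toy 2-token attention maps
layer2 = [[0.5, 0.5], [0.2, 0.8]]
rollout = attention_rollout([layer1, layer2])
# Each row of the rollout is a distribution over input tokens (sums to 1).
```

This is only one of the context-mixing measures covered in Part 2; its own assumptions (e.g., treating value transformations as uniform) motivate the more refined measures discussed there.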
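For the mechanistic part, the logit lens projects an intermediate hidden state through the model's unembedding matrix to inspect which tokens that state already encodes. A toy sketch with a made-up two-dimensional residual stream and three-token vocabulary:

```python
import math

# Toy logit lens: dot an intermediate residual-stream state with each row of
# the unembedding matrix, then softmax to get a token distribution.
# The vocabulary and vectors here are invented for illustration.

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def logit_lens(hidden, unembedding, vocab):
    """Return a token -> probability dict for one hidden state."""
    logits = [sum(h * w for h, w in zip(hidden, row)) for row in unembedding]
    return dict(zip(vocab, softmax(logits)))

vocab = ["cat", "dog", "car"]
unembedding = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # one row per vocab token
hidden = [2.0, 0.1]                                  # mid-layer residual state
probs = logit_lens(hidden, unembedding, vocab)
print(max(probs, key=probs.get))  # -> cat
```

Applying this at every layer of a real model shows how the next-token prediction gradually takes shape, which is how the lens is used to assign semantics to circuit components in Part 3.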

Mailing list

If you’re interested in receiving updates (follow-ups, corrections, etc.) on the tutorial, please sign up for our tutorial mailing list.

Suggested Readings

Part 1: Model-agnostic Interpretability

General Interpretability

Feature Attributions



On the limitations of general-purpose interpretability methods

Part 2: Context Mixing

Limitations of Attention

Measures of context-mixing beyond attention

Part 3: Mechanistic Interpretability / Circuits


Automated circuit finding:

Studies of individual components:

Mechanistic Interpretability Methods:

Anthropic’s Transformer Circuits Thread:

Other mechanistic work: