Upcoming:
Transformer-specific Interpretability
- A tutorial at EACL’24, Malta
- March 21, 2024; 3 hours + break (14:00-17:30)
- Hosein Mohebbi, Jaap Jumelet, Michael Hanna, Afra Alishahi and Willem Zuidema
- GitHub site (with notebooks): https://github.com/interpretingdl/eacl2024_transformer_interpretability_tutorial
Transformers have emerged as dominant models across many scientific fields, especially NLP. However, their inner workings, like those of many other neural networks, remain opaque. Despite the widespread use of model-agnostic interpretability techniques, such as gradient-based and occlusion-based methods, their shortcomings when applied to Transformers are becoming increasingly apparent, making interpretability research more demanding than ever. In this tutorial, we present Transformer-specific interpretability methods: a newer, increasingly popular family of approaches that exploit specific features of the Transformer architecture and are deemed more promising for understanding Transformer-based models. We start by discussing the pitfalls and misleading results that model-agnostic approaches can produce when interpreting Transformers. Next, we discuss Transformer-specific methods, including those designed to quantify context mixing, the interaction among all pairs of input tokens that is the defining property of the Transformer architecture, and those that combine causal methods with low-level analysis of Transformer components to identify the subnetworks within a model that are responsible for specific tasks. By the end of the tutorial, we hope participants will understand the advantages, as well as the current limitations, of Transformer-specific interpretability methods and how these methods can be applied in their own research.
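To make the model-agnostic starting point concrete, below is a minimal sketch of a gradient-based feature attribution (gradient × input) for a Transformer classifier. It is not part of the tutorial materials: the sentiment checkpoint, the example sentence, and the choice of gradient × input (as a stand-in for the broader family of saliency methods) are illustrative assumptions.

```python
# Minimal gradient x input saliency sketch (illustrative, not tutorial material).
# Any Hugging Face sequence-classification checkpoint works the same way.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "distilbert-base-uncased-finetuned-sst-2-english"  # example checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

inputs = tokenizer("The movie was surprisingly good.", return_tensors="pt")

# Embed the tokens ourselves so we can take gradients w.r.t. the embeddings.
embeddings = model.get_input_embeddings()(inputs["input_ids"]).detach()
embeddings.requires_grad_(True)

logits = model(inputs_embeds=embeddings, attention_mask=inputs["attention_mask"]).logits
predicted_class = logits.argmax(dim=-1).item()
logits[0, predicted_class].backward()

# Gradient x input, summed over the embedding dimension: one score per token.
scores = (embeddings.grad * embeddings).sum(dim=-1).squeeze(0)
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, score in zip(tokens, scores.tolist()):
    print(f"{token:>12}  {score:+.4f}")
```

The per-token scores should be read with the caveats raised in the first lecture: attribution methods of this kind can disagree with one another and may not faithfully reflect what the model actually uses.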
Schedule:
- 14:00-14:30: lecture on model-agnostic interpretability:
- Introduction
- Model-agnostic approaches: probing, feature attributions, behavioral studies
- How are model-agnostic approaches adapted to Transformers? What are their limitations?
- 14:30-15:00: lecture on the analysis of context mixing in Transformers:
- An overview of the mathematics of Transformers
- Attention analysis and its limitations
- Measures of context mixing: expanding the scope of the analysis beyond attention (see the attention-rollout sketch after the schedule)
- 15:00-15:30: interactive notebook session on interpreting context mixing
- 15:30-16:00: coffee break
- 16:00-16:30: lecture on mechanistic and causality-based interpretability:
- Basics of mechanistic interpretability: the residual stream and computational graph views of models, and the circuits framework
- Finding circuit structure using causal interventions (see the activation-patching sketch after the schedule)
- Assigning semantics to circuit components using the logit lens (see the logit-lens sketch after the schedule)
- 16:30-17:00: interactive notebook session on mechanistic interpretability
- 17:00-17:30: Open discussion, reflection, and future outlook: what are open questions in interpretability, what’s next, and what’s lacking?
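To give a flavour of the hands-on sessions, here is a rough, self-contained sketch of attention rollout in the spirit of Abnar & Zuidema (2020), referenced in the context-mixing lecture above. The BERT checkpoint, the example sentence, and the simplified recipe (averaging heads, modelling the residual connection as 0.5 * identity) are illustrative assumptions rather than the tutorial's own implementation.

```python
# Attention-rollout sketch (simplified recipe; see Abnar & Zuidema, 2020).
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "bert-base-uncased"  # example checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, output_attentions=True)
model.eval()

inputs = tokenizer("The keys to the cabinet are on the table.", return_tensors="pt")
with torch.no_grad():
    attentions = model(**inputs).attentions  # one [batch, heads, seq, seq] tensor per layer

seq_len = inputs["input_ids"].size(1)
rollout = torch.eye(seq_len)
for layer_attention in attentions:
    attn = layer_attention[0].mean(dim=0)          # average over heads
    attn = 0.5 * attn + 0.5 * torch.eye(seq_len)   # fold in the residual connection
    attn = attn / attn.sum(dim=-1, keepdim=True)   # re-normalize rows
    rollout = attn @ rollout                       # compose with the layers below

# Row i estimates how much each input token contributes to position i at the top layer.
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for i, token in enumerate(tokens):
    top_values, top_indices = rollout[i].topk(3)
    contributors = ", ".join(
        f"{tokens[j]}:{v:.2f}" for v, j in zip(top_values.tolist(), top_indices.tolist())
    )
    print(f"{token:>10} <- {contributors}")
```

Rollout still treats raw attention weights as the whole story; the norm- and decomposition-based measures listed under Part 2 below refine this by also accounting for the value transformations and other components of each layer.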
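In the same spirit, here is a simplified logit-lens sketch for GPT-2: decode the residual stream after each block through the model's final layer norm and unembedding, and watch the next-token prediction take shape across layers. Using plain Hugging Face GPT-2 (rather than a dedicated interpretability library) and this particular prompt are illustrative choices.

```python
# Logit-lens sketch: decode GPT-2's residual stream at every layer.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2", output_hidden_states=True)
model.eval()

inputs = tokenizer("The Eiffel Tower is located in the city of", return_tensors="pt")
with torch.no_grad():
    hidden_states = model(**inputs).hidden_states  # embeddings + one entry per block

    for layer, hidden in enumerate(hidden_states):
        residual = hidden[0, -1]  # residual stream at the final position
        # Project through the final layer norm and the unembedding matrix.
        layer_logits = model.lm_head(model.transformer.ln_f(residual))
        top_token = tokenizer.decode(layer_logits.argmax().item())
        print(f"layer {layer:2d}: top next-token prediction = {top_token!r}")
```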
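Finally, a bare-bones activation-patching sketch in the style of the causal-intervention work listed under Part 3: run a clean prompt and a same-length corrupted prompt, copy one block's output from the clean run into the corrupted run, and check how much of the clean prediction is restored. The prompt pair, the patched layer, and the single-logit metric are arbitrary illustrative choices; real circuit analyses patch at finer granularity (per position, per head) and average over many prompt pairs.

```python
# Activation-patching sketch on GPT-2 (coarse: one whole block, all positions).
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

# Clean and corrupted prompts differ in one token and have identical length.
clean = tokenizer("When John and Mary went to the store, Mary gave a bottle to",
                  return_tensors="pt")
corrupt = tokenizer("When John and Mary went to the store, John gave a bottle to",
                    return_tensors="pt")
target_id = tokenizer(" John")["input_ids"][0]  # the clean run's correct continuation

layer_to_patch = 8  # arbitrary illustrative choice
cache = {}

def save_hook(module, hook_inputs, output):
    cache["clean"] = output[0].detach()  # block output: [batch, seq, hidden]

def patch_hook(module, hook_inputs, output):
    return (cache["clean"],) + output[1:]  # swap in the clean block output

def target_logit(logits):
    return logits[0, -1, target_id].item()

with torch.no_grad():
    block = model.transformer.h[layer_to_patch]

    handle = block.register_forward_hook(save_hook)    # clean run: record activation
    clean_logits = model(**clean).logits
    handle.remove()

    corrupt_logits = model(**corrupt).logits            # corrupted run, unpatched

    handle = block.register_forward_hook(patch_hook)    # corrupted run, patched
    patched_logits = model(**corrupt).logits
    handle.remove()

print(f"clean     logit(' John') = {target_logit(clean_logits):.2f}")
print(f"corrupted logit(' John') = {target_logit(corrupt_logits):.2f}")
print(f"patched   logit(' John') = {target_logit(patched_logits):.2f}")
```

If patching this block largely restores the clean prediction, the block is a candidate component of the circuit responsible for the behaviour; the lecture covers how such interventions are combined into full circuit discovery.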
Mailing list
If you’re interested in receiving updates (follow-ups, corrections, etc.) on the tutorial, please sign up for our tutorial mailing list.
Suggested Readings
Part 1: Model-agnostic Interpretability
General Interpretability
- Doshi-Velez & Kim (2017) – Towards a rigorous science of interpretable machine learning
- Lipton (WHI 2016) – The Mythos of Model Interpretability
Feature Attributions
- Lundberg & Lee (NeurIPS 2017) – A Unified Approach to Interpreting Model Predictions
- Sundararajan & Najmi (ICML 2020) – The many Shapley values for model explanation
- Chen et al. (2020) – True to the Model or True to the Data?
- Covert et al. (JMLR 2021) – Explaining by removing: a unified framework for model explanation
Probing
- Hewitt & Manning (EMNLP-IJCNLP 2019) – Designing and Interpreting Probes with Control Tasks
- Voita & Titov (EMNLP 2020) – Information-Theoretic Probing with Minimum Description Length
- Elazar et al. (TACL 2021) – Amnesic Probing: Behavioral Explanation with Amnesic Counterfactuals
- Pimentel et al. (ACL 2020) – Information-Theoretic Probing for Linguistic Structure
- Jumelet et al. (ACL 2021) – Language Models Use Monotonicity to Assess NPI Licensing
- White et al. (NAACL 2021) – A Non-Linear Structural Probe
Faithfulness
- McCoy et al. (ACL 2019) – Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference
- Hao (BlackboxNLP 2020) – Evaluating attribution methods using white-box LSTMs
- Pruthi et al. (TACL 2022) – Evaluating Explanations: How Much Do Explanations from the Teacher Aid Students?
- Bastings et al. (EMNLP 2022) – “Will You Find These Shortcuts?” A Protocol for Evaluating the Faithfulness of Input Salience Methods for Text Classification
- Madsen et al. (EMNLP 2022) – Evaluating the Faithfulness of Importance Measures in NLP by Recursively Masking Allegedly Important Tokens and Retraining
- Jumelet & Zuidema (ACL 2023) – Feature Interactions Reveal Linguistic Structure in Language Models
On the limitations of general-purpose interpretability methods
- Sixt et al. (ICML 2020) – When Explanations Lie: Why Many Modified BP Attributions Fail
- Atanasova et al. (EMNLP 2020) – A Diagnostic Study of Explainability Techniques for Text Classification
- Neely et al. (2022) – A Song of (Dis)agreement: Evaluating the Evaluation of Explainable Artificial Intelligence in Natural Language Processing
- Krishna et al. (2022) – The Disagreement Problem in Explainable Machine Learning: A Practitioner’s Perspective
- Bilodeau et al. (PNAS 2024) – Impossibility Theorems for Feature Attribution
Part 2: Context Mixing
Limitations of Attention
- Clark et al. (BlackboxNLP 2019) – What Does BERT Look at? An Analysis of BERT’s Attention
- Bastings & Filippova (BlackboxNLP 2020) – The elephant in the interpretability room: Why use attention as explanation when we have saliency methods?
- Bibal et al. (ACL 2022) – Is Attention Explanation? An Introduction to the Debate
- Hassid et al. (Findings of EMNLP 2022) – How Much Does Attention Actually Attend? Questioning the Importance of Attention in Pretrained Transformers
- Bondarenko et al. (2023) – Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing
Measures of context mixing beyond attention
- Abnar & Zuidema (ACL 2020) – Quantifying Attention Flow in Transformers
- Kobayashi et al. (EMNLP 2020) – Attention is Not Only a Weight: Analyzing Transformers with Vector Norms
- Kobayashi et al. (EMNLP 2021) – Incorporating Residual and Normalization Layers into Analysis of Masked Language Models
- Ferrando et al. (EMNLP 2022) – Measuring the Mixing of Contextual Information in the Transformer
- Modarressi et al. (NAACL 2022) – GlobEnc: Quantifying Global Token Attribution by Incorporating the Whole Encoder Layer in Transformers
- Mohebbi et al. (EACL 2023) – Quantifying Context Mixing in Transformers
- Chefer et al. (CVPR 2021) – Transformer Interpretability Beyond Attention Visualization
- Ferrando et al. (ACL 2023) – Explaining How Transformers Use Context to Build Predictions
- Modarressi et al. (ACL 2023) – DecompX: Explaining Transformers Decisions by Propagating Token Decomposition
- Mohebbi et al. (EMNLP 2023) – Homophone Disambiguation Reveals Patterns of Context Mixing in Speech Transformers
- Kobayashi et al. (ICLR spotlight 2024) – Analyzing Feed-Forward Blocks in Transformers through the Lens of Attention Map
Part 3: Mechanistic Interpretability / Circuits
Circuits:
- Wang et al. (ICLR 2023) – Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small
- Hanna et al. (NeurIPS 2023) – How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model
- Prakash et al. (ICLR 2024) – Fine-Tuning Enhances Existing Mechanisms: A Case Study on Entity Tracking
- Merullo et al. (ICLR 2024) – Circuit Component Reuse Across Tasks in Transformer Language Models
- Nanda et al. (ICLR 2023) – Progress measures for grokking via mechanistic interpretability
Automated circuit finding:
- Conmy et al. (NeurIPS 2023) – Towards Automated Circuit Discovery for Mechanistic Interpretability
- Edge Attribution Patching:
- Blog post by Nanda (2023) – Attribution Patching: Activation Patching At Industrial Scale
- Syed et al. (NeurIPS 2023 ATTRIB Workshop) – Attribution Patching Outperforms Automated Circuit Discovery
- Kramár et al. (2024) – AtP*: An efficient and scalable method for localizing LLM behaviour to components
- EAP-IG (coming soon!)
Studies of individual components:
- Gould et al. (ICLR 2024) – Successor Heads: Recurring, Interpretable Attention Heads In The Wild
- McDougall et al. (2024) – Copy Suppression: Comprehensively Understanding an Attention Head
Mechanistic Interpretability Methods:
- Vig et al. (NeurIPS 2020) – Causal Mediation Analysis for Interpreting Neural NLP: The Case of Gender Bias
- Geiger et al. (NeurIPS 2021) – Causal Abstractions of Neural Networks
- Goldowsky-Dill et al. (2023) – Localizing Model Behavior with Path Patching
- Makelov et al. (ICLR 2024) – Is This the Subspace You Are Looking for? An Interpretability Illusion for Subspace Activation Patching
- Wu et al. (2024) – A Reply to Makelov et al. (2023)’s “Interpretability Illusion” Arguments
- Zhang and Nanda (ICLR 2024) – Towards Best Practices of Activation Patching in Language Models: Metrics and Methods
- Chan et al. (2022) – Rigorously Testing Interpretability Hypotheses Using Causal Scrubbing
Anthropic’s Transformer Circuits Thread:
- Elhage et al. (2021) – A Mathematical Framework for Transformer Circuits
- Olsson et al. (2022) – In-context Learning and Induction Heads
- Elhage et al. (2022) – Toy Models of Superposition
- See all work in the thread here
Other mechanistic work:
- Meng et al. (NeurIPS 2022) – Locating and Editing Factual Associations in GPT
- Hase et al. (NeurIPS 2023) – Does Localization Inform Editing? Surprising Differences in Causality-Based Localization vs. Knowledge Editing in Language Models
- Todd et al. (ICLR 2024) – Function Vectors in Large Language Models
- Li et al. (NeurIPS 2022) – Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task
- Follow-up blog post by Nanda (2022) – Actually, Othello-GPT Has A Linear Emergent World Representation
- Olah / Cammarata et al. (2020) – Circuits Thread