10  Interpretability and Explainability

Week 10

Lecture: Interpreting and explaining machine learning and LLM outputs
Lab: Applying explainability methods to language models

10.1 Lecture

This week addresses a fundamental challenge in machine learning and LLM research: understanding why models make the decisions they do. While some model types are inherently interpretable, many of the most powerful models — including transformer-based language models — operate as “black boxes,” and this lack of interpretability has real consequences.

10.1.1 The challenge of interpretation

In machine learning, there is a fundamental tension between performance and interpretability. Models like decision trees or linear regression are inherently interpretable, allowing us to trace exactly how inputs lead to outputs. Deep learning models, by contrast, often achieve higher performance but at the cost of transparency. This matters because systems that lack interpretability may fail to generalise or may achieve their goals through unintended means.

A well-known example involves an image classifier that was trained to distinguish huskies from wolves. When the model was examined using explainability techniques, researchers found it had learned to associate wolves with snow in the background rather than with the actual features of the animal (Ribeiro, Singh, and Guestrin 2016). Similarly, a pneumonia identification model was found to have learned associations between certain scan features and specific hospitals that had relatively higher or lower rates of pneumonia cases, rather than learning to identify pneumonia itself (Zech et al. 2018). Without interpretability tools, these failures would have gone undetected.

With language models specifically, their sensitivity to inputs presents unique challenges. Minor changes to prompts can lead to dramatically different outputs, creating a need for robust explanation methods to understand these behaviours.

10.1.2 Why interpret or explain?

Explanation serves multiple critical functions. For model developers, explanations help with debugging errors and biases by identifying the sources of problems, enabling targeted interventions rather than wholesale retraining. Explanation also supports targeted model improvements and helps in understanding capabilities and limitations — clarifying when and why models succeed or fail. It is essential for explaining hallucinations and uncertainty, helping users calibrate their trust appropriately. In applied settings, explanation is critical for ensuring accountability in AI-assisted decisions, and from a legal perspective, the EU’s General Data Protection Regulation includes provisions that can be interpreted as creating a “right to explanation” for automated decisions that significantly affect individuals.

For political science specifically, explanations help us understand why certain documents were classified in particular ways, can reveal potential political leanings embedded in model outputs, and illuminate the reasoning behind automated assessments.

10.1.3 Interpretability vs. explainability

Understanding the distinction between these two concepts is crucial for approaching model transparency effectively. Interpretability refers to the inherent transparency of a model’s decision-making process. An interpretable model like a decision tree allows us to directly trace how inputs map to outputs without additional tools; the understanding is immediate and the transparency is built into the model design. Explainability, on the other hand, involves post-hoc methods that can help explain any model, including complex black boxes. Techniques like SHAP values or LIME create local explanations of specific predictions, providing approximations or simplified representations of model behaviour after the model has been trained.

This distinction has practical implications. When choosing models for high-stakes applications, interpretable models may be preferable if transparency is critical. However, when both performance and understanding are needed, explainability techniques offer a compromise. For political science applications, interpretable models may be preferable for public-facing policy applications, while explainability techniques might be needed when analysing complex language models in research settings.

10.1.4 Two paradigms of LLM explainability

It is important to distinguish between two fundamentally different paradigms. The traditional fine-tuning paradigm involves taking a pre-trained encoder model and adapting it to specific tasks; here, the goal is often to explain classification decisions or other discrete outputs. The prompting paradigm, which has emerged with large decoder-only models like GPT-4 or Claude, uses the model’s generative capabilities directly through carefully crafted prompts. The explanation challenges are different in each case, and the methods covered below are organised accordingly.

10.1.5 Encoder-only models: the fine-tuning paradigm

10.1.5.1 Local explanation methods

Local explanation methods focus on explaining specific predictions rather than the model’s overall behaviour. They are particularly useful for stakeholders who need to understand individual decisions. There are three major approaches.

Feature attribution methods assign importance scores to input features, showing which parts of the text most influenced a model’s decision. These include perturbation-based approaches that measure output changes when inputs are modified, gradient-based methods like integrated gradients that trace the flow of information, model-agnostic approaches like LIME that approximate complex models with simpler ones, and game-theoretic approaches like SHAP that distribute credit among features. Feature attribution is particularly valuable for identifying when models rely on spurious correlations rather than meaningful patterns. The results are typically visualised through heatmaps or highlighting, making them accessible to non-technical stakeholders.
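The perturbation-based idea can be sketched in a few lines. Here a toy keyword scorer (`sentiment_score`, with invented weights) stands in for a fine-tuned classifier; the occlusion step — delete a token, measure the output change — is the real technique:

```python
def sentiment_score(tokens):
    # Toy stand-in for a fine-tuned classifier's positive-class probability.
    weights = {"excellent": 0.4, "boring": -0.5}
    return 0.5 + sum(weights.get(t, 0.0) for t in tokens)

def occlusion_attribution(tokens, score_fn):
    """Score each token by how much the output drops when it is removed."""
    base = score_fn(tokens)
    return {t: base - score_fn([u for u in tokens if u != t]) for t in tokens}

attributions = occlusion_attribution(["the", "film", "was", "excellent"],
                                     sentiment_score)
# "excellent" receives the largest attribution; function words receive none
```

Libraries such as SHAP and LIME implement far more principled versions of this credit assignment, but the visualised heatmaps ultimately rest on per-token scores of exactly this kind.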

Attention-based explanation leverages the attention mechanisms central to transformer-based LLMs. Attention patterns can be visualised as attention maps or bipartite graphs, revealing both syntactic dependencies (like subject-verb relationships) and semantic associations. There is an ongoing debate about whether attention weights constitute faithful explanations of model behaviour — some researchers argue they provide meaningful insights, while others point out that attention does not always correlate with feature importance. Despite this debate, attention visualisations can be valuable complementary tools alongside other explanation methods. In essence, attention-based explanations show relationships between elements of the input (how the model processes information), while feature attribution explanations show the importance of each input element to the final output (what the model uses to make decisions).
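The attention weights being visualised in such maps are simply rows of a softmax over query–key dot products. A self-contained sketch with toy two-dimensional vectors (invented for illustration) makes the object concrete:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_map(queries, keys):
    """Row i is a distribution over key positions: how much token i attends
    to every token in the sequence."""
    d = len(queries[0])
    return [softmax([sum(q * k for q, k in zip(qrow, krow)) / math.sqrt(d)
                     for krow in keys])
            for qrow in queries]

# Toy 3-token sequence with 2-dimensional query/key vectors
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
weights = attention_map(Q, K)  # 3x3 matrix; each row sums to 1
```

In practice one does not compute these by hand: libraries such as Hugging Face Transformers can return the trained model's attention matrices directly (e.g. via an `output_attentions` flag), and those matrices are what attention-visualisation tools render as heatmaps.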

Example-based explanation leverages the idea that similar examples can provide intuitive understanding of model behaviour. Adversarial examples reveal model vulnerabilities by showing minimal input changes that drastically change outputs. Counterfactuals demonstrate what changes to an input would alter its prediction, providing insights into decision boundaries. Data influence methods identify which training examples most affected a specific prediction, connecting model behaviour to its learning history. These approaches are particularly valuable for non-technical stakeholders who find concrete examples more intuitive than abstract feature importances.
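A minimal counterfactual search can be sketched as a greedy single-token substitution that flips a classifier's decision. The scorer below is again a toy stand-in with invented weights; the search loop is the technique:

```python
def sentiment_score(tokens):
    # Toy stand-in for a classifier's positive-class probability.
    weights = {"excellent": 0.4, "terrible": -0.4}
    return 0.5 + sum(weights.get(t, 0.0) for t in tokens)

def find_counterfactual(tokens, score_fn, substitutes, threshold=0.5):
    """Try the smallest edit — swapping one token — until the predicted
    class flips; return the flipped input, or None if no swap works."""
    original_positive = score_fn(tokens) >= threshold
    for i in range(len(tokens)):
        for word in substitutes:
            candidate = tokens[:i] + [word] + tokens[i + 1:]
            if (score_fn(candidate) >= threshold) != original_positive:
                return candidate
    return None

cf = find_counterfactual(["an", "excellent", "film"], sentiment_score,
                         ["terrible", "fine"])
# Swapping "excellent" for "terrible" flips the prediction
```

The returned example shows exactly which word the decision hinges on — the kind of concrete, contrastive evidence that non-technical stakeholders tend to find most persuasive.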

10.1.5.2 Global explanation methods

While local methods explain specific predictions, global methods aim to understand model behaviour across all possible inputs — attempting to look inside the black box.

Probing-based explanation tests what linguistic knowledge is encoded in model representations by training simple classifiers on these representations to predict specific linguistic properties. Research has revealed a fascinating hierarchical pattern across model layers: lower layers typically encode word-level syntax and local dependencies, middle layers capture sentence-level syntax and more complex relationships, and higher layers represent semantic information and broader context. This hierarchical organisation mirrors the way humans process language, suggesting that models may be learning similar abstraction patterns despite their different architectures.
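The core recipe — freeze the representations, train a simple linear classifier on top, and check whether the property is recoverable — fits in a short sketch. The "representations" below are invented toy vectors; in a real probe they would be hidden states extracted from a particular layer of the model:

```python
import math

def train_probe(reps, labels, lr=0.5, epochs=200):
    """Fit a logistic-regression probe on frozen representations: if a
    linear map can predict the property, that layer encodes it."""
    w = [0.0] * len(reps[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(reps, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1 / (1 + math.exp(-z))       # predicted probability
            g = p - y                        # gradient of the log loss
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def probe_accuracy(w, b, reps, labels):
    preds = [1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
             for x in reps]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

# Toy frozen "layer representations": dimension 0 linearly encodes the
# property (say, whether the subject is plural)
reps = [[1.0, 0.3], [0.9, -0.2], [-1.1, 0.4], [-0.8, -0.5]]
labels = [1, 1, 0, 0]
w, b = train_probe(reps, labels)
accuracy = probe_accuracy(w, b, reps, labels)
```

Running the same probe against representations from each layer, and comparing the accuracies, is how the hierarchical pattern described above is detected in practice.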

Neuron activation explanation analyses individual neurons or dimensions in model representations to identify specialised functions. Recent work has found that LLMs can actually explain the functions of neurons in other LLMs, providing human-interpretable descriptions of what specific components detect. This approach has been particularly valuable for understanding emergent capabilities — behaviours that were not explicitly designed but arise from scale and training. For political scientists, these methods could help identify when and how politically relevant concepts are encoded in model representations.
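A basic version of this analysis ranks inputs by how strongly they activate a given neuron: the top-activating examples hint at what the neuron detects. The activation values below are invented for a hypothetical "negation detector" neuron:

```python
def top_activating(tokens, activation_fn, k=3):
    """Return the k inputs that fire the neuron most strongly; reading
    them off is the simplest way to guess the neuron's function."""
    return sorted(tokens, key=activation_fn, reverse=True)[:k]

def negation_neuron(token):
    # Hypothetical activation profile, invented for illustration.
    return {"not": 2.1, "never": 1.8, "no": 1.5}.get(token, 0.1)

top = top_activating(["the", "not", "cat", "never", "no", "dog"],
                     negation_neuron)
# The top examples are all negation words, suggesting the neuron's role
```

The LLM-explains-LLM work mentioned above automates the final step, asking a model to summarise such top-activating examples into a natural-language description of the neuron.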

Mechanistic interpretability represents the most ambitious approach, aiming to reverse-engineer exactly how models work internally. This approach studies model components as “circuits” — functional units that perform specific tasks — and analyses connections between neurons and attention heads to understand information flow. The main challenge is scaling this analysis to larger models with billions of parameters. Current work focuses on identifying general patterns and principles rather than comprehensive explanations. This emerging field may eventually allow us to understand model reasoning at a level comparable to human thought processes, though significant challenges remain.
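One of the field's basic tools is ablation: zero out a component and measure how the output changes, which localises which circuits carry which information. A toy model whose output is a sum of per-component contributions (both "heads" invented for illustration) shows the logic:

```python
def model_output(x, components):
    # Toy model: the output is the sum of per-component contributions,
    # standing in for the residual-stream view of a transformer.
    return sum(c(x) for c in components)

def ablation_effect(x, components, i):
    """Change in the output when component i is knocked out — a large
    effect suggests the component is part of the circuit for this input."""
    ablated = components[:i] + components[i + 1:]
    return model_output(x, components) - model_output(x, ablated)

# Two hypothetical "heads": one copies the input, one ignores it
heads = [lambda x: x, lambda x: 0.0]
copying_effect = ablation_effect(3.0, heads, 0)   # large: head 0 matters
inert_effect = ablation_effect(3.0, heads, 1)     # zero: head 1 does not
```

Real circuit analysis performs these knockouts on attention heads and MLP blocks inside the network rather than on toy functions, but the counterfactual logic — compare outputs with and without the component — is the same.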

10.1.6 Decoder-only models: the prompting paradigm

As we shift to generative models, explanation methods focus on different aspects of model behaviour.

10.1.6.1 Explaining in-context learning

In-context learning — the ability to adapt to tasks from examples in the prompt without weight updates — remains somewhat mysterious despite its practical success. Research shows that large models can override their semantic priors with just a few examples, while smaller models rely more heavily on pre-training knowledge. Understanding these mechanisms helps design better prompts and predict when in-context learning will succeed or fail. This capability sits at the intersection of memorisation and generalisation, raising interesting questions about what constitutes “learning” in these systems.
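Operationally, in-context learning is just careful prompt construction: demonstrations are formatted as input–label pairs and the query is appended in the same template, with no weight updates anywhere. A minimal sketch (template and labels invented for illustration):

```python
def few_shot_prompt(demonstrations, query):
    """Format (text, label) pairs as in-context demonstrations, then
    append the unlabeled query in the same template."""
    blocks = [f"Text: {text}\nLabel: {label}" for text, label in demonstrations]
    blocks.append(f"Text: {query}\nLabel:")
    return "\n\n".join(blocks)

prompt = few_shot_prompt(
    [("Great speech.", "positive"), ("A dull debate.", "negative")],
    "A rousing rally.")
# The model is expected to continue the pattern after the final "Label:"
```

Because the "learning" lives entirely in this string, experiments on in-context learning typically vary the demonstrations (their number, order, or label correctness) and observe how the continuation changes.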

10.1.6.2 Self-explanation and chain of thought

An exciting development in LLM explainability is that the models themselves can generate natural-language explanations of their reasoning processes. Using chain-of-thought prompting, we can elicit step-by-step reasoning that reveals how the model arrives at conclusions. Models can also provide post-hoc explanations of already-completed predictions, and even engage in counterfactual reasoning about how outputs would change if inputs were different.
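In practice, eliciting a chain of thought is a prompting pattern plus a little output parsing: append a step-by-step cue, then separate the reasoning trace from the final answer. A minimal sketch (the `Answer:` marker is an assumed convention, not a model guarantee):

```python
def chain_of_thought_prompt(question):
    """Append a step-by-step cue so the model emits its reasoning
    before committing to an answer."""
    return f"{question}\nLet's think step by step."

def split_reasoning(response, marker="Answer:"):
    """Separate the generated reasoning trace from the final answer,
    assuming the model was instructed to use the marker."""
    reasoning, _, answer = response.partition(marker)
    return reasoning.strip(), answer.strip()

reasoning, answer = split_reasoning(
    "The bill needs 60 votes. It received 58. Answer: It fails.")
```

The extracted `reasoning` string is the object whose faithfulness is debated below: it is what the model said, not necessarily what the model did.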

The critical research question is whether these explanations faithfully represent the actual computational processes happening within the model, or whether they are merely plausible narratives constructed after the fact. Some evidence suggests these explanations can be faithful to some degree, but they may also rationalise decisions made through other mechanisms. Despite this uncertainty, self-explanations provide immediate practical value by making model decisions more transparent to users without requiring technical understanding of model internals.

10.1.6.3 Representation engineering

Representation engineering takes a top-down approach to understanding and controlling model behaviour by identifying and manipulating high-level concepts in the model’s representation space. By identifying concept vectors that correspond to specific ideas or attributes, researchers can steer model outputs by adding or subtracting these vectors from internal states. This approach has proven powerful for model alignment and control, allowing targeted interventions without wholesale retraining. For political scientists, representation engineering could potentially help mitigate political biases or adapt models to specific political contexts.
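The two key operations — extracting a concept direction and steering with it — can be sketched with toy vectors. A common recipe estimates the direction as a difference of means between hidden states with and without the attribute; all the vectors below are invented for illustration:

```python
def concept_vector(pos_reps, neg_reps):
    """Difference-of-means direction: where 'with-concept' hidden states
    sit relative to 'without-concept' ones."""
    dim = len(pos_reps[0])
    mean = lambda reps, j: sum(r[j] for r in reps) / len(reps)
    return [mean(pos_reps, j) - mean(neg_reps, j) for j in range(dim)]

def steer(hidden_state, direction, alpha=1.0):
    """Shift a hidden state along the concept direction; alpha controls
    how strongly outputs are pushed toward (or, if negative, away from)
    the concept."""
    return [h + alpha * d for h, d in zip(hidden_state, direction)]

# Toy 2-d "hidden states" for inputs with / without a target attribute
direction = concept_vector([[1.0, 0.0], [0.8, 0.2]],
                           [[-1.0, 0.1], [-0.8, -0.1]])
steered = steer([0.0, 0.0], direction)
```

In a real system the `steer` step is applied inside the network (for example via a forward hook on a chosen layer) during generation, rather than to standalone vectors.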

10.1.7 Evaluating explanations

A critical challenge in explanation research is evaluation: how do we know if explanations are good? Two key criteria are plausibility (do explanations make sense to humans and align with their mental models?) and faithfulness (do explanations accurately reflect the model’s actual decision process?). There is often tension between these criteria. Simple, human-understandable explanations may not faithfully capture complex model processes, while perfectly faithful explanations might be too complex for human comprehension. This tension is particularly acute in self-explanation, where the model generates its own reasoning narratives that may be plausible without being faithful.
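Faithfulness, unlike plausibility, can be measured mechanically. One common idea (called "comprehensiveness" in the rationale-evaluation literature) is that deleting the tokens an explanation marks as important should move the model's output substantially; if it barely moves, the explanation was not faithful. A sketch, reusing a toy scorer with invented weights as the "model":

```python
def sentiment_score(tokens):
    # Toy stand-in for a classifier's positive-class probability.
    weights = {"excellent": 0.4, "dull": -0.3}
    return 0.5 + sum(weights.get(t, 0.0) for t in tokens)

def comprehensiveness(tokens, attributions, score_fn, k=1):
    """Score drop after deleting the k tokens the explanation calls most
    important; a faithful explanation should produce a large drop."""
    top = set(sorted(tokens, key=lambda t: attributions.get(t, 0.0),
                     reverse=True)[:k])
    reduced = [t for t in tokens if t not in top]
    return score_fn(tokens) - score_fn(reduced)

tokens = ["an", "excellent", "film"]
faithful = {"excellent": 0.9, "an": 0.0, "film": 0.0}
unfaithful = {"an": 0.9, "excellent": 0.0, "film": 0.0}
gap_faithful = comprehensiveness(tokens, faithful, sentiment_score)
gap_unfaithful = comprehensiveness(tokens, unfaithful, sentiment_score)
```

Note what this buys us: both explanations could be presented plausibly to a human, but only the faithful one survives the deletion test — which is exactly why behavioural checks of this kind matter for self-generated explanations too.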

10.2 Lab

Content to be added.

10.3 Readings

  • Ribeiro, Singh, and Guestrin (2016)
  • Zhao et al. (2023)