Magic Words or Methodical Work? Challenging Conventional Wisdom in LLM-Based Political Text Annotation
with James P. Cross, Zuzanna Krakowska, Robin Rauner and Martijn Schoonvelde
[Under Review]
Preprint
Abstract
Political scientists are rapidly adopting large language models (LLMs) for text annotation, yet the sensitivity of annotation results to implementation choices remains poorly understood. Most evaluations test a single model or configuration, leaving two questions largely unexamined: how task and model choice, model size, learning approach, and prompt style interact, and whether popular “best practices” survive controlled comparison. We evaluate these pipeline choices, testing six open-weight models across four political science annotation tasks under identical quantisation, hardware, and prompt-template conditions. Our central finding is methodological: interaction effects dominate main effects, so seemingly reasonable pipeline choices become consequential researcher degrees of freedom. No single model, prompt style, or learning approach is uniformly superior, and the best-performing model varies across tasks. Two corollaries follow. First, model size is an unreliable guide to both cost and performance: cross-family efficiency differences are so large that some larger models are less resource-intensive than much smaller alternatives, while within families mid-range variants often match or exceed larger counterparts. Second, widely recommended prompt engineering techniques yield inconsistent and sometimes negative effects. We use these results to develop a validation-first framework with a principled ordering of pipeline decisions, reporting standards, and open-source tools.