7 Generative Language Models II
Lecture: Prompt engineering, hyperparameters, and LLMs vs. human annotation
Lab: Prompt engineering and API usage in Python
7.1 Lecture
This week continues our exploration of generative language models, with a focus on practical prompt engineering techniques and the hyperparameters that control model output. These skills are immediately applicable to course projects and are critical for anyone using LLMs in computational social science research.
7.1.1 LLMs vs. human annotators
Before turning to prompt engineering, it is worth considering how LLMs compare to human annotators for text classification tasks. Gilardi, Alizadeh, and Kubli (2023) compared the performance of ChatGPT against trained annotators and Amazon Mechanical Turk crowd workers across several datasets, measuring both accuracy and intercoder agreement. Their results showed that LLMs often match or exceed human performance on straightforward tasks such as relevance classification, particularly when temperature settings are kept low (e.g. 0.2 rather than 1.0). However, human validation remains essential for more complex judgements involving framing or nuanced conceptual distinctions. These findings have important implications for research scalability and efficiency, but they do not eliminate the need for careful validation.
7.1.2 Prompt engineering
Prompt engineering is the practice of optimising inputs to get better outputs from generative models. It is both an art and a science, requiring experimentation and creativity alongside systematic evaluation. Small changes to a prompt can dramatically affect output quality — the same sentiment analysis task can yield verbose and inconsistent results with one prompt formulation, and clean, structured outputs with another. The goal is reliable, consistent outputs that meet research standards.
7.1.2.1 Prompting fundamentals
Three simple techniques often outperform switching to a larger model:
Specificity. Vague prompts lead to vague answers. Instead of “Analyse this speech,” a more effective prompt would be “Analyse the economic policy positions in this speech using the CAP codebook.” Precision about requirements, categories, and expected output format substantially improves results.
Reducing hallucination. Explicitly instructing models to indicate uncertainty reduces hallucination. Telling the model to respond with “I don’t know” or “N/A” when it is unsure helps prevent fabricated outputs.
Instruction order. LLMs exhibit primacy and recency effects: they tend to attend most carefully to instructions placed at the beginning or end of a prompt, and may lose track of information in the middle. Place the most important instructions accordingly.
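The three fundamentals can be combined in a single prompt. A minimal sketch in Python (the task, category labels, and speech text below are invented for illustration, not taken from a real codebook):

```python
# Illustrative speech excerpt (invented).
speech = "We will cut income tax and increase infrastructure spending."

prompt = (
    # Most important instruction first (primacy effect), with specific categories.
    "Classify the economic policy position in the speech below as one of: "
    "'fiscal expansion', 'fiscal restraint', or 'mixed'.\n\n"
    f"Speech: {speech}\n\n"
    # Uncertainty instruction to reduce hallucination.
    "If the speech contains no clear economic policy position, respond 'N/A'.\n"
    # Restate the key output requirement at the end (recency effect).
    "Respond with the category label only."
)

print(prompt)
```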
7.1.2.2 Few-shot prompting
Few-shot prompting teaches the model by example. In a zero-shot prompt, no examples are provided and the model relies entirely on pre-training. In a one-shot prompt, a single example guides the response format. In a few-shot prompt, multiple examples establish a clear pattern. Models can learn patterns from just two or three examples, and few-shot prompting is particularly effective for specialised tasks, domain-specific vocabulary, and ensuring consistent formatting in research applications. Examples should represent the full range of expected inputs.
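A few-shot prompt can be built programmatically so the same examples are reused across all queries. A sketch with invented sentiment examples:

```python
# Invented labelled examples; in practice these would come from the codebook
# and should span the full range of expected inputs.
examples = [
    ("The minister's plan is a disaster for working families.", "negative"),
    ("This bill finally delivers the support our schools need.", "positive"),
    ("The committee will meet again on Thursday.", "neutral"),
]

def few_shot_prompt(text: str) -> str:
    """Assemble a few-shot sentiment prompt ending at the slot to complete."""
    lines = ["Classify the sentiment of each statement as positive, negative, or neutral.\n"]
    for statement, label in examples:
        lines.append(f"Statement: {statement}\nSentiment: {label}\n")
    # The final, unlabelled statement for the model to classify.
    lines.append(f"Statement: {text}\nSentiment:")
    return "\n".join(lines)

print(few_shot_prompt("Voters deserve better than these empty promises."))
```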
Dynamic few-shot prompting is an advanced variant. Rather than using fixed examples for all queries, it selects examples from a larger pool based on their similarity to the current query. This requires computing semantic similarity between the query and available examples and selecting the most relevant ones for each prompt. It is more complex to implement but often more effective when inputs are diverse.
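The selection step can be sketched with a simple word-overlap (Jaccard) similarity standing in for embedding-based semantic similarity; in practice a sentence-embedding model would replace the `jaccard` function. All examples below are invented:

```python
def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity: a crude stand-in for semantic similarity."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def select_examples(query: str, pool: list[tuple[str, str]], k: int = 2) -> list[tuple[str, str]]:
    """Pick the k pool examples most similar to the current query."""
    return sorted(pool, key=lambda ex: jaccard(query, ex[0]), reverse=True)[:k]

pool = [
    ("The budget cuts healthcare funding sharply.", "negative"),
    ("Healthcare funding rises under the new budget.", "positive"),
    ("The debate lasted three hours.", "neutral"),
]
query = "New healthcare funding announced in the budget."
chosen = select_examples(query, pool)  # the two healthcare-related examples
```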
7.1.2.3 Prompt components
A well-structured complex prompt may include several components, each serving a specific purpose: a persona that establishes expertise and role (e.g. “You are an expert in large language models”), an instruction that states the main task, context that provides background and constraints, a format specification for the desired output structure, an audience definition, a tone directive, and the data to be processed. Not all elements are needed for every task, but research applications often require more structured prompts than casual use.
In code, a modular approach defines each component as a separate variable. The final prompt is assembled by concatenating the components. This makes it easy to experiment by changing one component at a time, improves maintainability, and supports systematic testing of component variations to identify optimal configurations. The final prompt structure should be documented in research methods sections.
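The modular approach described above might look like this (every component's content is invented for illustration):

```python
# Each component lives in its own variable so it can be varied independently.
persona = "You are an expert analyst of parliamentary debate."
instruction = "Summarise the main policy argument in the text."
context = "The text is an excerpt from a legislative debate on fiscal policy."
output_format = "Respond with a single sentence."
tone = "Use neutral, academic language."
data = "Text: 'The honourable member's proposal would double the deficit.'"

# Assemble the final prompt by concatenating the components.
prompt = "\n\n".join([persona, instruction, context, output_format, tone, data])
print(prompt)
```

Swapping out one variable (say, `output_format`) while holding the others fixed supports the systematic component testing described above.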
7.1.2.4 Chain-of-thought prompting
Chain-of-thought (CoT) prompting asks the model to show its reasoning steps explicitly. Rather than producing an answer directly, the model walks through the logic that leads to its conclusion. This is particularly effective for mathematical problems, logical reasoning, multi-step analysis, and policy analysis with multiple interacting factors. CoT prompting improves accuracy and provides transparency into the model’s reasoning process. It can be combined with few-shot examples: providing worked examples that include reasoning steps teaches the model to apply the same approach to new problems.
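A CoT prompt combined with a worked few-shot example might be assembled like this (statements and reasoning are invented):

```python
# One worked example with explicit reasoning, then a new statement whose
# reasoning the model is asked to complete.
cot_prompt = (
    "Does the following statement support or oppose the policy? "
    "Reason step by step before answering.\n\n"
    "Statement: 'The tax would hurt small businesses, but the revenue is needed.'\n"
    "Reasoning: The statement names a harm (hurting small businesses) and a "
    "benefit (needed revenue). The concessive 'but' signals that the second "
    "clause carries the speaker's conclusion.\n"
    "Answer: support\n\n"
    "Statement: 'No credible economist backs this plan.'\n"
    "Reasoning:"
)
print(cot_prompt)
```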
7.1.2.5 Self-consistency
Self-consistency is a technique for improving accuracy on difficult problems. The approach generates multiple independent solutions to the same problem (each using chain-of-thought reasoning), then takes a majority vote to determine the final answer. This reduces the impact of random errors in reasoning and is particularly effective for classification of ambiguous cases. The trade-off is increased computation time and cost, but for high-stakes applications requiring reliability, the improvement in accuracy can justify the expense.
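The voting step is straightforward to implement. In the sketch below, the sampled answers are hard-coded in place of repeated high-temperature chain-of-thought calls to a model:

```python
from collections import Counter

def majority_vote(answers: list[str]) -> str:
    """Return the most common answer across independent reasoning chains."""
    return Counter(answers).most_common(1)[0][0]

# Hypothetical answers from five independent chain-of-thought samples.
sampled = ["support", "oppose", "support", "support", "oppose"]
final = majority_vote(sampled)
print(final)  # support
```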
7.1.2.6 Iterative prompt development
Prompt engineering is inherently iterative. The process mirrors the scientific method: start with a hypothesis (a prompt that should improve output), test it (run the prompt), analyse the results, and refine. Starting simple and adding complexity as needed, while documenting iterations to understand what works and why, is the most effective approach. This process is critical for developing reproducible research methods.
7.1.2.7 Structured output
For research applications, generative LLM responses are often most useful in machine-readable formats. Common formats include JSON, XML, and YAML. Structured outputs are easier to validate, analyse, and integrate into automated processing pipelines, and they ensure consistent format across multiple queries. When prompting for structured output, always demonstrate the desired format in the prompt, include examples with the exact keys and structure expected, and clearly specify data types for each field.
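A sketch of this pattern with JSON: the prompt demonstrates the exact keys and types, and the reply is validated by parsing it before use. The field names and the model reply here are hypothetical:

```python
import json

prompt = (
    "Classify the sentiment of the text. Respond with JSON only, in exactly "
    "this format:\n"
    '{"label": "positive" | "negative" | "neutral", "confidence": <float 0-1>}\n'
    "Text: 'A welcome step forward for the sector.'"
)

# A hypothetical model reply, validated before entering the pipeline.
reply = '{"label": "positive", "confidence": 0.92}'
parsed = json.loads(reply)  # raises ValueError if the reply is not valid JSON
assert parsed["label"] in {"positive", "negative", "neutral"}
assert 0.0 <= parsed["confidence"] <= 1.0
```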
7.1.2.8 System and user prompts
When interacting with models through APIs, prompts are typically divided into two components. The system prompt defines the persona, expertise, and overall guidelines and constraints. The user prompt contains the specific text to be analysed and may include examples or criteria. This separation enables consistent behaviour across multiple queries and reusable system prompts for related tasks. For research, maintaining consistent system prompts and documenting both components is important for reproducibility. Not all models support this separation, in which case everything can be included in a single prompt. Whether separating system and user prompts makes a meaningful difference to performance is not yet entirely clear.
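Several chat APIs express this separation as a list of role-tagged messages (the OpenAI-style format is shown below; exact client calls vary by provider and are omitted here, and the prompt content is invented):

```python
# A reusable system prompt, held constant across all queries in a study.
system_prompt = (
    "You are an expert political text annotator. Always respond with a single "
    "category label and nothing else."
)

def build_messages(text: str) -> list[dict]:
    """Pair the fixed system prompt with a query-specific user prompt."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"Classify the topic of this text: {text}"},
    ]

messages = build_messages("The central bank raised interest rates today.")
```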
7.1.3 Hyperparameters
Generative language models are trained to predict the next token given a sequence of previous tokens. When generating text, the model produces a probability distribution over possible next tokens and selects one according to that distribution. Several hyperparameters control how this selection is made, and they significantly affect model outputs.
7.1.3.1 Temperature
Temperature controls the randomness of token selection. Technically, it is a scaling factor applied to the raw model outputs (logits) before they are converted into probabilities. At low temperature, the probability distribution becomes sharper: the model focuses on the highest-probability tokens, producing more deterministic, predictable, and conservative responses. At high temperature, the distribution flattens: lower-probability tokens have a greater chance of being selected, producing more diverse, creative, and unexpected responses. A temperature of 0 yields completely deterministic output (greedy decoding, always selecting the most probable token), while a temperature of 1 samples directly from the unmodified probability distribution. For most research classification tasks, low temperature settings are preferable.
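The scaling can be made concrete: logits are divided by the temperature before the softmax, so low temperatures sharpen the distribution and high temperatures flatten it. The logits below are invented for illustration:

```python
import math

def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]
sharp = softmax_with_temperature(logits, 0.2)  # near-deterministic
flat = softmax_with_temperature(logits, 2.0)   # closer to uniform
```

At temperature 0.2 nearly all probability mass lands on the first token, while at 2.0 the three tokens are much closer to equally likely.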
7.1.3.2 Top-p (nucleus sampling)
Top-p, also known as nucleus sampling, offers an alternative way to control randomness. It selects from the smallest set of tokens whose cumulative probability exceeds the threshold p. At low top-p, only the most likely tokens are considered, producing focused and conservative outputs. At high top-p, a wider range of tokens enters the candidate pool, leading to more diverse vocabulary. Top-p often provides more intuitive control than temperature and complements it. Values typically range from 0.1 to 1.0.
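The candidate-set construction can be sketched as follows: sort tokens by probability, keep them until the cumulative probability reaches p, and renormalise. The token probabilities below are invented:

```python
def nucleus(probs: dict[str, float], p: float) -> dict[str, float]:
    """Keep the smallest high-probability token set reaching cumulative p."""
    kept, cumulative = {}, 0.0
    for token, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[token] = prob
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(kept.values())
    return {t: pr / total for t, pr in kept.items()}  # renormalise to sum to 1

probs = {"the": 0.5, "a": 0.3, "this": 0.15, "that": 0.05}
kept = nucleus(probs, 0.79)  # only 'the' and 'a' survive at this threshold
```

Sampling then proceeds from the renormalised distribution over the kept tokens only.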
7.1.3.3 Choosing settings
The appropriate settings depend on the task. For brainstorming or exploratory work, high temperature and high top-p produce diverse and creative outputs. For formal writing or email generation, low temperature and low top-p yield predictable, focused results. For creative writing, high temperature with low top-p allows creativity while maintaining coherence. For translation or technical tasks, low temperature with high top-p provides accuracy with vocabulary variety. For research classification, low temperature is almost always appropriate. These settings should be documented in research methods sections, and researchers should experiment to find optimal configurations for their specific tasks.
7.1.4 Conceptual clarity
The convenience and scalability of LLMs for text classification do not ensure accuracy. Traditional validation methods — the kind discussed in Chapter 5 — remain essential when using LLMs instead of, or alongside, human coders. Clear conceptual definitions are even more critical with generative models: the researcher should never assume that the model understands a concept in the same way they do. Outputs should always be validated against human judgements, established measures, and ground truth when available, and validation procedures should be documented thoroughly.
7.2 Lab
Content to be added.