Changelog

Notable changes to CodeBook Lab, newest first. Versions follow the package releases on PyPI and GitHub.

1.4.0

2026-06-25

Chat continuation modes. chat_mode controls how annotation calls share chat history within a run — per_text (one chat per row), per_query (a fresh chat per query), or continuous (one chat for the whole run). See Chat continuation modes.
Model reasoning. reasoning=True/False/None is passed through to ChatOllama; Lab captures the model’s reasoning (structured reasoning_content, falling back to <think>…</think>) and writes per-query traces to reasoning_traces.jsonl. See Model reasoning.
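
The fallback order described above can be sketched as follows: prefer a structured reasoning_content field, and otherwise pull the text out of a <think>…</think> block. The function name extract_reasoning and the plain-dict message shape are assumptions for illustration, not Lab's internals.

```python
import re
from typing import Optional, Tuple

# Matches the first <think>...</think> block, including across newlines.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def extract_reasoning(message: dict) -> Tuple[Optional[str], str]:
    """Return (reasoning, answer_text) from a hypothetical message dict."""
    content = message.get("content", "")
    structured = message.get("reasoning_content")
    if structured:
        # Structured reasoning wins when the model supplies it.
        return structured.strip(), content.strip()
    match = THINK_RE.search(content)
    if match:
        # Fall back to inline tags and strip them from the answer text.
        answer = THINK_RE.sub("", content, count=1).strip()
        return match.group(1).strip(), answer
    # No reasoning available in either form.
    return None, content.strip()
```

A message with neither form yields None for the reasoning, so a trace writer could skip it or record an empty entry.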

Sensible classification defaults. ExperimentSpec now defaults to temperature=0.0, use_examples=True, chat_mode="per_text", and reasoning=None, and the default retry strategy is now reprompt.
Run-ID output folders. Runs are written to outputs/<task>/<run_id>/ instead of long configuration-based folder names; the full configuration is stored in config.json, and the runs metrics table now records chat_mode and reasoning. See Outputs & Metrics.
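
To make the three chat_mode values concrete (per_text being the new default), the sketch below groups annotation calls by the chat they would share. It reads per_query as a new chat for every row-and-query call, which is an interpretation of the entry above. The helper names chat_key and group_calls are hypothetical and not part of Lab.

```python
from collections import defaultdict


def chat_key(chat_mode, row_id, query_name):
    """Return an identifier for the chat a call would belong to (illustrative)."""
    if chat_mode == "per_text":
        return ("row", row_id)                       # one chat per row
    if chat_mode == "per_query":
        return ("row", row_id, "query", query_name)  # fresh chat per call
    if chat_mode == "continuous":
        return ("run",)                              # one chat for the whole run
    raise ValueError(f"unknown chat_mode: {chat_mode!r}")


def group_calls(chat_mode, row_ids, query_names):
    """Group every (row, query) call under the chat it would share."""
    chats = defaultdict(list)
    for row_id in row_ids:
        for query_name in query_names:
            chats[chat_key(chat_mode, row_id, query_name)].append((row_id, query_name))
    return dict(chats)
```

With two rows and two queries, this grouping gives two chats under per_text, four under per_query, and one under continuous.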

2026-06-24

Span annotations. Labelled and unlabelled character-span coding, scored with token F1, exact-match F1, and character IoU. Adds the bundled discrete-emotions demo task. See Annotation Types.
Tidy metrics output. Results are written as two linked tables — a long <task>_metrics_log.csv (one row per run × column × metric) and a <task>_metrics_log_runs.csv of per-run configuration and per-query efficiency. See Outputs & Metrics.
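
For readers unfamiliar with the span scores named in this release, here is a minimal sketch of two of them under common textbook definitions. It assumes spans are half-open (start, end) character offsets, and Lab's exact definitions may differ. The function names are illustrative only.

```python
def char_iou(pred_spans, gold_spans):
    """Character-level intersection over union of two span lists."""
    pred_chars = {i for start, end in pred_spans for i in range(start, end)}
    gold_chars = {i for start, end in gold_spans for i in range(start, end)}
    union = pred_chars | gold_chars
    if not union:
        return 1.0  # assumption: two empty annotations count as perfect agreement
    return len(pred_chars & gold_chars) / len(union)


def exact_match_f1(pred_spans, gold_spans):
    """F1 where a predicted span counts only if its offsets match exactly."""
    pred, gold = set(pred_spans), set(gold_spans)
    if not pred and not gold:
        return 1.0  # same assumption as above
    true_positives = len(pred & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)
```

Character IoU gives partial credit for overlapping spans, while exact-match F1 gives none, which is why reporting both can be informative.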

Harmonised invalid-response handling across annotation types: invalid model responses are retried and, if still invalid, stored as null rather than silently replaced with a default. See Conditionals & Retries.
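
The retry-then-null behaviour can be pictured with a small sketch. The names annotate_with_retries, ask_model, and is_valid are hypothetical stand-ins, and the retry count is an assumed parameter, not a documented Lab setting.

```python
def annotate_with_retries(ask_model, is_valid, max_retries=2):
    """Call the model up to max_retries + 1 times; return None if never valid."""
    for _ in range(max_retries + 1):
        response = ask_model()
        if is_valid(response):
            return response
    # Every attempt failed: store null instead of substituting a default label.
    return None
```

Returning None keeps failed annotations visible in the output, so they can be counted or excluded explicitly rather than being mistaken for a real label.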

Added unit tests for span value parsing, response extraction, and metrics integration.

2026-06-24

Human reliability and adjudication. calculate_human_reliability() and build_human_ground_truth() validate coder CSVs, compute inter-coder reliability, surface disagreements, and build a consensus ground truth with an adjudication queue. See Human Reliability.
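
The entry above does not say which reliability statistic is computed. As one common example of inter-coder reliability, the sketch below computes Cohen's kappa for two coders and lists the rows where they disagree. The function names are illustrative and are not Lab's API.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two coders who labelled the same items in order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both coders must label the same non-empty set of items")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both coders pick the same label independently.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used a single identical label throughout
    return (observed - expected) / (1 - expected)


def disagreements(labels_a, labels_b):
    """Indices of items where the two coders differ (candidates for adjudication)."""
    return [i for i, (a, b) in enumerate(zip(labels_a, labels_b)) if a != b]
```

The indices returned by disagreements() correspond to what an adjudication queue would need to resolve before a consensus ground truth can be built.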

2026-06-08

generate_response() now uses Ollama’s JSON-schema structured-output method for more reliable parsing of model responses.
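
Ollama's chat API accepts a JSON schema as its format value, which constrains the model's reply to that shape. How generate_response() builds its schema is not stated here, so the following is only a sketch of the idea using the standard library. The helpers label_schema and parse_label and the single "label" field are assumptions.

```python
import json


def label_schema(allowed_labels):
    """Build a JSON schema for a reply of the form {"label": <one allowed value>}."""
    return {
        "type": "object",
        "properties": {
            "label": {"type": "string", "enum": list(allowed_labels)},
        },
        "required": ["label"],
    }


def parse_label(raw_reply, schema):
    """Parse a model reply; return the label, or None if it does not fit the schema."""
    try:
        data = json.loads(raw_reply)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    label = data.get("label")
    allowed = schema["properties"]["label"]["enum"]
    return label if label in allowed else None
```

Because the schema restricts the label to an enumerated set, parsing reduces to a JSON load and a membership check instead of free-text matching.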

2026-03-23

Initial release: codebook-driven LLM annotation and benchmarking against human labels. run_experiment() and run_experiment_grid() sweep model, prompt style, examples, and sampling settings. Includes classification, agreement, and textbox metrics; built-in standard, persona, and CoT prompt wrappers; and the bundled policy-sentiment task.
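
The changelog does not list the arguments of run_experiment_grid(), so the sketch below shows only the general idea of a sweep: expanding lists of settings into every combination. The helper expand_grid and the setting names passed to it are hypothetical.

```python
from itertools import product


def expand_grid(**options):
    """Expand keyword lists into one dict per combination (Cartesian product)."""
    names = list(options)
    value_lists = (options[name] for name in names)
    return [dict(zip(names, combo)) for combo in product(*value_lists)]


# Hypothetical setting names, chosen to mirror the sweep dimensions above.
grid = expand_grid(
    model=["model-a", "model-b"],
    use_examples=[True, False],
    temperature=[0.0],
)
```

Each dict in grid would describe one run. The number of runs is the product of the list lengths, so large sweeps grow quickly.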