Changelog
Notable changes to CodeBook Lab, newest first. Versions follow the package releases on PyPI and GitHub.
1.4.0
2026-06-25
New features
- Chat continuation modes. chat_mode controls how annotation calls share chat history within a run: per_text (one chat per row), per_query (a fresh chat per query), or continuous (one chat for the whole run). See Chat continuation modes.
- Model reasoning. reasoning=True/False/None is passed through to ChatOllama; Lab captures the model’s reasoning (structured reasoning_content, falling back to <think>…</think>) and writes per-query traces to reasoning_traces.jsonl. See Model reasoning.
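The fallback order described for reasoning capture can be sketched as follows. This is a minimal illustration, not Lab's actual code: the helper name extract_reasoning and its signature are hypothetical, and it only assumes that a structured reasoning field is preferred when present and that <think>…</think> tags are parsed out of the reply otherwise.

```python
import re

# Hypothetical helper for illustration; not part of CodeBook Lab's API.
_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def extract_reasoning(content, reasoning_content=None):
    """Return a (reasoning, answer) pair for one model reply."""
    # Prefer the structured reasoning field when the model supplies one.
    if reasoning_content:
        return reasoning_content.strip(), content.strip()
    # Otherwise fall back to an inline <think>...</think> block.
    match = _THINK_RE.search(content)
    if match is None:
        return None, content.strip()
    reasoning = match.group(1).strip()
    # Remove the reasoning block so only the answer text remains.
    answer = _THINK_RE.sub("", content, count=1).strip()
    return reasoning, answer
```

Keeping the answer text separate from the trace means the reasoning can be logged without interfering with response parsing.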
Changes
- Sensible classification defaults. ExperimentSpec now defaults to temperature=0.0, use_examples=True, chat_mode="per_text", and reasoning=None, and the default retry strategy is now reprompt.
- Run-ID output folders. Runs are written to outputs/<task>/<run_id>/ instead of long configuration-based folder names; the full configuration is stored in config.json, and the runs metrics table now records chat_mode and reasoning. See Outputs & Metrics.
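With run-ID folders, the configuration for a run has to be read from config.json rather than inferred from the folder name. A minimal sketch, assuming config.json sits directly inside outputs/<task>/<run_id>/; the helper load_run_config is hypothetical, and the keys inside the file are not specified here.

```python
import json
from pathlib import Path


def load_run_config(outputs_root, task, run_id):
    """Read the stored configuration for one run (hypothetical helper)."""
    # Assumed layout: <outputs_root>/<task>/<run_id>/config.json
    config_path = Path(outputs_root) / task / run_id / "config.json"
    with config_path.open(encoding="utf-8") as handle:
        return json.load(handle)
```

For example, load_run_config("outputs", "policy-sentiment", run_id) would return that run's settings as a dictionary, where run_id is whatever identifier Lab assigned.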
Internal changes
- Updated tests, README, and example scripts.
1.3.0
2026-06-24
New features
- Span annotations. Labelled and unlabelled character-span coding, scored with token F1, exact-match F1, and character IoU. Adds the bundled discrete-emotions demo task. See Annotation Types.
- Tidy metrics output. Results are written as two linked tables: a long <task>_metrics_log.csv (one row per run × column × metric) and a <task>_metrics_log_runs.csv of per-run configuration and per-query efficiency. See Outputs & Metrics.
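To make two of the span metrics concrete, here is a sketch of character IoU and exact-match F1 for unlabelled spans. It uses the standard definitions and is not Lab's implementation; the function names, the (start, end) representation with an exclusive end, and the handling of empty inputs are all assumptions.

```python
def covered_chars(spans):
    """Set of character offsets covered by (start, end) spans, end exclusive."""
    chars = set()
    for start, end in spans:
        chars.update(range(start, end))
    return chars


def char_iou(pred_spans, gold_spans):
    """Character-level intersection over union of two span sets."""
    pred, gold = covered_chars(pred_spans), covered_chars(gold_spans)
    union = pred | gold
    if not union:
        return 1.0  # assumption: two empty annotations count as full agreement
    return len(pred & gold) / len(union)


def exact_match_f1(pred_spans, gold_spans):
    """F1 where a predicted span counts only if its boundaries match exactly."""
    pred, gold = set(pred_spans), set(gold_spans)
    if not pred and not gold:
        return 1.0  # assumption: same convention as char_iou
    true_positives = len(pred & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)
```

Character IoU gives partial credit for overlapping boundaries, whereas exact-match F1 gives none, which is why reporting both is informative.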
Bug fixes
- Harmonised invalid-response handling across annotation types: invalid model responses are retried and then stored as null, rather than silently defaulted. See Conditionals & Retries.
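The retry-then-null behaviour can be pictured with a small sketch. It is illustrative only: annotate_with_retries, its parameters, and the retry count are hypothetical and do not reflect Lab's internal function names or defaults.

```python
def annotate_with_retries(ask, is_valid, max_retries=2):
    """Call ask() until the response validates; return None if it never does."""
    # One initial attempt plus up to max_retries further attempts.
    for _ in range(max_retries + 1):
        response = ask()
        if is_valid(response):
            return response
    # Store an explicit null instead of silently substituting a default label.
    return None
```

Returning None keeps failed annotations visible in the output, so they can be counted or excluded rather than mistaken for a real label.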
Internal changes
- Added unit tests for span value parsing, response extraction, and metrics integration.
1.2.0
2026-06-24
New features
- Human reliability and adjudication. calculate_human_reliability() and build_human_ground_truth() validate coder CSVs, compute inter-coder reliability, surface disagreements, and build a consensus ground truth with an adjudication queue. See Human Reliability.
Internal changes
- A GitHub Release is now created automatically on tag push.
1.1.1
2026-06-08
New features
- generate_response() now uses Ollama’s JSON-schema structured-output method for more reliable parsing of model responses.
Internal changes
- Added a PyPI publish workflow, README badges, and the preprint reference.
1.1.0
2026-03-23
Internal changes
- First packaged distribution, with a Zenodo DOI and citation metadata.
1.0.0
2026-03-23
New features
- Initial release: codebook-driven LLM annotation and benchmarking against human labels. Includes run_experiment() and run_experiment_grid(), which sweep model, prompt style, examples, and sampling settings; classification, agreement, and textbox metrics; built-in standard, persona, and CoT prompt wrappers; and the bundled policy-sentiment task.