Outputs And Metrics
Each run writes its files to outputs/<task>/<run_id>/, a directory named by a short run ID. Short run-ID folders keep paths compact; the full set of choices for the run is recorded in config.json and in the runs table under outputs/metrics/.
    outputs/
        policy-sentiment/
            a1b2c3d4/
                output.csv
                config.json
                classification_reports.txt
                emissions.csv
                timing_data.json
                char_counts.json
                reasoning_traces.jsonl
        metrics/
            policy-sentiment_metrics_log.csv
            policy-sentiment_metrics_log_runs.csv
Per-Run Files
- output.csv: Row-level model annotations.
- config.json: The full run configuration (model, prompt, sampling, chat mode, reasoning, and task paths), so a run-ID folder can always be traced back to its settings.
- classification_reports.txt: Human-readable evaluation summaries by annotation field.
- emissions.csv: CodeCarbon energy and emissions output.
- timing_data.json: Total inference time, average inference time, and request counts.
- char_counts.json: Total input and output characters sent to and returned from the model.
- reasoning_traces.jsonl: One record per query capturing the model’s reasoning content, when the model returns it (see Model reasoning).
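A run folder can be read back with the standard library alone. The sketch below assumes the file names listed above and that the three JSON files hold plain JSON; `load_run` is a hypothetical helper, not part of the tool, and it makes no assumption about the keys inside each file.

```python
import json
from pathlib import Path


def load_run(outputs_root, task, run_id):
    """Collect the JSON artifacts of one run into a dict (hypothetical helper)."""
    run_dir = Path(outputs_root) / task / run_id
    loaded = {}
    for name in ("config.json", "timing_data.json", "char_counts.json"):
        path = run_dir / name
        # Skip files that are missing rather than failing the whole load.
        if path.exists():
            loaded[path.stem] = json.loads(path.read_text(encoding="utf-8"))
    return loaded
```

For the example tree above, `load_run("outputs", "policy-sentiment", "a1b2c3d4")` would return a dict with the entries `config`, `timing_data`, and `char_counts` for whichever of those files exist.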
Tidy Metrics Tables
Aggregate metrics are written as two linked tidy tables:
- <task>_metrics_log.csv: The long table, with one row per (run_id, annotation column, annotation type, metric) and the metric value.
- <task>_metrics_log_runs.csv: The runs table, with one row per run_id holding the model, prompt, sampling, chat mode, and reasoning configuration, task paths, inference counts, character counts, energy use, emissions, CPU/GPU metadata, and per-query averages.
This split keeps metric values long and analysis-friendly while avoiding repeated run metadata on every metric row.
Depending on the annotation types in your codebook, metrics can include:
- accuracy, precision, recall, F1, and percentage agreement
- Cohen’s kappa and Krippendorff’s alpha
- Spearman correlation and quadratic weighted kappa for Likert fields
- normalized Levenshtein similarity and BLEU for textbox fields
- ROUGE, embedding cosine similarity, and BERTScore when optional textbox extras are installed
- token F1, exact-match F1, and character IoU for span fields
- total inference time, average inference time, number of queries, character counts, energy use, and emissions
Linking the two tables
run_id is the key shared by both tables, so you can join metric values to the configuration that produced them with a single merge:

    import pandas as pd

    metrics = pd.read_csv("outputs/metrics/policy-sentiment_metrics_log.csv")
    runs = pd.read_csv("outputs/metrics/policy-sentiment_metrics_log_runs.csv")

    # Attach each run's configuration to every metric row it produced.
    joined = metrics.merge(runs, on="run_id")

    # e.g. macro-F1 on Direction by model and chat mode
    view = joined[
        (joined["column"] == "Policy Sentiment_Direction")
        & (joined["metric"] == "f1_macro")
    ]
    print(view[["model_id", "chat_mode", "reasoning", "use_examples", "value"]])

Because every metric row carries a run_id, you can pivot, group, and compare runs along any configuration dimension (model, prompt, examples, chat mode, reasoning, or sampling), all scored against the same human labels.
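As one illustration of such a comparison, the long table can be pivoted so that each run becomes a row and each metric a column. This is a sketch on toy data: the column names `run_id`, `column`, `metric`, `value`, and `model_id` follow the example above, while the values, the second run ID, the model names, and the metric name `accuracy` are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for the two CSVs; the real tables carry more columns.
metrics = pd.DataFrame({
    "run_id": ["a1b2c3d4", "a1b2c3d4", "e5f6a7b8", "e5f6a7b8"],
    "column": ["Policy Sentiment_Direction"] * 4,
    "metric": ["f1_macro", "accuracy", "f1_macro", "accuracy"],
    "value": [0.5, 0.75, 0.25, 0.5],
})
runs = pd.DataFrame({
    "run_id": ["a1b2c3d4", "e5f6a7b8"],
    "model_id": ["model-a", "model-b"],
})

# One row per run, one column per metric.
wide = metrics.pivot_table(index="run_id", columns="metric", values="value")

# Attach the run configuration; validate= raises if a run_id is duplicated.
comparison = runs.merge(
    wide, left_on="run_id", right_index=True, validate="one_to_one"
)
print(comparison.sort_values("f1_macro", ascending=False))
```

With several annotation columns in the long table, filter to one column first (or add `column` to the pivot index) so that values from different fields are not averaged together.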
Choosing Metrics
For checkbox and dropdown annotations, start with F1 and percentage agreement. For imbalanced labels, inspect per-class precision and recall in classification_reports.txt.
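The effect of imbalance is easy to see on a toy example. The sketch below uses made-up labels and a hypothetical helper, `per_class_precision_recall`, to show how overall agreement can look strong while recall on the rare class is poor; it is an illustration, not the tool's own scoring code.

```python
def per_class_precision_recall(human, model):
    """Return {label: (precision, recall)} for every label seen in either list."""
    scores = {}
    for label in sorted(set(human) | set(model)):
        true_pos = sum(h == label and m == label for h, m in zip(human, model))
        predicted = sum(m == label for m in model)
        actual = sum(h == label for h in human)
        precision = true_pos / predicted if predicted else 0.0
        recall = true_pos / actual if actual else 0.0
        scores[label] = (precision, recall)
    return scores


# 18 "neutral" rows and 2 "negative" rows; the model misses one negative.
human = ["neutral"] * 18 + ["negative"] * 2
model = ["neutral"] * 19 + ["negative"] * 1

agreement = sum(h == m for h, m in zip(human, model)) / len(human)
scores = per_class_precision_recall(human, model)
print(f"agreement: {agreement:.2f}")
for label, (precision, recall) in scores.items():
    print(f"{label}: precision={precision:.2f} recall={recall:.2f}")
```

Here 19 of 20 rows agree, yet only one of the two "negative" rows is recovered, which is the kind of gap the per-class report is meant to surface.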
For Likert annotations, report ordinal-aware metrics alongside exact agreement. A model that predicts adjacent categories may be less wrong than one that jumps from one end of the scale to the other.
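To make the adjacent-versus-distant point concrete, here is a minimal pure-Python sketch of quadratic weighted kappa, one of the ordinal-aware metrics listed earlier. It follows the standard textbook definition and is not necessarily how the tool computes its value; the function name and the toy ratings are illustrative.

```python
from collections import Counter


def quadratic_weighted_kappa(human, model, categories):
    """Quadratic weighted kappa for two equal-length lists of ordinal labels."""
    if len(human) != len(model) or not human:
        raise ValueError("need two non-empty rating lists of equal length")
    if len(categories) < 2:
        raise ValueError("need at least two ordered categories")
    index = {c: i for i, c in enumerate(categories)}
    n, k = len(human), len(categories)
    observed = Counter((index[h], index[m]) for h, m in zip(human, model))
    human_totals = Counter(index[h] for h in human)
    model_totals = Counter(index[m] for m in model)

    disagreement = 0.0  # weighted disagreement actually observed
    chance = 0.0        # weighted disagreement expected by chance
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2
            disagreement += weight * observed[(i, j)] / n
            chance += weight * human_totals[i] * model_totals[j] / (n * n)
    if chance == 0:
        raise ValueError("kappa is undefined when both raters use one category")
    return 1.0 - disagreement / chance


scale = [1, 2, 3, 4, 5]
human = [1, 2, 3, 4, 5, 1, 5]
adjacent = [2, 2, 3, 4, 4, 1, 5]  # two errors, each one step off
distant = [5, 2, 3, 4, 1, 1, 5]   # two errors, each four steps off

# Both models match the human on 5 of 7 rows, but kappa separates them.
print(quadratic_weighted_kappa(human, adjacent, scale))
print(quadratic_weighted_kappa(human, distant, scale))
```

Exact agreement is identical for the two toy models, so only the ordinal-aware score shows that the second one makes much larger errors.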
For textbox annotations, treat similarity metrics as diagnostics rather than a single ground-truth measure. Short evidence extracts can be semantically equivalent while using different words.
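A small sketch shows why a string-similarity score is only a diagnostic. It implements normalized Levenshtein similarity as one minus the edit distance divided by the longer string's length, which is one common normalization and an assumption here; the tool may normalize differently. The function names and example strings are illustrative.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute cost 1)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute or match
            ))
        previous = current
    return previous[-1]


def normalized_similarity(a, b):
    """1 - distance / longer length, so 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


human_extract = "taxes will rise"
model_extract = "the levy is going up"
# The score reflects shared characters, not shared meaning.
print(normalized_similarity(human_extract, model_extract))
```

Two extracts that a reader would accept as equivalent can still score well below 1.0, which is why these values are best read next to the row-level output rather than on their own.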
For span annotations, token_f1 is the headline score for most comparisons because it rewards overlap at the token level. Use exact_match_f1 when boundary precision matters and char_iou when near misses should still be visible.
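The difference between these scores shows up on a single over-long prediction. The sketch below assumes whitespace tokenization for token F1 and half-open (start, end) character offsets for IoU; these are common definitions, not confirmed details of the tool, and the function names and example sentence are illustrative.

```python
from collections import Counter


def token_f1(predicted, reference):
    """Token-overlap F1 between two span texts, using whitespace tokens."""
    pred_tokens, ref_tokens = predicted.split(), reference.split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as agreement; exactly one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def char_iou(predicted, reference):
    """Intersection over union of two half-open (start, end) character ranges."""
    (p_start, p_end), (r_start, r_end) = predicted, reference
    intersection = max(0, min(p_end, r_end) - max(p_start, r_start))
    union = (p_end - p_start) + (r_end - r_start) - intersection
    return intersection / union if union else 0.0


text = "The minister strongly opposed the new carbon tax."
reference_text = "strongly opposed"
predicted_text = "strongly opposed the new carbon tax"

# Derive character offsets from the text instead of hard-coding them.
ref_start = text.index(reference_text)
pred_start = text.index(predicted_text)
reference = (ref_start, ref_start + len(reference_text))
predicted = (pred_start, pred_start + len(predicted_text))

print(token_f1(predicted_text, reference_text))
print(char_iou(predicted, reference))
print(predicted == reference)  # exact boundary match
```

The prediction covers the whole reference span, so it earns partial credit on both overlap scores, while an exact-match criterion treats it as a miss.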