Outputs And Metrics

Each run writes its files to outputs/<task>/<run_id>/, a directory named by a short run ID. Short run-ID folder names keep paths compact; the full set of choices for the run is recorded in config.json and in the runs table (<task>_metrics_log_runs.csv).

outputs/
  policy-sentiment/
    a1b2c3d4/
      output.csv
      config.json
      classification_reports.txt
      emissions.csv
      timing_data.json
      char_counts.json
      reasoning_traces.jsonl
  metrics/
    policy-sentiment_metrics_log.csv
    policy-sentiment_metrics_log_runs.csv

Per-Run Files

output.csv
Row-level model annotations.
config.json
The full run configuration — model, prompt, sampling, chat mode, reasoning, and task paths — so a run-ID folder can always be traced back to its settings.
classification_reports.txt
Human-readable evaluation summaries by annotation field.
emissions.csv
CodeCarbon energy and emissions output.
timing_data.json
Total inference time, average inference time, and request counts.
char_counts.json
Total input and output characters sent to and returned from the model.
reasoning_traces.jsonl
One record per query capturing the model’s reasoning content, when the model returns it (see Model reasoning).
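
To inspect a single run programmatically, the JSON files listed above can be read with the standard library. The following is a minimal sketch, not part of the tool: the helper names load_run_json and load_reasoning_traces are hypothetical, and nothing is assumed about the keys inside each file.

```python
import json
from pathlib import Path

# File names taken from the per-run listing above.
RUN_JSON_FILES = ("config.json", "timing_data.json", "char_counts.json")


def load_run_json(run_dir):
    """Return {file name: parsed JSON} for the JSON files present in a run folder.

    Hypothetical helper. Missing files are skipped rather than raising.
    """
    run_dir = Path(run_dir)
    loaded = {}
    for name in RUN_JSON_FILES:
        path = run_dir / name
        if path.is_file():
            with path.open(encoding="utf-8") as fh:
                loaded[name] = json.load(fh)
    return loaded


def load_reasoning_traces(run_dir):
    """Read reasoning_traces.jsonl as a list of records, one JSON object per line."""
    path = Path(run_dir) / "reasoning_traces.jsonl"
    if not path.is_file():
        return []
    with path.open(encoding="utf-8") as fh:
        # Blank lines are ignored so a trailing newline does not break parsing.
        return [json.loads(line) for line in fh if line.strip()]
```

For the example tree above, load_run_json("outputs/policy-sentiment/a1b2c3d4") would return whatever config.json, timing_data.json, and char_counts.json contain for that run.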

Tidy Metrics Tables

Aggregate metrics are written as two linked tidy tables:

<task>_metrics_log.csv
The long table — one row per (run_id, annotation column, annotation type, metric) with the metric value.
<task>_metrics_log_runs.csv
The runs table — one row per run_id with model, prompt, sampling, chat mode, and reasoning configuration, task paths, inference counts, character counts, energy use, emissions, CPU/GPU metadata, and per-query averages.

This split keeps metric values long and analysis-friendly while avoiding repeated run metadata on every metric row.

Depending on the annotation types in your codebook, metrics can include:

  • accuracy, precision, recall, F1, and percentage agreement
  • Cohen’s kappa and Krippendorff’s alpha
  • Spearman correlation and quadratic weighted kappa for Likert fields
  • normalized Levenshtein similarity and BLEU for textbox fields
  • ROUGE, embedding cosine similarity, and BERTScore when optional textbox extras are installed
  • token F1, exact-match F1, and character IoU for span fields
  • total inference time, average inference time, number of queries, character counts, energy use, and emissions

Linking the two tables

run_id is the key shared by both tables, so you can join metric values to the configuration that produced them with a single merge:

import pandas as pd

metrics = pd.read_csv("outputs/metrics/policy-sentiment_metrics_log.csv")
runs = pd.read_csv("outputs/metrics/policy-sentiment_metrics_log_runs.csv")

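# Many metric rows map to one run row; validate raises if run_id is duplicated in runs.
joined = metrics.merge(runs, on="run_id", validate="many_to_one")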

# e.g. macro-F1 on Direction by model and chat mode
view = joined[
    (joined["column"] == "Policy Sentiment_Direction")
    & (joined["metric"] == "f1_macro")
]
print(view[["model_id", "chat_mode", "reasoning", "use_examples", "value"]])

Because every metric row carries a run_id, you can pivot, group, and compare runs along any configuration dimension — model, prompt, examples, chat mode, reasoning, or sampling — all scored against the same human labels.
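
As one illustration of such a comparison, the sketch below pivots a joined table so that each row is a model and each column a chat mode. The toy rows stand in for real data and their values are made up. The column names follow the example above and are assumptions about the CSV headers, and the chat mode labels "single" and "multi" are placeholders, so check all of them against your own files.

```python
import pandas as pd

# Toy stand-in for the joined table; values are invented for illustration only.
joined = pd.DataFrame(
    {
        "run_id": ["r1", "r2", "r3", "r4"],
        "model_id": ["model-a", "model-a", "model-b", "model-b"],
        "chat_mode": ["single", "multi", "single", "multi"],  # placeholder labels
        "column": ["Policy Sentiment_Direction"] * 4,
        "metric": ["f1_macro"] * 4,
        "value": [0.70, 0.74, 0.65, 0.71],
    }
)

# Keep one metric so the pivot compares like with like.
f1 = joined[joined["metric"] == "f1_macro"]

# One row per model, one column per chat mode; the mean collapses repeated runs.
comparison = f1.pivot_table(
    index="model_id", columns="chat_mode", values="value", aggfunc="mean"
)
print(comparison)
```

Swapping chat_mode for another configuration column gives the same view along a different dimension.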

Choosing Metrics

For checkbox and dropdown annotations, start with F1 and percentage agreement. For imbalanced labels, inspect per-class precision and recall in classification_reports.txt.
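
A small sketch shows why per-class scores matter on imbalanced labels. It uses invented toy labels and a hypothetical helper, per_class_precision_recall, written from the textbook definitions; scoring a class with no predictions as 0.0 is a convention chosen here and may differ from what the tool reports.

```python
from collections import Counter


def per_class_precision_recall(gold, pred):
    """Return {label: (precision, recall)} computed from paired label lists."""
    true_positive = Counter()
    pred_count = Counter(pred)
    gold_count = Counter(gold)
    for g, p in zip(gold, pred):
        if g == p:
            true_positive[g] += 1
    scores = {}
    for label in sorted(set(gold) | set(pred)):
        # A class that is never predicted (or never occurs) scores 0.0 by convention here.
        precision = true_positive[label] / pred_count[label] if pred_count[label] else 0.0
        recall = true_positive[label] / gold_count[label] if gold_count[label] else 0.0
        scores[label] = (precision, recall)
    return scores


# Toy data: 9 "neutral" rows and 1 "negative" row; the model always answers "neutral".
gold = ["neutral"] * 9 + ["negative"]
pred = ["neutral"] * 10

agreement = sum(g == p for g, p in zip(gold, pred)) / len(gold)
scores = per_class_precision_recall(gold, pred)
print(f"percentage agreement: {agreement:.2f}")
for label, (precision, recall) in scores.items():
    print(f"{label}: precision={precision:.2f} recall={recall:.2f}")
```

Agreement is 0.90 here even though the minority class is never recovered, which is the pattern the per-class rows are meant to expose.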

For Likert annotations, report ordinal-aware metrics alongside exact agreement. A model that predicts adjacent categories may be less wrong than one that jumps from one end of the scale to the other.
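
The difference can be made concrete with a small sketch of quadratic weighted kappa, written from the standard definition rather than taken from the tool's source. Ratings are assumed to be coded as integers from 0 to n_categories - 1, and the two toy prediction lists are invented for illustration.

```python
def quadratic_weighted_kappa(gold, pred, n_categories):
    """Cohen's kappa with quadratic weights for ratings coded 0..n_categories-1."""
    n = len(gold)
    observed = [[0] * n_categories for _ in range(n_categories)]
    for g, p in zip(gold, pred):
        observed[g][p] += 1
    gold_hist = [sum(row) for row in observed]
    pred_hist = [sum(row[j] for row in observed) for j in range(n_categories)]

    disagreement = 0.0         # weighted disagreement actually observed
    chance_disagreement = 0.0  # weighted disagreement expected from the marginals
    for i in range(n_categories):
        for j in range(n_categories):
            weight = (i - j) ** 2 / (n_categories - 1) ** 2
            disagreement += weight * observed[i][j]
            chance_disagreement += weight * gold_hist[i] * pred_hist[j] / n
    if chance_disagreement == 0:
        return float("nan")  # undefined when both raters use a single category
    return 1.0 - disagreement / chance_disagreement


def exact_agreement(gold, pred):
    """Share of items where the two ratings are identical."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


gold = [0, 1, 2, 3, 4]      # 5-point scale
adjacent = [1, 2, 3, 4, 3]  # always one step off
far = [4, 4, 0, 0, 0]       # off by two to four steps

for name, pred in [("adjacent", adjacent), ("far", far)]:
    print(name, exact_agreement(gold, pred), round(quadratic_weighted_kappa(gold, pred, 5), 3))
```

Both prediction lists have zero exact agreement with the gold ratings, yet the adjacent list scores about 0.71 on quadratic weighted kappa while the far list scores -0.80.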

For textbox annotations, treat similarity metrics as diagnostics rather than a single ground-truth measure. Short evidence extracts can be semantically equivalent while using different words.
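
To see why a string-level score is only a diagnostic, here is a minimal sketch of normalized Levenshtein similarity. Dividing the edit distance by the longer string's length is one common normalization and is an assumption here; the tool may normalize differently. The two example sentences are invented.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def normalized_similarity(a, b):
    """1 - distance / length of the longer string; 1.0 means identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


human = "taxes are too high"
model = "the tax burden is excessive"
print(round(normalized_similarity(human, model), 3))
```

The two sentences make roughly the same claim but share little surface text, so the score falls short of 1.0 even though a human reader might accept the model's answer.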

For span annotations, token_f1 is the headline score for most comparisons because it gives partial credit for overlap at the token level. Use exact_match_f1 when boundary precision matters and char_iou when near misses should still be visible.
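
The three span scores can disagree on the same prediction, as the sketch below illustrates for a single gold and predicted span. The functions are illustrative implementations written from common definitions (whitespace tokens, half-open character ranges); they are not the tool's code, and how the tool aggregates over several spans per row is not covered here. The sentence and spans are invented.

```python
from collections import Counter


def span_of(text, phrase):
    """Half-open (start, end) character range of the first occurrence of phrase."""
    start = text.index(phrase)
    return (start, start + len(phrase))


def token_overlap_f1(gold_text, pred_text):
    """F1 over whitespace tokens shared by the two span texts."""
    gold_tokens = gold_text.split()
    pred_tokens = pred_text.split()
    overlap = sum((Counter(gold_tokens) & Counter(pred_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def char_range_iou(gold_span, pred_span):
    """Intersection over union of two half-open character ranges."""
    intersection = max(0, min(gold_span[1], pred_span[1]) - max(gold_span[0], pred_span[0]))
    union = (gold_span[1] - gold_span[0]) + (pred_span[1] - pred_span[0]) - intersection
    return intersection / union if union else 0.0


text = "The new tax policy will hurt small businesses badly."
gold_span = span_of(text, "hurt small businesses")
pred_span = span_of(text, "will hurt small businesses")  # one extra leading token

gold_text = text[gold_span[0]:gold_span[1]]
pred_text = text[pred_span[0]:pred_span[1]]

exact = 1.0 if gold_span == pred_span else 0.0
print("exact match:", exact)
print("token F1:", round(token_overlap_f1(gold_text, pred_text), 3))
print("char IoU:", round(char_range_iou(gold_span, pred_span), 3))
```

The prediction is one token too long, so it fails exact match outright while still scoring about 0.86 on token F1 and about 0.81 on character IoU.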