Overview

CloudOpsBench is a 452-scenario Kubernetes root-cause-analysis benchmark published by Wang et al. (arXiv:2603.00468v1, Feb 2026). The paper uses a State Snapshot paradigm: each fault case is a frozen JSON repository served via mocked kubectl-style tool calls, so every evaluation runs against a bit-for-bit identical environment and needs no live cluster. OpenSRE wraps this corpus in a small reusable benchmark framework that adds cost tracking, integrity guards (pre-registration, per-stratum reporting, negative results, COI disclosure), per-LLM dispatch with version pinning, and self-contained markdown + HTML reports. The goal is to publish an opensre+LLM column alongside the paper’s LLM-alone baselines on the same scenarios.
Paper baseline                   opensre+LLM (this benchmark)
──────────────                   ────────────────────────────
DeepSeek-V3.2    0.73   A@1  →   target 0.78+
GPT-5            0.67   A@1  →   target 0.78+
GPT-4o           0.49   A@1  →   target 0.65+
Claude-4-Sonnet  0.50   A@1  →   target 0.65+

What you need to run it

CloudOpsBench needs no live infrastructure. The frozen snapshots are the environment.
# 1. Python 3.12+ (project standard)

# 2. Benchmark-dedicated LLM API keys — keep separate from production opensre keys
export ANTHROPIC_API_KEY=...        # Claude-4-Sonnet via Anthropic direct
export OPENAI_API_KEY=...           # GPT-5, GPT-4o
export DEEPSEEK_API_KEY=...         # DeepSeek-V3.2
export TOGETHER_API_KEY=...         # Qwen (optional)

# 3. Pull the corpus (one-time, a few hundred MB)
make download-cloudopsbench-hf
You do not need: AWS credentials, an EKS cluster, kind/minikube, Bedrock, GPU, Grafana, Datadog, or Prometheus.

Quick start

List adapters

uv run python -m tests.benchmarks._framework.cli list

Validate a config

uv run python -m tests.benchmarks._framework.cli validate \
    tests/benchmarks/configs/example.yml
The config lint catches anti-patterns (runs_per_case < 3, missing pre_registration_path, oversized grids, system-path output_dir). The validate command exits with a non-zero status on any failure.
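The lint rules above can be pictured as a pure function over the parsed config. This is a minimal sketch, not the framework's code: `lint_config`, the `MAX_GRID_CELLS` threshold, the modes-times-llms definition of grid size, and the list of system prefixes are all assumptions made for illustration.

```python
from pathlib import PurePosixPath

# Hypothetical limits; the framework's real thresholds are not documented here.
MIN_RUNS_PER_CASE = 3
MAX_GRID_CELLS = 64
SYSTEM_PREFIXES = ("/etc", "/usr", "/bin", "/var", "/sys")


def lint_config(cfg: dict) -> list[str]:
    """Return human-readable lint failures; an empty list means the config is clean."""
    problems: list[str] = []

    if cfg.get("runs_per_case", 0) < MIN_RUNS_PER_CASE:
        problems.append("runs_per_case < 3")

    if not cfg.get("pre_registration_path"):
        problems.append("missing pre_registration_path")

    # Assumed grid size: number of modes times number of LLMs.
    cells = len(cfg.get("modes", [])) * len(cfg.get("llms", []))
    if cells > MAX_GRID_CELLS:
        problems.append("oversized grid")

    out = PurePosixPath(cfg.get("output_dir") or "")
    if any(out.is_relative_to(prefix) for prefix in SYSTEM_PREFIXES):
        problems.append("system-path output_dir")

    return problems
```

A CLI wrapper would then map the result to an exit status, for example `sys.exit(1 if lint_config(cfg) else 0)`.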

Dev-mode run

--dev skips the integrity gates so you can smoke-test the wiring without writing a pre-registration file. The run ID gets a dev- prefix so dev results can’t be silently promoted.
uv run python -m tests.benchmarks._framework.cli run \
    tests/benchmarks/configs/example.yml --dev

Production run

A production run requires:
  • A pre-registration YAML at pre_registration_path listing per-model expected deltas, committed to git before the run starts (integrity Mechanism 1)
  • seed: set in config (Mechanism 6)
  • Adapter declaration of data_contamination_checked = True (Mechanism 7)
  • At least one validity metric declared by the adapter (Mechanism 3)
uv run python -m tests.benchmarks._framework.cli run \
    tests/benchmarks/configs/cloudopsbench_v1.yml
On completion, the run directory contains report.json (machine-readable), report.md (human-readable summary), report.html (self-contained, no external CSS/JS), and cases/*.json (per-cell artifacts).

Re-render an existing report

uv run python -m tests.benchmarks._framework.cli report \
    .bench-results/example/<run-dir>/

Config reference

benchmark: cloudopsbench

modes:
  - opensre+llm          # opensre wrapping the LLM
  # - llm_alone          # paper provides LLM-alone numbers; rerun only if not trusting them

llms:
  - claude-4-sonnet
  - deepseek-v3.2
  - gpt-5
  - gpt-4o

model_versions:          # pinned to exact provider snapshots
  claude-4-sonnet: claude-sonnet-4-5-20250929
  deepseek-v3.2:   deepseek-chat-v3.2
  gpt-5:           gpt-5-2025-08-07
  gpt-4o:          gpt-4o-2024-11-20

runs_per_case: 3         # replication for variance estimate (Box-Hunter-Hunter Ch 3.4)
workers: 4               # serial across LLMs, parallel within
cost_budget_usd: 1000    # hard cap; run aborts cleanly when exceeded
seed: 42                 # required for reproducible case selection (M6)

filters:                 # optional case subsetting
  systems: [boutique]
  difficulty: [hard, medium]

output_dir: .bench-results/cloudopsbench-v1/
report_formats: [json, markdown, html]
pre_registration_path: tests/benchmarks/configs/preregistrations/v1.yml

Env-var overrides for CI

These let CI override knobs without editing the YAML:
Variable                         Purpose
────────                         ───────
OPENSRE_BENCH_WORKERS            Override workers:
OPENSRE_BENCH_COST_BUDGET_USD    Override cost_budget_usd:
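One plausible way to layer these overrides on top of the parsed YAML is shown below. The variable names come from the table above; the function name `apply_env_overrides` and the int/float coercions are assumptions, and the framework may resolve overrides differently.

```python
import os
from collections.abc import Mapping


def apply_env_overrides(cfg: dict, environ: Mapping[str, str] = os.environ) -> dict:
    """Return a copy of cfg with CI overrides applied when the variables are set."""
    merged = dict(cfg)  # leave the caller's config untouched
    if "OPENSRE_BENCH_WORKERS" in environ:
        merged["workers"] = int(environ["OPENSRE_BENCH_WORKERS"])
    if "OPENSRE_BENCH_COST_BUDGET_USD" in environ:
        merged["cost_budget_usd"] = float(environ["OPENSRE_BENCH_COST_BUDGET_USD"])
    return merged
```

Passing the environment as a parameter keeps the function easy to test without mutating the real process environment.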

Integrity guarantees

The framework defines 11 honest-results mechanisms (M1 through M11) and enforces most of them in code. Outside --dev mode, there is no bypass for the code-enforced checks short of editing the framework itself.

Pre-flight (before any case runs)

IntegrityGuard.pre_flight raises IntegrityViolation if any of these hold:
  • M1 — Pre-registration: pre_registration_path unset, missing, or empty. Forces the engineer to commit expected deltas before seeing results.
  • M3 — Validity metrics: adapter declares no validity metric (no Streetlight Effect).
  • M6 — Seeded selection: seed: is None (no cherry-picking).
  • M7 — Contamination check: adapter has not declared data_contamination_checked = True.
All violations surface in a single exception so the engineer fixes everything in one pass, not one-fix-rerun-discover-next.
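The collect-then-raise behaviour described above can be sketched as follows. This is an illustration under assumptions, not the real `IntegrityGuard`: `AdapterInfo` is a hypothetical stand-in for the adapter's declarations, and the local `IntegrityViolation` class only mirrors the name of the framework's exception.

```python
import os
from dataclasses import dataclass


class IntegrityViolation(Exception):
    """Local stand-in: raised once, carrying every failed pre-flight check."""

    def __init__(self, violations: list[str]):
        self.violations = violations
        super().__init__("; ".join(violations))


@dataclass
class AdapterInfo:
    """Hypothetical view of what an adapter declares."""
    validity_metrics: tuple[str, ...] = ()
    data_contamination_checked: bool = False


def pre_flight(cfg: dict, adapter: AdapterInfo) -> None:
    violations: list[str] = []

    path = cfg.get("pre_registration_path")
    if not path or not os.path.isfile(path) or os.path.getsize(path) == 0:
        violations.append("M1: pre_registration_path unset, missing, or empty")
    if not adapter.validity_metrics:
        violations.append("M3: adapter declares no validity metric")
    if cfg.get("seed") is None:
        violations.append("M6: seed is not set")
    if not adapter.data_contamination_checked:
        violations.append("M7: data_contamination_checked is not True")

    # Report everything at once rather than failing on the first problem.
    if violations:
        raise IntegrityViolation(violations)
```

Accumulating into a list before raising is what lets a single run surface all four problems together.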

Report-validation (before the report is emitted)

IntegrityGuard.report_validation refuses to publish a report if:
  • M3 — Not every adapter-declared metric is in the report
  • M4 — Per-stratum breakdown missing or contains only all (no aggregate-only reporting)
  • M5 — Raw per-case artifacts directory missing
  • M9 — negative_results is empty
  • M10 — coi_disclosure is empty
  • M1 — Pre-registration path not carried into the report
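The report-side checks listed above could look roughly like this. The field names `negative_results`, `coi_disclosure`, and `pre_registration_path` come from this page; the `metrics` and `per_stratum` keys, the function name, and the `cases_dir_exists` flag are assumptions about a report layout that is not specified here.

```python
def report_violations(
    report: dict,
    declared_metrics: set[str],
    cases_dir_exists: bool,
) -> list[str]:
    """Return the list of reasons a report should not be published."""
    problems: list[str] = []

    missing = declared_metrics - set(report.get("metrics", {}))
    if missing:
        problems.append(f"M3: metrics missing from report: {sorted(missing)}")

    strata = set(report.get("per_stratum", {}))
    if not strata or strata == {"all"}:
        problems.append("M4: per-stratum breakdown missing or aggregate-only")

    if not cases_dir_exists:
        problems.append("M5: raw per-case artifacts directory missing")
    if not report.get("negative_results"):
        problems.append("M9: negative_results is empty")
    if not report.get("coi_disclosure"):
        problems.append("M10: coi_disclosure is empty")
    if not report.get("pre_registration_path"):
        problems.append("M1: pre-registration path not carried into the report")

    return problems
```

A caller would refuse to write the report files whenever the returned list is non-empty.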

Two more mechanisms are operational, not code-enforced

  • M8 — External replication of ≥1 cell by a third party before public claim
  • M11 — Blinded LLM-as-judge calibration (BDIL Phase B; tracked separately)

Cost tracking

The framework registers a usage hook on the LLMClient, OpenAILLMClient, and BedrockLLMClient classes in app/services/llm_client.py. Every successful LLM call feeds (model, tokens_in, tokens_out) into a CostTracker. The tracker enforces the configured cost_budget_usd as a hard cap: the next call that would exceed the budget raises CostBudgetExceeded, and the runner halts cleanly with a partial-completion report. Known limitation: per-cell tokens_in, tokens_out, and cost_usd are currently reported as 0 because per-cell delta capture is a follow-up. The aggregate is unaffected, so the total run cost in report.json is accurate.
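To make the hard-cap behaviour concrete, here is a minimal sketch of a tracker fed by a usage hook. It is not the framework's `CostTracker`: the class name `CostTrackerSketch`, the `PRICES` table, and its per-million-token prices are invented for illustration. The sketch also assumes that usage is recorded after the call has already been billed, so the cost is added before the cap is checked.

```python
from dataclasses import dataclass


class CostBudgetExceeded(Exception):
    """Local stand-in mirroring the name of the framework's exception."""


# Hypothetical USD prices per million tokens: (input, output).
PRICES = {"example-model": (1.00, 2.00)}


@dataclass
class CostTrackerSketch:
    budget_usd: float
    spent_usd: float = 0.0

    def record(self, model: str, tokens_in: int, tokens_out: int) -> float:
        """Add one call's cost to the running total and enforce the cap."""
        price_in, price_out = PRICES[model]
        cost = (tokens_in * price_in + tokens_out * price_out) / 1_000_000
        # The call already happened, so its cost always counts toward the total.
        self.spent_usd += cost
        if self.spent_usd > self.budget_usd:
            raise CostBudgetExceeded(
                f"spent ${self.spent_usd:.4f} of ${self.budget_usd:.4f} budget"
            )
        return cost
```

A runner that catches `CostBudgetExceeded` can stop scheduling new cases and still emit a partial-completion report whose total matches `spent_usd`.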

Metrics

Paper’s 13 deterministic metrics plus 3 framework-added validity metrics:
Family                 Metric                                                                        Source
──────                 ──────                                                                        ──────
Outcome                a1, a3, tcr, exact, in_order, any_order                                       Paper § 4.2.1
Process: alignment     rel, cov                                                                      Paper § 4.2.2
Process: efficiency    steps, mtti                                                                   Paper § 4.2.2
Process: robustness    iac, rar, ztdr                                                                Paper § 4.2.2
Validity               citation_grounding_rate, entity_existence_rate, kubectl_actionability_rate    Framework (regex + universe check)
All 16 metrics are deterministic (string / set comparison) — no LLM-as-judge at evaluation time.
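As an example of what deterministic scoring means here, a top-k accuracy such as a1 or a3 can be computed with plain string membership. The sketch assumes that A@k is the fraction of cases whose ground-truth root cause appears among the first k ranked predictions. The function name and input shapes are hypothetical, and any label normalisation the paper applies is not reproduced.

```python
def accuracy_at_k(ranked_predictions: list[list[str]], truths: list[str], k: int) -> float:
    """Fraction of cases whose true root cause is in the top-k predictions."""
    if len(ranked_predictions) != len(truths):
        raise ValueError("one prediction list is required per case")
    if not truths:
        return 0.0
    hits = sum(
        truth in preds[:k]  # exact string match, no LLM judge involved
        for preds, truth in zip(ranked_predictions, truths)
    )
    return hits / len(truths)
```

Because the comparison is pure string membership, rerunning the scorer on the same per-case artifacts always yields the same number.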

Existing production entry points

make test-cloudopsbench and opensre tests cloudopsbench route through tests/benchmarks/cloudopsbench/run_suite.py, which is the legacy imperative-CLI surface. The framework runner is the new YAML-config surface and coexists with it during the transition. Both call into the same adapter, scoring code, and replay backend.

Reference