# CloudOpsBench benchmark

> Run opensre+LLM against the 452-scenario Cloud-OpsBench corpus (Wang et al., arXiv:2603.00468v1) and compare to published LLM-alone baselines.

## Overview

CloudOpsBench is a 452-scenario Kubernetes root-cause-analysis benchmark published by Wang et al. ([arXiv:2603.00468v1](https://arxiv.org/abs/2603.00468), Feb 2026). The paper uses a *State Snapshot* paradigm: each fault case is a frozen JSON repository served via mocked `kubectl`-style tool calls, so every evaluation runs against a bit-for-bit identical environment and needs no live cluster.

OpenSRE wraps this corpus through a small reusable benchmark framework that adds cost tracking, integrity guards (pre-registration, per-stratum reporting, negative results, COI disclosure), per-LLM dispatch with version pinning, and self-contained markdown + HTML reports. The goal is to publish the opensre+LLM column against the paper's LLM-alone baselines on the same scenarios.

```text
Paper baseline                 opensre+LLM (this benchmark)
─────────────                  ──────────────────────────
DeepSeek-V3.2    0.73 A@1   →  target 0.78+
GPT-5            0.67 A@1   →  target 0.78+
GPT-4o           0.49 A@1   →  target 0.65+
Claude-4-Sonnet  0.50 A@1   →  target 0.65+
```

## What you need to run it

CloudOpsBench needs **no live infrastructure**. The frozen snapshots are the environment.

```bash
# 1. Python 3.12+ (project standard)

# 2. Benchmark-dedicated LLM API keys; keep separate from production opensre keys
export ANTHROPIC_API_KEY=...   # Claude-4-Sonnet via Anthropic direct
export OPENAI_API_KEY=...      # GPT-5, GPT-4o
export DEEPSEEK_API_KEY=...    # DeepSeek-V3.2

# 3. Pull the corpus (one-time, a few hundred MB)
make download-cloudopsbench-hf
```

You do **not** need: AWS credentials, an EKS cluster, kind/minikube, Bedrock, GPU, Grafana, Datadog, or Prometheus.
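
Before spending time on a run, it can help to confirm that the keys exported above are actually visible to the process. The following is a minimal sketch, assuming all three providers are in use; `missing_api_keys` is a hypothetical helper and not part of the opensre framework.

```python
import os
from collections.abc import Mapping

# Key names come from the setup block above; the helper itself is hypothetical.
REQUIRED_KEYS = ("ANTHROPIC_API_KEY", "OPENAI_API_KEY", "DEEPSEEK_API_KEY")


def missing_api_keys(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the required key names that are unset or blank in `env`."""
    return [name for name in REQUIRED_KEYS if not env.get(name, "").strip()]
```

Calling `missing_api_keys()` with no arguments inspects the current environment; an empty list means every key is set. If your config lists only a subset of the models, you would trim `REQUIRED_KEYS` accordingly.
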

## Quick start

### List adapters

```bash
uv run python -m tests.benchmarks._framework.cli list
```

### Validate a config

```bash
uv run python -m tests.benchmarks._framework.cli validate \
  tests/benchmarks/configs/cloudopsbench_smoke.yml
```

The config lint catches anti-patterns (`runs_per_case < 3`, missing `pre_registration_path`, oversized grids, system-path `output_dir`). Validation returns non-zero on any failure.

### Dev-mode run

`--dev` skips the integrity gates so you can smoke-test the wiring without writing a pre-registration file. The run ID gets a `dev-` prefix so dev results can't be silently promoted.

```bash
uv run python -m tests.benchmarks._framework.cli run \
  tests/benchmarks/configs/cloudopsbench_smoke.yml --dev
```

### Production run

A production run requires:

* A **pre-registration YAML** at `pre_registration_path` listing per-model expected deltas, committed to git before the run starts (integrity Mechanism 1)
* `seed:` set in config (Mechanism 6)
* Adapter declaration of `data_contamination_checked = True` (Mechanism 7)
* At least one validity metric declared by the adapter (Mechanism 3)

```bash
uv run python -m tests.benchmarks._framework.cli run \
  tests/benchmarks/configs/cloudopsbench_v1.yml
```

On completion, the run directory contains `report.json` (machine-readable), `report.md` (human-readable summary), `report.html` (self-contained, no external CSS/JS), and `cases/*.json` (per-cell artifacts).
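
A quick way to confirm that a run finished with all of these outputs is to check the run directory for them. The sketch below assumes only the file names listed above; `missing_artifacts` is a hypothetical helper and not part of the framework, and it does not inspect the contents of any file.

```python
from pathlib import Path

# File names are taken from the paragraph above; this checker is hypothetical.
EXPECTED_FILES = ("report.json", "report.md", "report.html")


def missing_artifacts(run_dir: Path) -> list[str]:
    """List expected run outputs that are absent from `run_dir`."""
    missing = [name for name in EXPECTED_FILES if not (run_dir / name).is_file()]
    # cases/*.json holds the per-cell artifacts; require at least one file.
    cases_dir = run_dir / "cases"
    if not (cases_dir.is_dir() and any(cases_dir.glob("*.json"))):
        missing.append("cases/*.json")
    return missing
```

An empty return value means every expected artifact is present. If your config narrows `report_formats`, you would drop the corresponding names from `EXPECTED_FILES`.
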

### Re-render an existing report

```bash
uv run python -m tests.benchmarks._framework.cli report \
  .bench-results/example//
```

## Config reference

```yaml
benchmark: cloudopsbench

modes:
  - opensre+llm       # opensre wrapping the LLM
  # - llm_alone       # paper provides LLM-alone numbers; rerun only if not trusting them

llms:
  - claude-4-sonnet
  - deepseek-v3.2
  - gpt-5
  - gpt-4o

model_versions:       # pinned to exact provider snapshots
  claude-4-sonnet: claude-sonnet-4-5-20250929
  deepseek-v3.2: deepseek-chat-v3.2
  gpt-5: gpt-5-2025-08-07
  gpt-4o: gpt-4o-2024-11-20

runs_per_case: 3        # replication for variance estimate (Box-Hunter-Hunter Ch 3.4)
workers: 4              # serial across LLMs, parallel within
cost_budget_usd: 1000   # hard cap; run aborts cleanly when exceeded
seed: 42                # required for reproducible case selection (M6)

filters:                # optional case subsetting
  systems: [boutique]
  difficulty: [hard, medium]

output_dir: .bench-results/cloudopsbench-v1/
report_formats: [json, markdown, html]
pre_registration_path: tests/benchmarks/configs/preregistrations/v1.yml
```

### Env-var overrides for CI

These let CI override knobs without editing the YAML:

| Variable                        | Purpose                     |
| ------------------------------- | --------------------------- |
| `OPENSRE_BENCH_WORKERS`         | Override `workers:`         |
| `OPENSRE_BENCH_COST_BUDGET_USD` | Override `cost_budget_usd:` |

## Integrity guarantees

The framework defines 11 honest-results mechanisms (M1 to M11), most of them enforced at the code level. For production runs there is no bypass short of editing the framework itself; `--dev` runs skip the gates and are marked with a `dev-` run ID prefix.

### Pre-flight (before any case runs)

`IntegrityGuard.pre_flight` raises `IntegrityViolation` if any of these hold:

* **M1: Pre-registration.** `pre_registration_path` is unset, missing, or empty. This forces the engineer to commit expected deltas before seeing results.
* **M3: Validity metrics.** The adapter declares no validity metric (guards against the Streetlight Effect).
* **M6: Seeded selection.** `seed:` is `None` (guards against cherry-picking).
* **M7: Contamination check.** The adapter has not declared `data_contamination_checked = True`.

All violations surface in a single exception, so the engineer can fix everything in one pass rather than through repeated fix-and-rerun cycles.

### Report-validation (before the report is emitted)

`IntegrityGuard.report_validation` refuses to publish a report if:

* **M3**: Not every adapter-declared metric is in the report
* **M4**: The per-stratum breakdown is missing or contains only `all` (no aggregate-only reporting)
* **M5**: The raw per-case artifacts directory is missing
* **M9**: `negative_results` is empty
* **M10**: `coi_disclosure` is empty
* **M1**: The pre-registration path is not carried into the report

### Two mechanisms are operational, not code-enforced

* **M8: External replication** of ≥1 cell by a third party before public claim
* **M11: Blinded LLM-as-judge calibration** (BDIL Phase B; tracked separately)

## Cost tracking

The framework registers a usage hook on `core/llm/llm_client.py`'s `LLMClient`, `OpenAILLMClient`, and `BedrockLLMClient`. Every successful LLM call feeds `(model, tokens_in, tokens_out)` into a `CostTracker`. The tracker enforces the configured `cost_budget_usd` as a hard cap: the next call that would exceed the budget raises `CostBudgetExceeded`, and the runner halts cleanly with a partial-completion report.

Per-cell `tokens_in / tokens_out / cost_usd` is currently reported as 0; per-cell delta capture is a follow-up. The aggregate total run cost in `report.json` is accurate.
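
The hard-cap behaviour described above can be illustrated with a small stand-alone tracker. This is a minimal sketch under stated assumptions, not the framework's implementation: the class layout, the `record` method, and the price table are all hypothetical, and it assumes a call's cost is added to the total before the cap is checked, since the hook fires after the call has already succeeded.

```python
from dataclasses import dataclass

# Hypothetical USD prices per million tokens (input, output). Real prices
# differ by provider and change over time; these are placeholders.
PRICES_PER_MTOK = {"example-model": (1.00, 2.00)}


class CostBudgetExceeded(RuntimeError):
    """Raised when cumulative spend passes the configured budget."""


@dataclass
class CostTracker:
    budget_usd: float
    total_usd: float = 0.0

    def record(self, model: str, tokens_in: int, tokens_out: int) -> float:
        """Add one call's cost to the running total, then enforce the cap."""
        price_in, price_out = PRICES_PER_MTOK[model]
        cost = (tokens_in * price_in + tokens_out * price_out) / 1_000_000
        # The call has already happened, so its cost always counts.
        self.total_usd += cost
        if self.total_usd > self.budget_usd:
            raise CostBudgetExceeded(
                f"spent ${self.total_usd:.4f} of ${self.budget_usd:.4f} budget"
            )
        return cost
```

A runner built on this shape would catch `CostBudgetExceeded`, stop scheduling further cases, and write out whatever results it has so far.
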

## Metrics

The paper's 13 deterministic metrics plus 3 framework-added validity metrics:

| Family               | Metric                                                                       | Source                             |
| -------------------- | ---------------------------------------------------------------------------- | ---------------------------------- |
| Outcome              | `a1, a3, tcr, exact, in_order, any_order`                                    | Paper § 4.2.1                      |
| Process: alignment   | `rel, cov`                                                                   | Paper § 4.2.2                      |
| Process: efficiency  | `steps, mtti`                                                                | Paper § 4.2.2                      |
| Process: robustness  | `iac, rar, ztdr`                                                             | Paper § 4.2.2                      |
| Validity             | `citation_grounding_rate, entity_existence_rate, kubectl_actionability_rate` | Framework (regex + universe check) |

All 16 metrics are deterministic (string / set comparison); no LLM-as-judge is used at evaluation time.

## Existing production entry points

`make test-cloudopsbench` and `opensre tests cloudopsbench` route through `tests/benchmarks/cloudopsbench/run_suite.py`, which is the legacy imperative-CLI surface. The framework runner is the new YAML-config surface and coexists with it during the transition. Both call into the same adapter, scoring code, and replay backend.

## Reference

* Paper: Wang et al., *Cloud-OpsBench: A Reproducible Benchmark for Agentic Root Cause Analysis in Cloud Systems*, [arXiv:2603.00468v1](https://arxiv.org/abs/2603.00468), 28 Feb 2026 ([GitHub](https://github.com/LLM4Ops/Cloud-OpsBench))
* HF dataset: [`tracer-cloud/cloud-ops-bench-dataset`](https://huggingface.co/datasets/tracer-cloud/cloud-ops-bench-dataset)
* Framework source: [`tests/benchmarks/_framework/`](https://github.com/Tracer-Cloud/opensre/tree/main/tests/benchmarks/_framework)
* Adapter source: [`tests/benchmarks/cloudopsbench/`](https://github.com/Tracer-Cloud/opensre/tree/main/tests/benchmarks/cloudopsbench)