# CloudOpsBench benchmark

> Run opensre+LLM against the 452-scenario Cloud-OpsBench corpus (Wang et al., arXiv:2603.00468v1) and compare the results to the published LLM-alone baselines.

## Overview

CloudOpsBench is a 452-scenario Kubernetes root-cause-analysis benchmark
published by Wang et al. ([arXiv:2603.00468v1](https://arxiv.org/abs/2603.00468),
Feb 2026). The paper uses a *State Snapshot* paradigm: each fault case is
a frozen JSON repository served via mocked `kubectl`-style tool calls, so
the evaluation environment is bit-for-bit reproducible and needs no live cluster.
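
To make the snapshot idea concrete, here is a minimal Python sketch of a tool that answers `kubectl`-style lookups from a frozen JSON document. The class name `SnapshotKubectl`, the `resource -> "namespace/name"` layout, and the sample pod are assumptions for illustration; the corpus's actual schema and the adapter's replay backend may look different.

```python
import json
from pathlib import Path
from typing import Any


class SnapshotKubectl:
    """Answers kubectl-style queries from a frozen snapshot (hypothetical layout)."""

    def __init__(self, snapshot: dict[str, Any]) -> None:
        # Assumed layout: {"pods": {"<namespace>/<name>": {...}}, ...}
        self._snapshot = snapshot

    @classmethod
    def from_file(cls, path: Path) -> "SnapshotKubectl":
        return cls(json.loads(path.read_text(encoding="utf-8")))

    def get(self, resource: str, namespace: str, name: str) -> dict[str, Any]:
        """Mimic `kubectl get <resource> <name> -n <namespace> -o json`."""
        try:
            return self._snapshot[resource][f"{namespace}/{name}"]
        except KeyError:
            # Return the not-found condition as data so the agent can react to it.
            return {"error": f'{resource} "{name}" not found in namespace "{namespace}"'}


# Made-up snapshot content, for illustration only.
snapshot = {"pods": {"default/cartservice-0": {"status": {"phase": "Pending"}}}}
tool = SnapshotKubectl(snapshot)

first = tool.get("pods", "default", "cartservice-0")
second = tool.get("pods", "default", "cartservice-0")
missing = tool.get("pods", "default", "does-not-exist")
```

Because every answer is a pure lookup into static data, repeating a tool call returns the same result, which is what removes the need for a live cluster.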

OpenSRE wraps this corpus through a small reusable benchmark framework
that adds cost tracking, integrity guards (pre-registration, per-stratum
reporting, negative results, COI disclosure), per-LLM dispatch with
version pinning, and self-contained markdown + HTML reports. The goal
is to publish the opensre+LLM column against the paper's LLM-alone
baselines on the same scenarios.

```text theme={null}
Paper baseline                   opensre+LLM (this benchmark)
──────────────                   ────────────────────────────
DeepSeek-V3.2    0.73   A@1  →   target 0.78+
GPT-5            0.67   A@1  →   target 0.78+
GPT-4o           0.49   A@1  →   target 0.65+
Claude-4-Sonnet  0.50   A@1  →   target 0.65+
```

## What you need to run it

CloudOpsBench needs **no live infrastructure**. The frozen snapshots are
the environment.

```bash theme={null}
# 1. Python 3.12+ (project standard)

# 2. Benchmark-dedicated LLM API keys — keep separate from production opensre keys
export ANTHROPIC_API_KEY=...        # Claude-4-Sonnet via Anthropic direct
export OPENAI_API_KEY=...           # GPT-5, GPT-4o
export DEEPSEEK_API_KEY=...         # DeepSeek-V3.2

# 3. Pull the corpus (one-time, a few hundred MB)
make download-cloudopsbench-hf
```

You do **not** need: AWS credentials, an EKS cluster, kind/minikube,
Bedrock, GPU, Grafana, Datadog, or Prometheus.

## Quick start

### List adapters

```bash theme={null}
uv run python -m tests.benchmarks._framework.cli list
```

### Validate a config

```bash theme={null}
uv run python -m tests.benchmarks._framework.cli validate \
    tests/benchmarks/configs/cloudopsbench_smoke.yml
```

The config lint catches anti-patterns (`runs_per_case < 3`, missing
`pre_registration_path`, oversized grids, system-path `output_dir`).
Validation returns non-zero on any failure.
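
As a rough illustration of what such a lint does, the Python sketch below checks the four anti-patterns named above and maps any finding to a non-zero exit code. The function name `lint_config`, the grid limit, the way grid size is computed, and the list of system directories are all assumptions; the framework's own rules and thresholds may differ.

```python
from pathlib import PurePosixPath
from typing import Any

# Hypothetical thresholds; only the "< 3" rule for runs_per_case comes from the docs.
MIN_RUNS_PER_CASE = 3
MAX_GRID_CELLS = 64
SYSTEM_DIRS = {"etc", "usr", "bin", "var", "sys"}


def lint_config(config: dict[str, Any]) -> list[str]:
    """Return one message per anti-pattern; an empty list means the config passes."""
    problems: list[str] = []

    if config.get("runs_per_case", 0) < MIN_RUNS_PER_CASE:
        problems.append(f"runs_per_case must be >= {MIN_RUNS_PER_CASE}")

    if not config.get("pre_registration_path"):
        problems.append("pre_registration_path is missing")

    # Grid size is taken as modes x llms here; the real lint may count more dimensions.
    grid = len(config.get("modes", [])) * len(config.get("llms", []))
    if grid > MAX_GRID_CELLS:
        problems.append(f"grid has {grid} cells, limit is {MAX_GRID_CELLS}")

    output_dir = PurePosixPath(str(config.get("output_dir", "")))
    parts = output_dir.parts
    if output_dir.is_absolute() and len(parts) > 1 and parts[1] in SYSTEM_DIRS:
        problems.append(f"output_dir points at a system path: {output_dir}")

    return problems


bad = {"runs_per_case": 1, "modes": ["opensre+llm"], "llms": ["gpt-5"],
       "output_dir": "/etc/bench"}
good = {**bad, "runs_per_case": 3, "pre_registration_path": "prereg/v1.yml",
        "output_dir": ".bench-results/cloudopsbench-v1/"}

exit_code = 1 if lint_config(bad) else 0  # non-zero on any failure
```

Collecting messages into a list, instead of stopping at the first one, lets a single `validate` call report every problem in the file.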

### Dev-mode run

`--dev` skips the integrity gates so you can smoke-test the wiring
without writing a pre-registration file. The run ID gets a `dev-`
prefix so dev results can't be silently promoted.

```bash theme={null}
uv run python -m tests.benchmarks._framework.cli run \
    tests/benchmarks/configs/cloudopsbench_smoke.yml --dev
```

### Production run

A production run requires:

* A **pre-registration YAML** at `pre_registration_path` listing per-model
  expected deltas, committed to git before the run starts (integrity
  Mechanism 1)
* `seed:` set in config (Mechanism 6)
* Adapter declaration of `data_contamination_checked = True` (Mechanism 7)
* At least one validity metric declared by the adapter (Mechanism 3)

```bash theme={null}
uv run python -m tests.benchmarks._framework.cli run \
    tests/benchmarks/configs/cloudopsbench_v1.yml
```

On completion, the run directory contains `report.json` (machine-readable),
`report.md` (human-readable summary), `report.html` (self-contained, no
external CSS/JS), and `cases/*.json` (per-cell artifacts).

### Re-render an existing report

```bash theme={null}
uv run python -m tests.benchmarks._framework.cli report \
    .bench-results/example/<run-dir>/
```

## Config reference

```yaml theme={null}
benchmark: cloudopsbench

modes:
  - opensre+llm          # opensre wrapping the LLM
  # - llm_alone          # paper provides LLM-alone numbers; rerun only if you do not trust them

llms:
  - claude-4-sonnet
  - deepseek-v3.2
  - gpt-5
  - gpt-4o

model_versions:          # pinned to exact provider snapshots
  claude-4-sonnet: claude-sonnet-4-5-20250929
  deepseek-v3.2:   deepseek-chat-v3.2
  gpt-5:           gpt-5-2025-08-07
  gpt-4o:          gpt-4o-2024-11-20

runs_per_case: 3         # replication for variance estimate (Box-Hunter-Hunter Ch 3.4)
workers: 4               # serial across LLMs, parallel within
cost_budget_usd: 1000    # hard cap; run aborts cleanly when exceeded
seed: 42                 # required for reproducible case selection (M6)

filters:                 # optional case subsetting
  systems: [boutique]
  difficulty: [hard, medium]

output_dir: .bench-results/cloudopsbench-v1/
report_formats: [json, markdown, html]
pre_registration_path: tests/benchmarks/configs/preregistrations/v1.yml
```

### Env-var overrides for CI

These let CI override knobs without editing the YAML:

| Variable                        | Purpose                     |
| ------------------------------- | --------------------------- |
| `OPENSRE_BENCH_WORKERS`         | Override `workers:`         |
| `OPENSRE_BENCH_COST_BUDGET_USD` | Override `cost_budget_usd:` |
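
One plausible way to apply these overrides on top of the parsed YAML is sketched below in Python. The variable names and the keys they override come from the table; the helper `apply_env_overrides` and the choice of `int` and `float` casts are assumptions.

```python
import os
from typing import Any, Callable, Mapping

# Environment variable -> (config key, cast). The casts are assumed.
OVERRIDES: dict[str, tuple[str, Callable[[str], Any]]] = {
    "OPENSRE_BENCH_WORKERS": ("workers", int),
    "OPENSRE_BENCH_COST_BUDGET_USD": ("cost_budget_usd", float),
}


def apply_env_overrides(
    config: dict[str, Any], env: Mapping[str, str] = os.environ
) -> dict[str, Any]:
    """Return a copy of config with any set override variables applied."""
    merged = dict(config)
    for variable, (key, cast) in OVERRIDES.items():
        raw = env.get(variable)
        if raw is not None and raw.strip():
            merged[key] = cast(raw)  # raises ValueError on a malformed value
    return merged


base = {"workers": 4, "cost_budget_usd": 1000}
ci = apply_env_overrides(base, env={"OPENSRE_BENCH_WORKERS": "8"})
```

Returning a copy leaves the YAML-derived config untouched, which makes it easy to log both the file values and the effective values for a CI run.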

## Integrity guarantees

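The framework defines 11 honest-results mechanisms and enforces most of them
at the code level. On a production run there is no bypass for the code-enforced
checks short of editing the framework itself; `--dev` runs skip the gates, but
their `dev-` run-ID prefix keeps them from being silently promoted.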

### Pre-flight (before any case runs)

`IntegrityGuard.pre_flight` raises `IntegrityViolation` if any of these
hold:

* **M1 — Pre-registration**: `pre_registration_path` unset, missing, or
  empty. Forces the engineer to commit expected deltas before seeing results.
* **M3 — Validity metrics**: adapter declares no validity metric (no
  streetlight effect: scoring only what is easy to measure).
* **M6 — Seeded selection**: `seed:` is `None` (no cherry-picking).
* **M7 — Contamination check**: adapter has not declared
  `data_contamination_checked = True`.

All violations are collected into a single exception, so the engineer can fix
everything in one pass instead of fixing one, rerunning, and discovering the next.

### Report-validation (before the report is emitted)

`IntegrityGuard.report_validation` refuses to publish a report if:

* **M3** — Not every adapter-declared metric is in the report
* **M4** — Per-stratum breakdown missing or contains only `all` (no
  aggregate-only reporting)
* **M5** — Raw per-case artifacts directory missing
* **M9** — `negative_results` is empty
* **M10** — `coi_disclosure` is empty
* **M1** — Pre-registration path not carried into the report
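
For a sense of how these gates could be evaluated against a finished run, here is a hedged Python sketch that returns the IDs of the mechanisms that would block publication. The report keys `metrics` and `strata`, the `cases/` directory check, and the function name `report_problems` are assumptions about the report shape, not the framework's confirmed schema.

```python
import tempfile
from pathlib import Path
from typing import Any


def report_problems(
    report: dict[str, Any], declared_metrics: set[str], run_dir: Path
) -> list[str]:
    """Return the mechanism IDs that would block publication (assumed report shape)."""
    failed: list[str] = []
    if not report.get("pre_registration_path"):
        failed.append("M1")
    if not declared_metrics <= set(report.get("metrics", {})):
        failed.append("M3")
    if not set(report.get("strata", {})) - {"all"}:  # missing, or aggregate-only
        failed.append("M4")
    if not (run_dir / "cases").is_dir():
        failed.append("M5")
    if not report.get("negative_results"):
        failed.append("M9")
    if not report.get("coi_disclosure"):
        failed.append("M10")
    return failed


with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp)
    (run_dir / "cases").mkdir()
    # Placeholder values only; this is not real benchmark output.
    aggregate_only = {
        "pre_registration_path": "prereg/v1.yml",
        "metrics": {"a1": 0.5},
        "strata": {"all": {"a1": 0.5}},
        "negative_results": ["placeholder negative result"],
        "coi_disclosure": "placeholder disclosure",
    }
    blocked_by = report_problems(aggregate_only, {"a1", "entity_existence_rate"}, run_dir)
```

The example report omits one declared metric and carries only the `all` stratum, so it is blocked on M3 and M4 while passing the other checks.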

### Two more mechanisms are operational, not code-enforced

* **M8 — External replication** of ≥1 cell by a third party before public
  claim
* **M11 — Blinded LLM-as-judge calibration** (BDIL Phase B; tracked
  separately)

## Cost tracking

The framework registers a usage hook on `core/llm/llm_client.py`'s
`LLMClient`, `OpenAILLMClient`, and `BedrockLLMClient`. Every successful
LLM call feeds `(model, tokens_in, tokens_out)` into a `CostTracker`.
The tracker enforces the configured `cost_budget_usd` as a hard cap —
the next call that would exceed budget raises `CostBudgetExceeded` and
the runner halts cleanly with a partial-completion report.
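
A simplified Python sketch of a hard budget cap is shown below. `CostTracker` and `CostBudgetExceeded` are named on this page, but the method names, the price table, and the check-before-dispatch strategy are assumptions; in particular, this version blocks the first call made after the cap is reached, since the cost of a call is only known once it returns.

```python
from dataclasses import dataclass


class CostBudgetExceeded(RuntimeError):
    """Raised when the run has used up its budget."""


# Hypothetical prices in USD per million tokens (input, output); not real pricing.
PRICES = {"example-model": (1.00, 4.00)}


@dataclass
class CostTracker:
    budget_usd: float
    spent_usd: float = 0.0

    def check_budget(self) -> None:
        """Call before dispatching; refuses the next call once the cap is reached."""
        if self.spent_usd >= self.budget_usd:
            raise CostBudgetExceeded(
                f"spent ${self.spent_usd:.4f} of ${self.budget_usd:.4f}"
            )

    def record(self, model: str, tokens_in: int, tokens_out: int) -> None:
        """Usage hook: called once per successful LLM call."""
        price_in, price_out = PRICES[model]
        self.spent_usd += (tokens_in * price_in + tokens_out * price_out) / 1_000_000


tracker = CostTracker(budget_usd=0.01)
completed = 0
try:
    for _ in range(10):
        tracker.check_budget()
        tracker.record("example-model", tokens_in=2_000, tokens_out=1_000)
        completed += 1
except CostBudgetExceeded:
    halted = True
else:
    halted = False
```

Catching the exception at the loop level is what lets a runner stop dispatching and still write a report for the cells that did complete.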

Per-cell `tokens_in`, `tokens_out`, and `cost_usd` are currently reported as 0;
per-cell delta capture is a follow-up. The aggregate run cost in `report.json`
is accurate.

## Metrics

The benchmark reports the paper's 13 deterministic metrics plus 3 framework-added validity metrics:

| Family               | Metric                                                                       | Source                             |
| -------------------- | ---------------------------------------------------------------------------- | ---------------------------------- |
| Outcome              | `a1, a3, tcr, exact, in_order, any_order`                                    | Paper § 4.2.1                      |
| Process — alignment  | `rel, cov`                                                                   | Paper § 4.2.2                      |
| Process — efficiency | `steps, mtti`                                                                | Paper § 4.2.2                      |
| Process — robustness | `iac, rar, ztdr`                                                             | Paper § 4.2.2                      |
| Validity             | `citation_grounding_rate, entity_existence_rate, kubectl_actionability_rate` | Framework (regex + universe check) |

All 16 metrics are deterministic (string / set comparison) — no LLM-as-judge
at evaluation time.
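
To show what "deterministic" means in practice, the Python sketch below scores one outcome metric and one validity metric with plain string and set operations. It assumes `a1` and `a3` denote top-k accuracy over a ranked list of root-cause candidates and that entities are cited as `kind/name`; the paper's § 4.2.1 definitions and the framework's actual regex take precedence over this interpretation.

```python
import re


def _normalize(label: str) -> str:
    return label.strip().lower()


def accuracy_at_k(ranked_predictions: list[str], ground_truth: str, k: int) -> float:
    """1.0 if the true root cause is among the top-k predictions, else 0.0."""
    top_k = {_normalize(p) for p in ranked_predictions[:k]}
    return float(_normalize(ground_truth) in top_k)


# Assumed citation convention: kind/name, e.g. pod/cartservice-0.
ENTITY_PATTERN = re.compile(
    r"\b(?:pod|deployment|service|node)/[a-z0-9](?:[a-z0-9.-]*[a-z0-9])?"
)


def entity_existence_rate(answer: str, universe: set[str]) -> float:
    """Share of entities mentioned in the answer that exist in the snapshot."""
    mentioned = set(ENTITY_PATTERN.findall(answer))
    if not mentioned:
        return 0.0  # design choice for this sketch: no citations scores zero
    return len(mentioned & universe) / len(mentioned)


# Made-up predictions and answer text, for illustration only.
predictions = ["network-policy", "oom-kill", "image-pull"]
answer = "pod/cartservice-0 fails because service/redis-cart is gone; pod/ghost-1 too."
universe = {"pod/cartservice-0", "service/redis-cart"}

top1 = accuracy_at_k(predictions, "OOM-Kill", k=1)
top3 = accuracy_at_k(predictions, "OOM-Kill", k=3)
rate = entity_existence_rate(answer, universe)
```

Neither function calls a model, so rescoring the same stored outputs always produces the same numbers.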

## Existing production entry points

`make test-cloudopsbench` and `opensre tests cloudopsbench` route through
`tests/benchmarks/cloudopsbench/run_suite.py`, which is the legacy
imperative-CLI surface. The framework runner is the new YAML-config
surface and coexists with it during the transition. Both call into the
same adapter, scoring code, and replay backend.

## Reference

* Paper: Wang et al., *Cloud-OpsBench: A Reproducible Benchmark for
  Agentic Root Cause Analysis in Cloud Systems*,
  [arXiv:2603.00468v1](https://arxiv.org/abs/2603.00468), 28 Feb 2026 —
  [GitHub](https://github.com/LLM4Ops/Cloud-OpsBench)
* HF dataset: [`tracer-cloud/cloud-ops-bench-dataset`](https://huggingface.co/datasets/tracer-cloud/cloud-ops-bench-dataset)
* Framework source: [`tests/benchmarks/_framework/`](https://github.com/Tracer-Cloud/opensre/tree/main/tests/benchmarks/_framework)
* Adapter source: [`tests/benchmarks/cloudopsbench/`](https://github.com/Tracer-Cloud/opensre/tree/main/tests/benchmarks/cloudopsbench)
