Overview
The Airflow integration enables OpenSRE to investigate DAG failures and extract execution context directly from an Apache Airflow instance. It supports:- DAG run inspection
- Task instance retrieval
- Failure detection
- Evidence collection for RCA generation
Architecture
The Airflow integration participates in the investigation pipeline as follows:- Alert ingestion
- Planner selects relevant tools
- Airflow API is queried
- Evidence is collected
- RCA is generated
Configuration
Required Environment Variables
Setup Example
Start Airflow locally:Investigation Flow
Run the investigation CLI:Capabilities
| Capability | Description |
|---|---|
| List DAG runs | Fetch execution history |
| Get task instances | Inspect task-level failures |
| Detect failures | Identify recent failing runs |
| RCA support | Provide structured evidence for root cause analysis |
Planner Behavior
Whensource = airflow, the planner:
- Prioritizes Airflow-related actions
- Seeds Airflow tools into the action space
- Tool selection is LLM-driven
- Exact ordering may vary between runs
Error Handling
- Per-run failures are isolated — one failing request does not break the loop
- Network/API errors are handled defensively
- Partial evidence is preserved whenever possible
Testing
E2E Tests
Routing Tests
Limitations
- Planner routing is probabilistic (LLM-based)
- Requires a reachable Airflow instance
- No CI-backed Airflow instance by default (local validation required)
Design Notes
- Integration follows the same contract as other sources (Datadog, Grafana, etc.)
- Uses env-based configuration for simplicity
- Avoids introducing hard overrides in planning logic
- Focuses on evidence-driven investigation, not static rules
Future Work
- Stronger tool routing guarantees
- CI-backed disposable Airflow instance for e2e tests
- Deeper DAG dependency analysis
- Richer RCA explanations
Tracer