Compliance-Oriented Incident Command

Incident command for teams that answer to compliance.

Monitoring tools detect failures.
AlertEngine records how humans respond to them.

Authorized. Audited. Replayable. Nothing executes without your explicit approval — and every decision is recorded for auditors.

📋 Policy: Deterministic rules, no AI →
🧠 Diagnose: AI explains root cause →
💬 Alert: WhatsApp or Telegram →
✅ Authorize: Engineer taps approve →
⚡ Execute: Your webhook runs →
📒 Audit: Immutable ledger entry

$ pip install fastapi-alertengine

View on GitHub · Request managed pilot

No autonomous remediation · No alert fatigue · No silent failures · No dashboards required

Governance Model

Policy is the floor. AI is the ceiling.

The hierarchy is enforced by architecture, not convention. Claude cannot trigger a state transition. Policy can override Claude. The audit log proves every decision.

Detection

Deterministic policy rules evaluate health score, P95 latency, and error rate. No AI involved. Policy decides whether an incident exists.

Actor: policy
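A minimal sketch of what a deterministic policy gate could look like. The threshold values, the `Snapshot` type, and the `incident_exists` name are assumptions made for illustration; the real thresholds live in the orchestrator's versioned policy and are not shown on this page.

```python
from dataclasses import dataclass

# Hypothetical thresholds, for illustration only.
POLICY = {
    "version": "example-1",
    "min_health_score": 50,
    "max_p95_ms": 1000.0,
    "max_error_rate": 0.05,
}


@dataclass(frozen=True)
class Snapshot:
    health_score: int
    p95_ms: float
    error_rate: float


def incident_exists(snapshot: Snapshot, policy: dict = POLICY) -> dict:
    """Deterministic decision: same input, same output, no model call."""
    breaches = []
    if snapshot.health_score < policy["min_health_score"]:
        breaches.append("health_score")
    if snapshot.p95_ms > policy["max_p95_ms"]:
        breaches.append("p95_ms")
    if snapshot.error_rate > policy["max_error_rate"]:
        breaches.append("error_rate")
    return {
        "actor": "policy",
        "policy_version": policy["version"],
        "incident": bool(breaches),
        "breaches": breaches,
    }


# Values taken from the sample alert on this page.
decision = incident_exists(Snapshot(health_score=23, p95_ms=2800.0, error_rate=0.19))
print(decision["incident"], decision["breaches"])
```

Because the function is pure, the same snapshot and policy version always reproduce the same decision, which is what makes a recorded decision checkable after the fact.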

Diagnosis

Two AI models independently analyze the incident. If they agree, one clean alert. If they diverge, a Dissent Alert shows both theories.

Actor: claude

Authorization

Engineer receives WhatsApp or Telegram alert. Taps approve on a JWT-signed, single-use recovery link. Nothing executes without this step.

Actor: engineer

Execution

Orchestrator calls your recovery webhook. 3 retries with exponential backoff. Dead Letter Queue on failure. You control what the webhook does.

Actor: orchestrator
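The retry behaviour can be sketched roughly as follows. This is an illustration, not the orchestrator's code: `call_with_retries`, the 1-second base delay, and the list used as a dead letter queue are all assumptions, and "3 retries" is read here as one initial attempt plus three retries.

```python
import time
from typing import Callable, List


def call_with_retries(
    send: Callable[[], bool],
    dead_letters: List[dict],
    payload: dict,
    retries: int = 3,
    base_delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Try once, then retry up to `retries` times, doubling the wait each time."""
    for attempt in range(retries + 1):
        try:
            if send():
                return True
        except Exception:
            pass  # an exception counts as a failed delivery
        if attempt < retries:
            sleep(base_delay_s * (2 ** attempt))  # 1s, 2s, 4s with the defaults
    dead_letters.append(payload)  # park the payload for later inspection
    return False


# Demo with a webhook that always fails; waits are recorded instead of slept.
waits = []
dlq = []
ok = call_with_retries(lambda: False, dlq, {"incident_id": "demo"}, sleep=waits.append)
print(ok, waits, dlq)
```

Passing `send` and `sleep` in as callables keeps the sketch testable without a network or real delays.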

Audit

Every stage, every actor, every confidence score, every policy version — written to an append-only Redis log. Replayable from the ledger alone.

Actor: system

🚨 Checkout API degraded

Health score: 23/100 | P95: 2.8s | Errors: 19%

Diagnosis

Both models agree — confidence: 87%

Database connection pool exhausted after query timeout change

Recent deployment (Diff-in-Pocket)

3m ago — a1b2c3d: "Fix checkout query isolation level" (+12/-3)

⚠️ This commit touched database/query files

Suggested fix

Restart checkout worker pool

[ Approve fix ]

Diagnostic Council

Two models. One verdict — or a Dissent Alert.

When two AI specialists disagree, the disagreement is more valuable than either answer alone. AlertEngine surfaces it before you approve anything.

Models agree

Consensus Alert

Both models independently reached the same diagnosis. You receive one clean alert with combined confidence.

⚡ Action Recommended

Score: 23 | P95: 2.8s

Issue: Database pool exhausted

Confidence: 87% (both models agree)

👉 Approve fix: [link]

Models disagree

Dissent Alert

The models reached different conclusions. You see both theories, specific logs to check, and two approve paths. The disagreement prevents false confidence.

⚠️ Degraded State — Models Disagree

Score: 23 | P95: 2.8s

Theory A (Database): Pool exhausted — 82%

Check: DB slow query log

Theory B (Network): Upstream timeout — 76%

Check: Upstream response times

Investigate before approving.

👉 Trust A 👉 Trust B
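One way to picture the consensus-or-dissent decision is the sketch below. The `Diagnosis` fields, the category comparison, and the choice to report the lower confidence on agreement are assumptions; this page does not say how the two confidences are combined.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Diagnosis:
    category: str      # e.g. "database" or "network"
    summary: str
    confidence: float  # 0.0 to 1.0
    check: str         # what the engineer should look at first


def council_verdict(a: Diagnosis, b: Diagnosis) -> dict:
    """Matching categories yield one alert; differing ones surface both theories."""
    if a.category == b.category:
        return {
            "type": "consensus",
            "issue": a.summary,
            # Assumption: be conservative and report the lower confidence.
            "confidence": min(a.confidence, b.confidence),
        }
    return {
        "type": "dissent",
        "theories": [
            {"label": label, "category": d.category, "summary": d.summary,
             "confidence": d.confidence, "check": d.check}
            for label, d in (("A", a), ("B", b))
        ],
    }


theory_a = Diagnosis("database", "Pool exhausted", 0.82, "DB slow query log")
theory_b = Diagnosis("network", "Upstream timeout", 0.76, "Upstream response times")
verdict = council_verdict(theory_a, theory_b)
print(verdict["type"], [t["label"] for t in verdict["theories"]])
```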

How it works

Six stages. Full attribution at every step.

From anomaly detection to authorized recovery — entirely through your phone. Every actor logged.

Instrument

Add instrument(app). P95 latency, error rate, and health scoring start immediately. Free SDK, MIT licensed.

Detect

Orchestrator polls /health/alerts every 5s. Policy gates run first — deterministic, no AI. Incident opens when thresholds breach.

Diagnose

Two AI models analyze independently. Commit context injected (Diff-in-Pocket). Dissent alert if models diverge. Confidence-gated — no noise.

Authorize

WhatsApp or Telegram alert arrives. Plain English diagnosis. Tap approve. JWT-signed, single-use, 5-minute TTL. Nothing runs without this.

Execute

Your recovery webhook is called. 3 retries with exponential backoff. DLQ on failure. Orchestrator never touches your servers directly.
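The Detect stage above amounts to a polling loop. A rough standard-library sketch, assuming a service reachable on `localhost:8000` and hypothetical helper names (`fetch_alerts`, `poll`), might look like this:

```python
import json
import time
import urllib.request
from typing import Callable, Iterator, Optional


def fetch_alerts(url: str) -> dict:
    """GET the endpoint and decode its JSON body."""
    with urllib.request.urlopen(url, timeout=3) as resp:
        return json.load(resp)


def poll(
    url: str,
    fetch: Callable[[str], dict] = fetch_alerts,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    max_polls: Optional[int] = None,
) -> Iterator[dict]:
    """Yield one snapshot per poll; an unreachable service is itself a signal."""
    polls = 0
    while max_polls is None or polls < max_polls:
        try:
            snapshot = fetch(url)
        except OSError as exc:  # URLError and socket timeouts are OSError subclasses
            snapshot = {"status": "unreachable", "error": str(exc)}
        yield snapshot
        polls += 1
        if max_polls is None or polls < max_polls:
            sleep(interval_s)


# Demo with a fake fetcher so nothing touches the network.
waits = []
snapshots = list(poll(
    "http://localhost:8000/health/alerts",  # assumed host and port
    fetch=lambda _url: {"status": "critical"},
    sleep=waits.append,
    max_polls=3,
))
print(len(snapshots), waits)
```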

Free SDK

Free Forever · MIT Licensed

Local Incident Sensing — Free Forever

Everything you need to understand what your API is doing right now. No account. No cloud. No catch — until you need alerts.

✓
P95 Latency Tracking — real percentiles, not averages
✓
Error Rate Detection — 4xx/5xx with configurable thresholds
✓
Health Score 0–100 — composite, trend-aware
✓
Anomaly Scoring — detects spikes vs your baseline
✓
/health/alerts Endpoint — clean JSON, AI-agent friendly
✓
Memory Fallback — Redis optional, never crashes your app
✓
MIT Licensed — use it however you like
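To see why percentiles matter more than averages, consider the small sketch below. It uses the nearest-rank method, which is an assumption chosen for simplicity and not necessarily how the SDK computes P95.

```python
from statistics import mean


def p95_nearest_rank(samples_ms):
    """Nearest-rank 95th percentile: smallest value covering 95% of samples."""
    if not samples_ms:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_ms)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer math
    return ordered[rank - 1]


# 18 fast requests and 2 very slow ones.
latencies = [40.0] * 18 + [2800.0] * 2
print(mean(latencies))              # 316.0
print(p95_nearest_rank(latencies))  # 2800.0
```

The average suggests a mildly slow service, while the P95 shows that one request in ten is taking almost three seconds.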

The catch: You see the score drop. You don't know why. You don't get alerts. You don't get recovery links.
That's the orchestrator.

# Install
pip install fastapi-alertengine

# In your FastAPI app
from fastapi import FastAPI
from fastapi_alertengine import instrument

app = FastAPI()
instrument(app)  # that's it

# Now visit /health/alerts
# {
#   "status": "critical",
#   "health_score": {"score": 23, "trend": "degrading"},
#   "metrics": {
#     "overall_p95_ms": 2847.3,
#     "error_rate": 0.19,
#     "anomaly_score": 1.4
#   },
#   "alerts": [...]
# }
        

Channels

Alerts where engineers actually are.

No new apps to install. No dashboards to check. Just the channel your team already uses.

💬

WhatsApp

Via Twilio or Sent.dm. The most reliable mobile interrupt channel globally. Recovery approvals arrive as tappable links.

Growth plan+

✈️

Telegram

Via Telegram Bot API. Available on all plans including Starter. No per-message cost. Instant delivery globally.

All plans

Slack

Webhook-based Slack integration for team notifications. Incidents posted to your channel with recovery link.

Compliance plan+

🔗

Webhook

Generic HTTP webhook fallback. Fires when primary channel fails. Integrates with any endpoint — PagerDuty, Teams, custom.

All plans

📞

Voice

Automated voice call escalation via Twilio. Fires after 180s if no approval received. Secondary engineer notified after 300s.

Compliance plan+
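The escalation timings can be expressed as a small pure function. This is a sketch under assumptions: the action names are hypothetical, and both timers are assumed to start when the first alert is sent.

```python
def escalation_actions(seconds_since_alert: float, approved: bool) -> list:
    """Return the escalation steps that are due for an unapproved incident."""
    if approved:
        return []  # approval stops all escalation
    actions = []
    if seconds_since_alert >= 180:
        actions.append("voice_call_primary")
    if seconds_since_alert >= 300:
        actions.append("notify_secondary")
    return actions


print(escalation_actions(60, approved=False))
print(escalation_actions(200, approved=False))
print(escalation_actions(301, approved=False))
```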

📒

Audit Ledger

Every delivery attempt logged immutably. Success, failure, provider, actor, timestamp. Full ledger per incident.

All plans

Pricing

From awareness to evidence.

The SDK is free forever. As your compliance requirements grow, so does what AlertEngine proves. Every plan includes unlimited team users.

Free

Detection SDK only. MIT licensed. Runs on your servers. See the score drop — but not why, and not on your phone.

You see it. You don't get told.

pip install

Starter

$19/mo

Operational Awareness

1 service. 5 incidents/mo. Telegram alerts. Know when your app breaks — before your users tell you.

One hour of downtime costs more than a year of Starter.

Get started

Growth

$99/mo

Diagnostic Intelligence

1 service. 10 incidents/mo. WhatsApp + AI diagnosis. Know what broke, not just that it broke.

One false-positive 3am alert costs more than a month of Growth.

Get started

Popular

Team

$299/mo

Diagnostic Intelligence

3 services. 50 incidents/mo. Diagnostic Council — two AI models in adversarial deliberation. Dissent alerts when models disagree.

About $6 per incident for AI diagnosis + human authorization + full audit trail.

Get started

Compliance

$799/mo

Regulatory-Grade Auditability

10 services. 200 incidents/mo. Every incident logged with actor, policy version, and decision. Export your audit trail. Prove compliance to auditors.

SOC 2 Type II audit costs $15K–$50K. Compliance is $799/mo insurance against that delay.

Get started

Platform

$1,500/mo

Regulatory-Grade Auditability

20 services. 1,000 incidents/mo. Custom policy thresholds versioned and logged in every audit entry. Built for platforms that answer to regulators.

Generic thresholds don't work at scale. Custom thresholds become compliance evidence.

Get started

Need dedicated deployment, custom SLA, or procurement paperwork?

Compliance

Every principle enforced by code. Every claim provable by audit.

AlertEngine is designed for teams where operational decisions must be documented and defensible.

Principle	Enforcement	Audit proof
Policy decides incidents, not AI	`should_recover()` in pipeline.py gates RECOVERED	`actor: "policy"` in audit log
AI explains, humans authorize	Claude generates message; JWT gates execution	`actor: "claude"` then `actor: "engineer"`
Nothing executes without approval	POST /action/recover/confirm requires valid JWT	AUTHORIZED before EXECUTED in every log
Every action logged immutably	append_event() on every transition	get_audit_log() returns complete timeline
Deterministic alert rules	incident_policy.py — single versioned POLICY dict	`policy_version` in every audit entry
Cross-tenant isolation	Tenant ID validated on every endpoint	403 on mismatch — adversarial audit confirmed
Replay attack prevention	Atomic Redis SET NX, single-use JWT	20 concurrent attempts → exactly 1 succeeded

Safety

Human-Authorized. Always.

No automated remediation. No background execution. Every recovery action requires explicit human authorization.

🔑

JWT Recovery Tokens

Every recovery action is gated by a tenant-scoped JWT with a 5-minute TTL. Tokens are single-use and validated atomically in Redis — no replay possible.
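As a rough illustration of a tenant-scoped token with a 5-minute TTL, the sketch below builds and checks an HS256 JWT using only the standard library. The claim names (`tenant`, `incident`, `jti`), the function names, and the demo secret are assumptions; single-use enforcement is out of scope here, and a production system would normally use a maintained JWT library.

```python
import base64
import hashlib
import hmac
import json
import time


def _b64(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _unb64(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _sign(secret: bytes, signing_input: str) -> str:
    return _b64(hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest())


def issue_token(secret, tenant_id, incident_id, jti, now=None, ttl_s=300):
    """Build header.payload.signature with an expiry ttl_s seconds from now."""
    now = int(time.time()) if now is None else now
    header = _b64(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    claims = {"tenant": tenant_id, "incident": incident_id, "jti": jti, "exp": now + ttl_s}
    payload = _b64(json.dumps(claims).encode())
    return f"{header}.{payload}.{_sign(secret, header + '.' + payload)}"


def verify_token(secret, token, tenant_id, now=None):
    """Check signature first, then expiry, then tenant scope."""
    now = int(time.time()) if now is None else now
    try:
        header, payload, signature = token.split(".")
    except ValueError:
        raise ValueError("malformed token")
    if not hmac.compare_digest(signature, _sign(secret, header + "." + payload)):
        raise ValueError("bad signature")
    claims = json.loads(_unb64(payload))
    if claims["exp"] <= now:
        raise ValueError("expired")
    if claims["tenant"] != tenant_id:
        raise ValueError("wrong tenant")
    return claims


secret = b"demo-secret-not-for-production"
token = issue_token(secret, "tenant-a", "inc-1", jti="abc123", now=1_000)
claims = verify_token(secret, token, "tenant-a", now=1_100)
print(claims["incident"], claims["exp"])
```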

👁

Preview Before Authorization

GET the recovery link to see exactly what will happen. POST to execute. The preview is read-only; irreversible actions are always a separate, explicit step.

🔒

Cross-Tenant Isolation

All endpoints enforce tenant ownership. Adversarial audit confirmed: attempting to access another tenant's incidents returns 403 — always.

📒

Immutable Audit Ledger

Every alert, diagnosis, delivery attempt, and recovery authorization is written to an append-only log with full actor attribution. State is reconstructable from the ledger alone.
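The idea of reconstructing state from the ledger alone can be sketched with an in-memory list standing in for the append-only log. The `Event` fields, the `Ledger` class, and the `DETECTED` and `DIAGNOSED` stage names are hypothetical; only `AUTHORIZED` and `EXECUTED` are named on this page.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Event:
    incident_id: str
    stage: str
    actor: str
    policy_version: str


class Ledger:
    """Append-only by construction: no update or delete method is offered."""

    def __init__(self):
        self._events = []

    def append(self, event: Event) -> None:
        self._events.append(event)

    def timeline(self, incident_id: str) -> Tuple[Event, ...]:
        return tuple(e for e in self._events if e.incident_id == incident_id)


def replay(events) -> Optional[str]:
    """Rebuild the current stage and reject execution without prior authorization."""
    stage = None
    authorized = False
    for event in events:
        if event.stage == "AUTHORIZED":
            authorized = True
        if event.stage == "EXECUTED" and not authorized:
            raise ValueError("EXECUTED recorded before AUTHORIZED")
        stage = event.stage
    return stage


ledger = Ledger()
for stage, actor in [("DETECTED", "policy"), ("DIAGNOSED", "claude"),
                     ("AUTHORIZED", "engineer"), ("EXECUTED", "orchestrator")]:
    ledger.append(Event("inc-1", stage, actor, "example-1"))

print(replay(ledger.timeline("inc-1")))
print([e.actor for e in ledger.timeline("inc-1")])
```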

Security

Survived a full adversarial audit.

An autonomous AI agent acted as a hostile tenant and attempted to break isolation, replay tokens, and flood the system. All 10 checks passed.

Check	Result	Detail
Cross-tenant audit access	✓ Blocked	403 returned
Cross-tenant delivery access	✓ Blocked	403 returned
Recovery token replay (20 concurrent)	✓ Protected	1 succeeded, 19 rejected
Duplicate incident creation (race)	✓ Protected	Exactly 1 created
Concurrent token flood	✓ Handled	Atomic Redis SET NX
Natural incident detection	✓ Confirmed	End-to-end verified
WhatsApp delivery	✓ Confirmed	Live production delivery
Recovery authorization audit trail	✓ Written	Immutable append-only log
Degraded mode handling	✓ Confirmed	NORMAL/DEGRADED/EMERGENCY
Lease renewal under load	✓ Atomic	Lua compare-and-delete
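The replay check in the table (20 concurrent attempts, exactly one success) relies on an atomic set-if-absent. The sketch below imitates that property with a lock-protected in-memory set; `SingleUseRegistry` is a hypothetical stand-in for Redis SET NX, not the orchestrator's code.

```python
import threading


class SingleUseRegistry:
    """In-memory stand-in for an atomic set-if-absent operation."""

    def __init__(self):
        self._lock = threading.Lock()
        self._used = set()

    def claim(self, token_id: str) -> bool:
        # Check and insert happen under one lock, so only one caller can win.
        with self._lock:
            if token_id in self._used:
                return False
            self._used.add(token_id)
            return True


registry = SingleUseRegistry()
barrier = threading.Barrier(20)  # release all threads at the same moment
results = []


def attempt():
    barrier.wait()
    results.append(registry.claim("token-123"))


threads = [threading.Thread(target=attempt) for _ in range(20)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(results.count(True), results.count(False))  # 1 19
```

A separate "check, then write" sequence without the lock would leave a window in which two requests could both see the token as unused.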

Open Source

Source-available orchestrator. MIT-licensed SDK.

Clean separation between the free SDK and the paid orchestrator. The orchestrator is published for security audit — not for self-hosting.

fastapi_alertengine/          ← Free PyPI package, MIT licensed
    middleware.py             ← RequestMetricsMiddleware
    engine.py                 ← Core alert engine
    intelligence.py           ← Adaptive thresholds, health scoring
    storage.py                ← Redis Streams persistence
orchestrator/                 ← Source-available for audit, NOT for self-hosting
    pipeline.py               ← State machine + IncidentStage enum
    incident_policy.py        ← Single source of truth for all thresholds
    claude_engine.py          ← AI diagnosis (tool use, hardened)
    diagnostic_council.py     ← Dual-model incident court
    commit_context.py         ← Diff-in-Pocket commit correlation
    audit.py                  ← Immutable forensic ledger
    plans.py                  ← Billing tiers and feature gates
tests/                        ← 232 tests, Python 3.10/3.11/3.12
docs/                         ← This landing page + ARCHITECTURE.md

🇿🇼

Built in Zimbabwe — where the constraint became the feature.

I spent my career in accounting and finance before building AlertEngine. In finance, no transaction executes without authorization and every action leaves an audit trail. AlertEngine applies that same discipline to production infrastructure.

In Zimbabwe, engineers aren't always at laptops when things break. WhatsApp is the operational control plane. That constraint produced something better than a dashboard ever could.

10/10

Adversarial audit checks passed including replay attacks and cross-tenant isolation

Live

Live fintech tenant monitored in production — real workloads, real incidents

232

Tests passing across the full SDK and orchestration suite

Detection latency from spike to WhatsApp alert

You'll be live within 2 hours.

Policy is the floor. AI is the ceiling. The ledger proves it.
