Monitoring tools detect failures.
AlertEngine records how humans respond to them.
Authorized. Audited. Replayable. Nothing executes without your explicit approval, and every decision is recorded for auditors.
The hierarchy is enforced by architecture, not convention. Claude cannot trigger a state transition. Policy can override Claude. The audit log proves every decision.
Deterministic policy rules evaluate health score, P95 latency, and error rate. No AI involved. Policy decides whether an incident exists.
Two AI models independently analyze the incident. If they agree, one clean alert. If they diverge, a Dissent Alert shows both theories.
Engineer receives WhatsApp or Telegram alert. Taps approve on a JWT-signed, single-use recovery link. Nothing executes without this step.
Orchestrator calls your recovery webhook. 3 retries with exponential backoff. Dead Letter Queue on failure. You control what the webhook does.
Every stage, every actor, every confidence score, every policy version: written to an append-only Redis log. Replayable from the ledger alone.
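To make "replayable from the ledger alone" concrete, here is a minimal sketch of the idea. It is an illustration under stated assumptions, not AlertEngine's implementation: an in-memory list stands in for the append-only Redis log, the `Event` fields and the `DETECTED` and `DIAGNOSED` stage names are hypothetical, and `replay_state` is a made-up helper. Only the names `append_event` and `get_audit_log` are borrowed from the page.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    """One immutable ledger entry (field names are hypothetical)."""
    incident_id: str
    stage: str           # e.g. "DETECTED", "AUTHORIZED", "EXECUTED"
    actor: str           # e.g. "policy", "claude", "engineer"
    policy_version: str
    detail: dict = field(default_factory=dict)


class Ledger:
    """In-memory stand-in for an append-only log (assumption: the real log is in Redis)."""

    def __init__(self):
        self._events = []

    def append_event(self, event):
        # Entries are only ever appended, never updated or removed.
        self._events.append(event)

    def get_audit_log(self, incident_id):
        # Return the incident's events in the order they were written.
        return [e for e in self._events if e.incident_id == incident_id]


def replay_state(events):
    """Rebuild the current stage from the events alone.

    Raises ValueError if EXECUTED appears without an earlier AUTHORIZED.
    """
    seen = set()
    state = None
    for event in events:
        if event.stage == "EXECUTED" and "AUTHORIZED" not in seen:
            raise ValueError("EXECUTED recorded without a prior AUTHORIZED")
        seen.add(event.stage)
        state = event.stage
    return state
```

Because the state is derived by folding over the events, nothing besides the ledger has to be trusted when reconstructing what happened and in what order.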
When two AI specialists disagree, the disagreement is more valuable than either answer alone. AlertEngine surfaces it before you approve anything.
Both models independently reached the same diagnosis. You receive one clean alert with combined confidence.
The models reached different conclusions. You see both theories, specific logs to check, and two approve paths. The disagreement prevents false confidence.
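The consensus and dissent outcomes above can be sketched as a small decision function. This is a hedged illustration only: the `Diagnosis` type, the `build_alert` name, and the alert dictionary shape are hypothetical, and taking the lower of the two scores as the "combined confidence" is an assumption, since the page does not say how confidence is combined.

```python
from dataclasses import dataclass


@dataclass
class Diagnosis:
    """One model's independent conclusion (hypothetical shape)."""
    model: str
    root_cause: str
    confidence: float  # 0.0 to 1.0


def build_alert(first, second):
    """Return one clean alert on agreement, or a dissent alert with both theories."""
    if first.root_cause == second.root_cause:
        return {
            "type": "consensus",
            "root_cause": first.root_cause,
            # Assumption: combine conservatively by taking the lower score.
            "confidence": min(first.confidence, second.confidence),
        }
    return {
        "type": "dissent",
        # Both theories are shown, each with its own approve path.
        "theories": [
            {"model": d.model, "root_cause": d.root_cause, "confidence": d.confidence}
            for d in (first, second)
        ],
    }
```

Comparing root causes by string equality is the simplest possible agreement test; a real system would need some way to decide that two differently worded diagnoses mean the same thing.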
From anomaly detection to authorized recovery, entirely through your phone. Every actor logged.
Add instrument(app). P95 latency, error rate, and health scoring start immediately. Free SDK, MIT licensed.
Orchestrator polls /health/alerts every 5s. Policy gates run first: deterministic, no AI. Incident opens when thresholds breach.
Two AI models analyze independently. Commit context injected (Diff-in-Pocket). Dissent alert if models diverge. Confidence-gated, so no noise.
WhatsApp or Telegram alert arrives. Plain English diagnosis. Tap approve. JWT-signed, single-use, 5-minute TTL. Nothing runs without this.
Your recovery webhook is called. 3 retries with exponential backoff. DLQ on failure. Orchestrator never touches your servers directly.
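The retry behaviour in the last step can be sketched as follows. This is an assumption-laden illustration, not the orchestrator's code: `call_with_retries` and its parameters are hypothetical, a plain list stands in for the Dead Letter Queue, the 1-second base delay is invented, and "3 retries" is read as three retries after the first attempt. The `send` callable is injected so the sketch runs without a network.

```python
import time


def call_with_retries(send, payload, dead_letters, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call send(payload); on failure retry with exponential backoff.

    Returns True on success. After the final failure the payload is
    appended to dead_letters (a stand-in for a DLQ) and False is returned.
    """
    last_error = None
    for attempt in range(retries + 1):  # first attempt plus `retries` retries
        try:
            send(payload)
            return True
        except Exception as exc:
            last_error = exc
            if attempt < retries:
                # Delays double each time: base, 2x base, 4x base, ...
                sleep(base_delay * 2 ** attempt)
    dead_letters.append({"payload": payload, "error": repr(last_error)})
    return False
```

Passing `sleep` as a parameter keeps the backoff schedule observable in tests, and parking failed payloads instead of dropping them means a failed recovery call can still be inspected afterwards.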
Everything you need to understand what your API is doing right now. No account. No cloud. No catch, until you need alerts.
No new apps to install. No dashboards to check. Just the channel your team already uses.
Via Twilio or Sent.dm. The most reliable mobile interrupt channel globally. Recovery approvals arrive as tappable links.
Via Telegram Bot API. Available on all plans including Starter. No per-message cost. Instant delivery globally.
Webhook-based Slack integration for team notifications. Incidents posted to your channel with recovery link.
Generic HTTP webhook fallback. Fires when primary channel fails. Integrates with any endpoint: PagerDuty, Teams, custom.
Automated voice call escalation via Twilio. Fires after 180s if no approval received. Secondary engineer notified after 300s.
Every delivery attempt logged immutably. Success, failure, provider, actor, timestamp. Full ledger per incident.
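The voice escalation timings above (180s and 300s without approval) amount to a simple threshold check. The sketch below is illustrative only: the function name and the action labels are hypothetical, and it assumes both timers are measured from the moment the alert was sent.

```python
VOICE_CALL_AFTER_S = 180          # from the page: voice call after 180s
SECONDARY_ENGINEER_AFTER_S = 300  # from the page: secondary engineer after 300s


def escalation_actions(seconds_since_alert, approved):
    """Return the escalation steps that are due for an unapproved alert."""
    if approved:
        return []  # approval stops all escalation
    actions = []
    if seconds_since_alert >= VOICE_CALL_AFTER_S:
        actions.append("voice_call")
    if seconds_since_alert >= SECONDARY_ENGINEER_AFTER_S:
        actions.append("notify_secondary_engineer")
    return actions
```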
The SDK is free forever. As your compliance requirements grow, so does what AlertEngine proves. Every plan includes unlimited team users.
Detection SDK only. MIT licensed. Runs on your servers. See the score drop, but not why, and not on your phone.
You see it. You don't get told.
1 service. 5 incidents/mo. Telegram alerts. Know when your app breaks, before your users tell you.
One hour of downtime costs more than a year of Starter.
1 service. 10 incidents/mo. WhatsApp + AI diagnosis. Know what broke, not just that it broke.
One false-positive 3am alert costs more than a month of Growth.
3 services. 50 incidents/mo. Diagnostic Council: two AI models in adversarial deliberation. Dissent alerts when models disagree.
$6 per incident for AI diagnosis + human authorization + full audit trail.
10 services. 200 incidents/mo. Every incident logged with actor, policy version, and decision. Export your audit trail. Prove compliance to auditors.
A SOC 2 Type II audit costs $15K–$50K. Compliance is $799/mo insurance against that delay.
20 services. 1,000 incidents/mo. Custom policy thresholds versioned and logged in every audit entry. Built for platforms that answer to regulators.
Generic thresholds don't work at scale. Custom thresholds become compliance evidence.
Need dedicated deployment, custom SLA, or procurement paperwork?
Contact us about Enterprise.
AlertEngine is designed for teams where operational decisions must be documented and defensible.
| Principle | Enforcement | Audit proof |
|---|---|---|
| Policy decides incidents, not AI | should_recover() in pipeline.py gates RECOVERED | actor: "policy" in audit log |
| AI explains, humans authorize | Claude generates message; JWT gates execution | actor: "claude" then actor: "engineer" |
| Nothing executes without approval | POST /action/recover/confirm requires valid JWT | AUTHORIZED before EXECUTED in every log |
| Every action logged immutably | append_event() on every transition | get_audit_log() returns complete timeline |
| Deterministic alert rules | incident_policy.py: single versioned POLICY dict | policy_version in every audit entry |
| Cross-tenant isolation | Tenant ID validated on every endpoint | 403 on mismatch, adversarial audit confirmed |
| Replay attack prevention | Atomic Redis SET NX, single-use JWT | 20 concurrent attempts, exactly 1 succeeded |
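As an illustration of the "Deterministic alert rules" row, a versioned policy gate might look like the sketch below. Only the `POLICY` and `policy_version` names come from the table. The threshold keys, every threshold value, the metric names, and the `evaluate_incident` function are hypothetical and chosen purely for the example.

```python
# Hypothetical policy: the keys and numbers are illustrative, not AlertEngine defaults.
POLICY = {
    "version": "example-1",
    "min_health_score": 70,
    "max_p95_latency_ms": 1500,
    "max_error_rate": 0.05,
}


def evaluate_incident(metrics, policy=POLICY):
    """Decide deterministically whether an incident exists. No AI involved."""
    breaches = []
    if metrics["health_score"] < policy["min_health_score"]:
        breaches.append("health_score")
    if metrics["p95_latency_ms"] > policy["max_p95_latency_ms"]:
        breaches.append("p95_latency_ms")
    if metrics["error_rate"] > policy["max_error_rate"]:
        breaches.append("error_rate")
    return {
        "actor": "policy",
        "policy_version": policy["version"],  # recorded with every decision
        "incident": bool(breaches),
        "breaches": breaches,
    }
```

Stamping each decision with the policy version is what lets an auditor tie an incident to the exact thresholds that were in force when it opened.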
No automated remediation. No background execution. Every recovery action requires explicit human authorization.
Every recovery action is gated by a tenant-scoped JWT with a 5-minute TTL. Tokens are single-use and validated atomically in Redis, so no replay is possible.
GET the recovery link to see exactly what will happen. POST to execute. The preview is read-only, and irreversible actions are always a separate, explicit step.
All endpoints enforce tenant ownership. Adversarial audit confirmed: attempting to access another tenant's incidents returns 403, always.
Every alert, diagnosis, delivery attempt, and recovery authorization is written to an append-only log with full actor attribution. State is reconstructable from the ledger alone.
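The tenant-scoped, single-use, 5-minute token described above can be sketched with the standard library alone. This is a teaching sketch under stated assumptions, not AlertEngine's token code: the claim names (`tenant`, `incident`, `jti`, `exp`), the secret, and both function names are hypothetical, the token is a hand-rolled HS256-style JWT, and a plain Python set stands in for the Redis single-use check, so it is not atomic across processes the way Redis SET NX is.

```python
import base64
import hashlib
import hmac
import json
import time

SECRET = b"demo-secret-not-for-production"  # hypothetical signing key
TTL_SECONDS = 300                           # 5-minute TTL, as stated on the page


def _b64(data):
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def _unb64(text):
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _sign(header, claims):
    return _b64(hmac.new(SECRET, f"{header}.{claims}".encode(), hashlib.sha256).digest())


def issue_token(tenant_id, incident_id, jti, now=None):
    """Build a signed token scoped to one tenant and one incident."""
    now = time.time() if now is None else now
    header = _b64(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    claims = _b64(json.dumps({
        "tenant": tenant_id, "incident": incident_id,
        "jti": jti, "exp": now + TTL_SECONDS,
    }).encode())
    return f"{header}.{claims}.{_sign(header, claims)}"


def verify_token(token, tenant_id, used_jtis, now=None):
    """Accept a token once: valid signature, right tenant, unexpired, unused."""
    now = time.time() if now is None else now
    try:
        header, claims, signature = token.split(".")
    except ValueError:
        return False
    if not hmac.compare_digest(signature.encode(), _sign(header, claims).encode()):
        return False
    data = json.loads(_unb64(claims))
    if data["tenant"] != tenant_id or now >= data["exp"]:
        return False
    if data["jti"] in used_jtis:  # stand-in for Redis SET NX (not atomic here)
        return False
    used_jtis.add(data["jti"])
    return True
```

In production code a maintained JWT library would be the safer choice; the manual encoding here only serves to show which checks gate execution.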
An autonomous AI agent acted as a hostile tenant and attempted to break isolation, replay tokens, and flood the system. 10/10 passed.
| Check | Result | Detail |
|---|---|---|
| Cross-tenant audit access | Blocked | 403 returned |
| Cross-tenant delivery access | Blocked | 403 returned |
| Recovery token replay (20 concurrent) | Protected | 1 succeeded, 19 rejected |
| Duplicate incident creation (race) | Protected | Exactly 1 created |
| Concurrent token flood | Handled | Atomic Redis SET NX |
| Natural incident detection | Confirmed | End-to-end verified |
| WhatsApp delivery | Confirmed | Live production delivery |
| Recovery authorization audit trail | Written | Immutable append-only log |
| Degraded mode handling | Confirmed | NORMAL/DEGRADED/EMERGENCY |
| Lease renewal under load | Atomic | Lua compare-and-delete |
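The "1 succeeded, 19 rejected" replay result follows from an atomic set-if-absent operation. The sketch below reproduces the shape of that check in a single process. It is an illustration only: `SingleUseStore` and `race` are hypothetical names, and a lock-guarded set stands in for Redis SET NX, which provides the same guarantee across processes.

```python
import threading


class SingleUseStore:
    """In-process stand-in for Redis SET NX: only the first claim of a key wins."""

    def __init__(self):
        self._lock = threading.Lock()
        self._claimed = set()

    def set_nx(self, key):
        # The lock makes check-and-insert a single atomic step.
        with self._lock:
            if key in self._claimed:
                return False
            self._claimed.add(key)
            return True


def race(store, key, attempts=20):
    """Fire `attempts` concurrent claims on one key and collect the outcomes."""
    results = []
    barrier = threading.Barrier(attempts)  # release all threads at once

    def worker():
        barrier.wait()
        results.append(store.set_nx(key))

    threads = [threading.Thread(target=worker) for _ in range(attempts)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

Without the lock, two threads could both pass the membership check before either inserts, which is exactly the replay window an atomic operation closes.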
Clean separation between the free SDK and the paid orchestrator. The orchestrator is published for security audit, not for self-hosting.
Built in Zimbabwe, where the constraint became the feature.
I spent my career in accounting and finance before building AlertEngine. In finance, no transaction executes without authorization and every action leaves an audit trail. AlertEngine applies that same discipline to production infrastructure.
In Zimbabwe, engineers aren't always at laptops when things break. WhatsApp is the operational control plane. That constraint produced something better than a dashboard ever could.
Fill in your details and we'll configure your tenant, fire a test alert to your phone, and send your invoice. No credit card upfront.
Invoice sent after your test alert fires. Pay via Payoneer or wire transfer.
The SDK is free and takes one line. The managed layer is ready when you are.