Incident Workspace
Service Metrics
Environment
Production
Error Rate %
89%
avg · 1h
P95 Latency ms
7.2s
avg · 1h
Request Volume k/min
11k
avg · 1h
Success Rate %
11%
avg · 1h
provisioning-service
Error rate went from 12% to 89% immediately after rollback
Error rate before rollback
12%
Error rate after rollback
89%
Baseline error rate
0.4%
The rollback made things significantly worse. Rolling back again could make it even worse.
Error Rate Progression
Normal operation
Config parser bug — 12% of requests failing
Error rate spiked after rollback
Endpoint Error Breakdown
All endpoints hitting provisioning_token column — all affected equally
POST /provision
Primary provision flow — column read on every job
91%
420 req/s
GET /provision/:id
Status check — also reads provisioning_token
88%
180 req/s
PUT /provision/:id/cancel
Cancel flow — partial reads of provision_jobs row
85%
62 req/s
GET /provision/health
Health check — does not touch provision_jobs
0%
12 req/s