Incident Workspace

Service Metrics

Environment

Production

Error Rate %

89%

avg · 1h

Trend, 10:20 to 10:44 (y-axis 0% to 100%): 1%, 1%, 4%, 4%, 12%, 12%, 89%, 89%, 89%, 89%, 89%, 89%

P95 Latency ms

7.2s

avg · 1h

Trend, 10:20 to 10:44 (y-axis 0 to 14.2s): 45ms, 45ms, 80ms, 80ms, 280ms, 280ms, 14.2s, 14.2s, 14.2s, 14.2s, 14.2s, 14.2s

Request Volume k/min

11k

avg · 1h

Trend, 10:20 to 10:44 (y-axis 0 to 15k): 10k, 10k, 11k, 10k, 11k, 10k, 11k, 10k, 11k, 11k, 10k, 11k

Success Rate %

11%

avg · 1h

Trend, 10:20 to 10:44 (y-axis 0% to 100%): 99%, 99%, 96%, 96%, 88%, 88%, 11%, 11%, 11%, 11%, 11%, 11%

provisioning-service

Error rate went from 12% to 89% immediately after rollback

CRITICAL

Error rate before rollback

12%

Error rate after rollback

89%

Baseline error rate

0.4%

The rollback made things significantly worse: the error rate is now far above both the pre-rollback 12% and the 0.4% baseline. Rolling back again could make things even worse.

Error Rate Progression

Baseline (v2.8.0) · Before 10:22 UTC
0.4%

Normal operation

v2.8.1 deployed · 10:26 UTC
12%

Config parser bug — 12% of requests failing

After rollback · 10:44 UTC
89%

Error rate spiked after rollback
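To put the progression above in proportion, the short sketch below expresses each stage as a multiple of the baseline and of the stage before it. The stage labels and percentages come from the timeline; the `STAGES` list and the `fold_changes` helper are illustrative names, not part of the provisioning-service.

```python
# Error-rate stages taken from the timeline above (values in percent).
STAGES = [
    ("Baseline (v2.8.0)", 0.4),
    ("v2.8.1 deployed", 12.0),
    ("After rollback", 89.0),
]


def fold_changes(stages):
    """Return (label, rate, multiple_of_baseline, multiple_of_previous) per stage.

    multiple_of_previous is None for the first stage, which has no predecessor.
    """
    baseline = stages[0][1]
    rows = []
    previous = None
    for label, rate in stages:
        vs_baseline = rate / baseline
        vs_previous = rate / previous if previous is not None else None
        rows.append((label, rate, vs_baseline, vs_previous))
        previous = rate
    return rows


if __name__ == "__main__":
    for label, rate, vs_baseline, vs_previous in fold_changes(STAGES):
        step = "n/a" if vs_previous is None else f"{vs_previous:.1f}x previous"
        print(f"{label:<20} {rate:>5.1f}%  {vs_baseline:>6.1f}x baseline  {step}")
```

Read this way, the original bug was roughly a 30x increase over baseline, while the rollback added a further 7x or so on top of that, leaving the error rate more than 220 times the baseline.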

Endpoint Error Breakdown

Every endpoint that reads the provisioning_token column is failing at a similar rate (85% to 91%); the health check, which does not touch provision_jobs, is unaffected.

POST /provision
Primary provision flow: column read on every job
Error rate 91% · 420 req/s

GET /provision/:id
Status check: also reads provisioning_token
Error rate 88% · 180 req/s

PUT /provision/:id/cancel
Cancel flow: partial reads of provision_jobs row
Error rate 85% · 62 req/s

GET /provision/health
Health check: does not touch provision_jobs
Error rate 0% · 12 req/s
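As a sanity check on the breakdown above, the sketch below weights each endpoint's error rate by its traffic and splits the endpoints by whether they read from provision_jobs. The rates and request volumes are the ones shown in the breakdown; the `EndpointStats` type, the `reads_provision_jobs` flag, and the helper function are hypothetical names introduced only for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EndpointStats:
    route: str
    error_rate: float           # fraction of requests failing (0.0 to 1.0)
    req_per_s: float            # request volume
    reads_provision_jobs: bool  # assumption: encoded from the card descriptions


# Values copied from the endpoint breakdown above.
ENDPOINTS = [
    EndpointStats("POST /provision", 0.91, 420, True),
    EndpointStats("GET /provision/:id", 0.88, 180, True),
    EndpointStats("PUT /provision/:id/cancel", 0.85, 62, True),
    EndpointStats("GET /provision/health", 0.00, 12, False),
]


def weighted_error_rate(endpoints):
    """Traffic-weighted error rate across the given endpoints."""
    total = sum(e.req_per_s for e in endpoints)
    failing = sum(e.error_rate * e.req_per_s for e in endpoints)
    return failing / total if total else 0.0


if __name__ == "__main__":
    touching = [e for e in ENDPOINTS if e.reads_provision_jobs]
    others = [e for e in ENDPOINTS if not e.reads_provision_jobs]
    print(f"all endpoints:         {weighted_error_rate(ENDPOINTS):.1%}")  # 88.0%
    print(f"reads provision_jobs:  {weighted_error_rate(touching):.1%}")   # 89.6%
    print(f"no provision_jobs use: {weighted_error_rate(others):.1%}")     # 0.0%
```

The traffic-weighted figure of about 88% lines up with the 89% service-level error rate, and the clean split between the two groups suggests the failure tracks access to provision_jobs, not the service as a whole.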

Production Incident

The Helpful Rollback

Incident Commander Update

A rollback intended to reduce errors has dramatically worsened the incident, and provisioning requests are now failing across the platform.

The provisioning-service is experiencing a severe outage. The v2.8.1 deployment introduced a config parser bug that raised the error rate from the 0.4% baseline to 12%, prompting an emergency rollback to the previous release.

Immediately after the rollback completed, the error rate surged from 12% to 89%. Multiple provisioning workflows are now failing, and thousands of engineering teams cannot provision critical infrastructure.

You are the primary on-call engineer. Investigate what changed during the rollback process, identify why the rollback caused a much larger outage than the original bug, and restore service health before the incident escalates further.