SeniorEng - Learn Real Software Engineering

Error Rate Climbing. Service Degraded

Stratos · provisioning-service

Problem

The provisioning-service is experiencing a severe outage. A recent deployment introduced a bug that increased errors to 12%, prompting an emergency rollback to the previous release. Immediately after the rollback completed, error rates surged from 12% to 89%. Multiple provisioning workflows are now failing and thousands of engineering teams are unable to provision critical infrastructure successfully.

Current error rate

89%

Affected service

provisioning-service

Current version

v2.8.0 (after rollback)

Error before rollback

invalid config format

Error after rollback

column does not exist

Incident started

1 minute after rollback

Affected teams

5,000 engineering teams

Service Details

Cloud

AWS us-east-1

Instances

8 pods (EKS)

Runtime

Java 17

Database

PostgreSQL 14

Cache

Redis 6

Version

v2.8.0 (rolled back)

Incident Workspace

Service Metrics

Environment

Production

Error Rate %

89%

avg · 1h

100%75%50%25%0%

12%

89%

10:2010:2610:3510:44

P95 Latency ms

7.2s

avg · 1h

14.2s10.7s7.1s3.6s0

45ms

80ms

280ms

14.2s

10:2010:2610:3510:44

Request Volume k/min

11k

avg · 1h

15k11k7k4k0

10k

11k

10k

11k

10k

11k

10k

11k

10k

11k

10:2010:2610:3510:44

Success Rate %

11%

avg · 1h

100%75%50%25%0%

99%

96%

88%

11%

10:2010:2610:3510:44

provisioning-service

Error rate went from 12% to 89% immediately after rollback

CRITICAL

Error rate before rollback

12%

Error rate after rollback

89%

Baseline error rate

0.4%

The rollback made things significantly worse. Rolling back again could make it even worse.

Error Rate Progression

Baseline (v2.8.0)Before 10:22 UTC

0.4%

Normal operation

v2.8.1 deployed10:26 UTC

12%

Config parser bug — 12% of requests failing

After rollback10:44 UTC

89%

Error rate spiked after rollback

Endpoint Error Breakdown

All endpoints hitting provisioning_token column — all affected equally

POST /provision

Primary provision flow — column read on every job

91%

420 req/s

GET /provision/:id

Status check — also reads provisioning_token

88%

180 req/s

PUT /provision/:id/cancel

Cancel flow — partial reads of provision_jobs row

85%

62 req/s

GET /provision/health

Health check — does not touch provision_jobs

12 req/s

The Helpful Rollback

Incident Commander Update

A rollback intended to reduce errors has dramatically worsened the incident and provisioning requests are now failing across the platform.

The provisioning-service is experiencing a severe outage. A recent deployment introduced a bug that increased errors to 12%, prompting an emergency rollback to the previous release.

Immediately after the rollback completed, error rates surged from 12% to 89%. Multiple provisioning workflows are now failing and thousands of engineering teams are unable to provision critical infrastructure successfully.

You are the primary on-call engineer. Investigate what changed during the rollback process, identify why the rollback caused a much larger outage than the original bug, and restore service health before the incident escalates further.

Real incidents need a real screen.

Service Metrics

provisioning-service

Error Rate Progression

Endpoint Error Breakdown

The Helpful Rollback