SeniorEng - Learn Real Software Engineering

API Degradation. Midnight Error Spike

Pulse Analytics · dashboard-service · Production

CacheMedium

Problem

Service was healthy all day. At exactly 00:01 UTC, error rate spiked to 67%, P99 latency jumped from 80ms to 45 seconds. A deployment happened at 23:50 UTC — 11 minutes before the spike.

Error rate

67% (ongoing)

P99 latency

80ms → 45 seconds

Affected service

dashboard-service

Users affected

800k daily active

Service Details

Cloud

AWS us-east-1

Instances

200 app servers

Runtime

Node.js 18

Database

PostgreSQL 15

Cache

Redis 7 (ElastiCache)

Version

v8.1.0

Incident Workspace

Service Metrics

Environment

Production

Error Rate %

12%

avg · 1h

100%75%50%25%0%

67%

-60m-45m-15mNow

P99 Latency ms

avg · 1h

45s34s23s11s0

80ms

45s

-60m-45m-15mNow

Request Volume k req/min

12k

avg · 1h

14k11k7k4k0

12k req/min

11k req/min

12k req/min

13k req/min

12k req/min

11k req/min

12k req/min

13k req/min

12k req/min

11k req/min

13k req/min

12k req/min

11k req/min

12k req/min

13k req/min

12k req/min

11k req/min

12k req/min

13k req/min

12k req/min

11k req/min

-60m-45m-15mNow

Active App Servers

200

avg · 1h

200150100500

200 servers

-60m-45m-15mNow

API Degradation. Midnight Error Spike

On-Call Alert

dashboard-service error rate spiked to 67% at exactly 00:01 UTC. A deployment happened 11 minutes ago.

Pulse Analytics serves 800k daily active users. The service was healthy all day — then at 00:01 UTC exactly, error rate jumped to 67% and P99 latency went from 80ms to 45 seconds. A deployment was made at 23:50 UTC.

You are on-call. Investigate the available telemetry, identify the root cause, and restore service.

Real incidents need a real screen.

Service Metrics

API Degradation. Midnight Error Spike