
Incident Workspace

Service Metrics

Environment

Production

Error Rate %

10% avg · 1h

Chart readings, oldest first (time axis 09:30 to 09:55): 1%, 1%, 1%, 1%, 2%, 3%, 5%, 8%, 14%, 22%, 28%, 31%

P95 Latency ms

2.1s avg · 1h

Chart readings, oldest first (time axis 09:30 to 09:55): 45ms, 45ms, 45ms, 50ms, 80ms, 180ms, 420ms, 1.1s, 2.8s, 5.4s, 7.2s, 8.4s

Request Volume k/min

12k avg · 1h

Chart readings, oldest first (time axis 09:30 to 09:55): 11k, 11k, 12k, 11k, 12k, 11k, 12k, 13k, 12k, 13k, 13k, 14k

Success Rate %

90% avg · 1h

Chart readings, oldest first (time axis 09:30 to 09:55): 99%, 99%, 99%, 99%, 98%, 97%, 95%, 92%, 86%, 78%, 72%, 68%

order-service

Incident started 9 minutes ago

CRITICAL
Chart: Error Rate and P95 Latency, last 15 minutes (09:40 to 09:55)

Production Incident

Order Service Failure

Incident Commander Update

Checkout requests are failing at an increasing rate and customer orders are no longer being processed reliably.

The order-service is experiencing a major production incident. The error rate has climbed above 30%, P95 latency has risen from about 45 ms to 8.4 s, and order creation requests are frequently timing out.

Customers attempting to place orders are encountering failures during checkout. Autoscaling has already been triggered, but service performance continues to degrade despite healthy infrastructure metrics.

You are the primary on-call engineer. Examine the available telemetry, determine what changed in the latest deployment, identify the underlying bottleneck, and restore normal order processing before revenue impact grows further.
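As a starting point for examining the telemetry, the rough sketch below compares how much traffic grew against how much latency and errors grew over the same window. The readings are copied from the Service Metrics panels. The helper names and the 5% threshold are illustrative assumptions, not part of the workspace.

```python
# Readings copied from the Service Metrics panels, oldest first.
error_rate_pct = [1, 1, 1, 1, 2, 3, 5, 8, 14, 22, 28, 31]
p95_latency_ms = [45, 45, 45, 50, 80, 180, 420, 1100, 2800, 5400, 7200, 8400]
request_volume_k = [11, 11, 12, 11, 12, 11, 12, 13, 12, 13, 13, 14]


def growth_factor(series):
    """Return the last reading divided by the first reading."""
    return series[-1] / series[0]


def first_index_above(series, threshold):
    """Return the index of the first reading above threshold, or None."""
    for i, value in enumerate(series):
        if value > threshold:
            return i
    return None


traffic_growth = growth_factor(request_volume_k)
latency_growth = growth_factor(p95_latency_ms)
error_growth = growth_factor(error_rate_pct)

# Hypothetical alert threshold of 5% error rate, chosen only for illustration.
onset = first_index_above(error_rate_pct, 5)

print(f"traffic x{traffic_growth:.2f}, "
      f"latency x{latency_growth:.0f}, errors x{error_growth:.0f}")
if onset is not None:
    print(f"error rate first exceeds 5% at reading {onset + 1} "
          f"of {len(error_rate_pct)}")
```

On these readings, request volume grows by roughly 1.3x while P95 latency grows by roughly 187x. That gap suggests load alone is unlikely to explain the degradation, which fits the update's note that autoscaling has not helped. It does not identify the bottleneck, so the latest deployment still needs to be inspected.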