Incident Workspace

Service Metrics

Environment: Production

Error Rate %

11% (avg · 1h)

Samples, oldest first (time axis 13:45 to 14:08): 1%, 1%, 1%, 1%, 2%, 4%, 7%, 11%, 17%, 23%, 29%, 34%

P95 Latency ms

3.7s (avg · 1h)

Samples, oldest first (time axis 13:45 to 14:08): 82ms, 82ms, 83ms, 85ms, 140ms, 320ms, 1.2s, 3.8s, 6.4s, 8.8s, 10.8s, 12.4s

Request Volume k/min

13k (avg · 1h)

Samples, oldest first (time axis 13:45 to 14:08): 12k, 12k, 13k, 12k, 13k, 12k, 13k, 12k, 13k, 14k, 14k, 14k

Success Rate %

89% (avg · 1h)

Samples, oldest first (time axis 13:45 to 14:08): 99%, 99%, 99%, 99%, 98%, 96%, 93%, 89%, 83%, 77%, 71%, 65%
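The four panels can be cross-checked with a few lines of Python. The sketch below transcribes the samples from the dashboard, recomputes the 1h averages, and looks for the first sample where each series leaves its baseline. The `first_breach` helper, the four-sample baseline, and the 1.5x threshold are illustrative assumptions, not part of the workspace.

```python
# Samples transcribed from the four dashboard panels (12 points each,
# oldest first). Units: percent, milliseconds, thousand requests per minute.
ERROR_RATE_PCT = [1, 1, 1, 1, 2, 4, 7, 11, 17, 23, 29, 34]
P95_LATENCY_MS = [82, 82, 83, 85, 140, 320, 1200, 3800, 6400, 8800, 10800, 12400]
REQUEST_VOLUME_K = [12, 12, 13, 12, 13, 12, 13, 12, 13, 14, 14, 14]
SUCCESS_RATE_PCT = [99, 99, 99, 99, 98, 96, 93, 89, 83, 77, 71, 65]


def mean(series):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(series) / len(series)


def first_breach(series, baseline_points=4, factor=1.5):
    """Return the index of the first sample above factor * baseline mean.

    The baseline is the mean of the first `baseline_points` samples.
    Both parameters are assumptions chosen for illustration.
    Returns None if no later sample crosses the threshold.
    """
    baseline = mean(series[:baseline_points])
    for index in range(baseline_points, len(series)):
        if series[index] > factor * baseline:
            return index
    return None


if __name__ == "__main__":
    print("avg error rate %:", round(mean(ERROR_RATE_PCT)))
    print("avg success rate %:", round(mean(SUCCESS_RATE_PCT)))
    print("avg p95 latency s:", round(mean(P95_LATENCY_MS) / 1000, 1))
    print("avg volume k/min:", round(mean(REQUEST_VOLUME_K)))
    print("error rate breach index:", first_breach(ERROR_RATE_PCT))
    print("latency breach index:", first_breach(P95_LATENCY_MS))
    print("volume breach index:", first_breach(REQUEST_VOLUME_K))
```

Under these assumptions, error rate and P95 latency both leave their baseline at the fifth sample, while request volume never crosses the threshold. That suggests the degradation is not explained by a traffic surge alone.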

auth-service

Incident started 14 minutes ago

CRITICAL

Chart: Error Rate and P95 Latency, last 15 minutes (13:55 to 14:10)

🛠️ Incident Mitigations

Choose operational mitigations and debugging actions. Every decision consumes time and affects the incident.

Investigate first

Check at least 3 data points on the left panel before applying any mitigation. Acting without data makes incidents worse.

Production Incident

Authentication Service Outage

Incident Commander Update

Login and authentication failures are increasing rapidly across customer-facing applications.

The auth-service is experiencing a critical outage. The error rate has climbed to 34%, P95 request latency has risen from about 80 ms to more than 12 seconds, and authentication requests are timing out throughout the platform.

Customer login attempts are failing, worker threads appear heavily blocked, and autoscaling has not improved service health despite additional capacity being provisioned.
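A rough way to see why worker threads end up blocked is Little's Law: average requests in flight equal arrival rate times average time in the system. The sketch below plugs in the dashboard's request volume and latency figures. It uses P95 latency in place of the mean, which is an assumption that overstates the true average, so the result is an upper-end estimate rather than a measurement.

```python
def in_flight(requests_per_min, latency_s):
    """Little's Law: concurrency = arrival rate (per second) * time in system.

    latency_s should be the mean latency; passing P95 (as done below)
    gives a pessimistic, upper-end estimate.
    """
    return requests_per_min / 60.0 * latency_s


if __name__ == "__main__":
    # Baseline: about 12k requests/min at 82 ms P95 latency.
    baseline = in_flight(12_000, 0.082)
    # Latest sample: about 14k requests/min at 12.4 s P95 latency.
    degraded = in_flight(14_000, 12.4)
    print(f"baseline in-flight requests: {baseline:.1f}")
    print(f"degraded in-flight requests: {degraded:.1f}")
    print(f"growth factor: {degraded / baseline:.0f}x")
```

Traffic grew by only about a sixth, but the estimated number of requests held open at once grows by more than two orders of magnitude. One possible reading, to be confirmed from traces, is that workers are waiting on something slow rather than doing more work, which would be consistent with added capacity not improving service health.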

You are the primary on-call engineer. Investigate the latest deployment, analyze traces and runtime behavior, identify the true root cause of the failure, and restore authentication services before the outage spreads further.