Incident Workspace

Service Metrics

Environment: Production

Queue Depth (jobs)

13k avg · 1h

[Chart: y-axis 0 to 23k jobs, x-axis -60m to Now. Queue depth rises from about 1k to about 23k over the hour.]

Inbound Job Rate (jobs/min)

106 avg · 1h

[Chart: y-axis 0 to 130 jobs/min, x-axis -60m to Now. Inbound rate rises gradually from 100 to 118 jobs/min over the hour.]

Job Completion Rate (%)

65% avg · 1h

[Chart: y-axis 0% to 100%, x-axis -60m to Now. Completion rate falls from 98% to 42% over the hour and is flat at 42% in the most recent samples.]

Error Rate (%)

1% avg · 1h

[Chart: y-axis 0% to 5%, x-axis -60m to Now. Error rate is flat at 1% for the entire hour.]

Critical Incident

Job Queue Degradation

Incident Commander Update

Queue depth is rising rapidly and approaching SLA breach thresholds. Jobs continue completing successfully, but throughput has dropped significantly.

The task-queue service is experiencing severe degradation. Queue depth has increased from a normal baseline of approximately 1,200 jobs to over 22,800 jobs and continues to grow.

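Traffic is only 18% above normal month-end levels (about 118 jobs/min against a baseline of about 100), yet the job completion rate has fallen from 98% to 42% over the past hour. The error rate has stayed flat at 1%, so jobs are not failing; they are spending far longer in the system before completion.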
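To put a rough number on "far longer in the system", the sketch below divides queue depth by drain rate to estimate how long a newly enqueued job waits. It assumes a FIFO queue and a constant drain rate, neither of which the incident update confirms. The 100 jobs/min baseline is read off the Inbound Job Rate chart, and the function and constant names are hypothetical. Because actual throughput has dropped by an unstated amount, the current figure uses the inbound rate as an optimistic drain rate, so it is a lower bound on the real wait.

```python
def queue_wait_minutes(queue_depth: float, drain_rate_per_min: float) -> float:
    """Estimate minutes until a newly enqueued job reaches the head of the queue.

    Assumes FIFO ordering and a constant drain rate (illustrative assumptions).
    """
    if drain_rate_per_min <= 0:
        raise ValueError("drain rate must be positive")
    return queue_depth / drain_rate_per_min


# Figures taken from the incident update and the dashboard.
BASELINE_DEPTH = 1_200        # jobs, "normal baseline of approximately 1,200"
CURRENT_DEPTH = 22_800        # jobs, "over 22,800"
BASELINE_RATE = 100.0         # jobs/min, read off the Inbound Job Rate chart
CURRENT_INBOUND = BASELINE_RATE * 1.18  # "18% above normal"

# Wait at baseline, assuming the queue drained as fast as jobs arrived.
baseline_wait = queue_wait_minutes(BASELINE_DEPTH, BASELINE_RATE)

# Optimistic wait now: pretends workers still drain at the full inbound rate.
# Real throughput is lower, so the true wait is longer than this.
best_case_wait = queue_wait_minutes(CURRENT_DEPTH, CURRENT_INBOUND)

# How many times deeper the queue is than its baseline.
depth_growth = CURRENT_DEPTH / BASELINE_DEPTH

print(f"baseline wait:  {baseline_wait:.0f} min")
print(f"best-case wait: {best_case_wait:.0f} min")
print(f"depth growth:   {depth_growth:.0f}x")
```

Under these assumptions the wait goes from about 12 minutes at baseline to at least roughly 193 minutes now, a far larger change than an 18% traffic increase would explain by itself. That mismatch points toward reduced processing capacity rather than inbound load.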

You are the primary on-call engineer. Investigate the available telemetry, identify why queue throughput has collapsed, determine the true root cause, and restore service stability before customer SLAs are breached.