

Incident Workspace

Service Metrics

Environment

Production

Error Rate %: 14% (avg · 1h)

[Chart: error rate over the last 60 minutes (−60m to Now), y-axis 0% to 100%; every reading is either 1% or 22%.]

P95 Latency: 1,923 ms (avg · 1h)

[Chart: P95 latency over the last 60 minutes (−60m to Now), y-axis 0 to 5000 ms; readings range from 184 ms to 4.2 s.]

Request Volume: 7k req/min (avg · 1h)

[Chart: request volume over the last 60 minutes (−60m to Now), y-axis 0 to 20k req/min; readings range from 0k/min to 12k/min.]

Heap Memory %: 86% (avg · 1h)

[Chart: heap memory over the last 60 minutes (−60m to Now), y-axis 0% to 100%; readings climb from 82% to 87%.]

email-service

Third OOM restart in 18 hours, same pattern each time

CRITICAL

Pod Restarts (18h): 3

Crash Interval: ~6 hours

Next OOM Est.: ~4 mins

Production Incident

Email Service Failure

Incident Commander Update

Enterprise customers are reporting intermittent email delivery failures.

The email-service has crashed three times in the last 18 hours due to out-of-memory (OOM) failures.

Each restart temporarily restores service, but memory usage steadily increases until the service crashes again. Error rates spike during every restart window, causing email delivery failures for enterprise customers.
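One way to reason about the "steadily increases until the service crashes" pattern is to fit a straight line through recent heap readings and extrapolate to the point where the process would be killed. The sketch below is only an illustration under stated assumptions: the function name `minutes_until_limit`, the sample readings, and the 92% kill threshold are all hypothetical and do not come from the dashboard, and a real leak may not grow linearly.

```python
from typing import Optional, Sequence


def minutes_until_limit(
    minutes: Sequence[float],
    heap_pct: Sequence[float],
    limit_pct: float = 100.0,
) -> Optional[float]:
    """Estimate minutes until heap usage reaches limit_pct.

    Fits a least-squares line through (minutes, heap_pct) and extrapolates
    from the most recent sample. Returns None if usage is flat or falling.
    """
    n = len(minutes)
    if n < 2 or n != len(heap_pct):
        raise ValueError("need at least two paired samples")

    mean_t = sum(minutes) / n
    mean_h = sum(heap_pct) / n
    var_t = sum((t - mean_t) ** 2 for t in minutes)
    if var_t == 0:
        raise ValueError("samples must span more than one timestamp")

    cov = sum((t - mean_t) * (h - mean_h) for t, h in zip(minutes, heap_pct))
    slope = cov / var_t  # percentage points of heap per minute
    if slope <= 0:
        return None  # no upward trend, so no projected crossing

    intercept = mean_h - slope * mean_t
    t_limit = (limit_pct - intercept) / slope
    # Clamp at zero in case the fitted line has already crossed the limit.
    return max(0.0, t_limit - minutes[-1])


# Hypothetical readings (minutes since restart, heap %), not dashboard data.
sample_minutes = [0, 10, 20, 30, 40]
sample_heap = [80.0, 82.0, 84.0, 86.0, 88.0]

# Assumed kill threshold of 92%; the real limit depends on the deployment.
eta = minutes_until_limit(sample_minutes, sample_heap, limit_pct=92.0)
print(f"Projected minutes until limit: {eta:.1f}")
```

With these made-up samples the heap grows by 0.2 percentage points per minute, so the projection comes out at roughly 20 minutes. An estimate like this is only as good as the linearity assumption, so it is best treated as a rough deadline for mitigation rather than a precise prediction.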

You are the primary on-call engineer. Investigate the available telemetry, identify the root cause of the recurring OOM crashes, and restore service stability before the next failure occurs.