

Incident Workspace

Service Metrics

Environment: Production

Error Rate %: 1% (avg · 1h)

[Chart: error rate flat at 1% across 12 samples, 10:50 to 11:08; y-axis 0% to 5%]

P95 Latency ms: 48ms (avg · 1h)

[Chart: P95 latency between 46ms and 49ms across 12 samples, 10:50 to 11:08; y-axis 0 to 100ms]

Request Volume k/min: 13k (avg · 1h)

[Chart: request volume between 12k and 14k across 12 samples, 10:50 to 11:08; y-axis 0 to 15k]

Success Rate %: 99% (avg · 1h)

[Chart: success rate between 99% and 100% across 12 samples, 10:50 to 11:08; y-axis 0% to 100%]

order-service

All metrics nominal; no indication of a problem from the application layer

HEALTHY

Orders processed (1h): 2,841

Orders confirmed missing: 47

Errors in last 1h: 0

Incident Scope

Affected time window: 11:34:01–11:34:03 UTC

Duration of data loss risk: 2 seconds

Orders in window: 47 confirmed missing

Orders before window: 2,794, all durable

Orders after failover: Ongoing, durable (same risk)

Client notification: Pending (required)
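The scope figures can be cross-checked with a few lines of arithmetic. The sketch below copies its numbers from the panels above and assumes that the hourly "orders processed" total is made up only of the durable orders before the window plus the 47 missing ones; the variable names are illustrative.

```python
# Figures copied from the dashboard panels above.
orders_processed_1h = 2_841
orders_missing = 47
orders_before_window = 2_794

# Assumption: every missing order falls inside the 2-second window, so the
# hourly total splits cleanly into durable orders and missing orders.
assert orders_before_window + orders_missing == orders_processed_1h

# Share of the hour's acknowledged orders that never reached the database.
missing_fraction = orders_missing / orders_processed_1h
print(f"{missing_fraction:.2%} of the hour's processed orders are missing")
```

If the assertion holds, the three counters describe the same population of orders, which supports treating the 2-second window as the whole blast radius so far.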

Endpoint Breakdown: Last Hour

Endpoint               Requests   P95     Errors   Status
POST /confirm-order    2,841      48ms    0        OK
GET /order/:id         14,220     12ms    0        OK
GET /orders            3,104      34ms    0        OK
DELETE /order/:id      218        22ms    0        OK

The logs say it succeeded. The database says it never happened. Both are correct.

Production Incident

The Invisible Writes

Incident Commander Update

A hedge fund client reports 47 missing order confirmations despite receiving successful HTTP 200 responses from the platform.

The order-service is facing a critical data integrity incident. Clients possess receipts proving successful order confirmations, but the corresponding records cannot be found in the database.

Application metrics appear completely healthy. Error rates remain near zero, latency is normal, and logs consistently show successful writes with no visible failures anywhere in the application stack.

You are the primary on-call engineer. Investigate the available telemetry, determine how successful writes disappeared without generating errors, identify the true root cause, and prevent further data loss before contractual customer impact escalates.
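One possible first step in the investigation is to reconcile client receipts against what the database actually holds. The sketch below is a minimal illustration, not the platform's tooling: the record shape (`order_id`, `acked_at`), the helper names, the sample data, and the calendar date are all hypothetical, and only the 11:34:01 to 11:34:03 UTC window comes from the Incident Scope panel.

```python
from datetime import datetime, timezone

# Window taken from the Incident Scope panel. The date is a placeholder,
# since the scenario only gives times of day.
WINDOW_START = datetime(2024, 1, 1, 11, 34, 1, tzinfo=timezone.utc)
WINDOW_END = datetime(2024, 1, 1, 11, 34, 3, tzinfo=timezone.utc)


def find_missing_orders(receipts, stored_ids):
    """Return receipts (HTTP 200 acknowledgements) with no matching stored row."""
    stored = set(stored_ids)
    return [r for r in receipts if r["order_id"] not in stored]


def outside_window(missing, start=WINDOW_START, end=WINDOW_END):
    """Return missing receipts acknowledged outside the suspected window."""
    return [r for r in missing if not (start <= r["acked_at"] <= end)]


# Hypothetical sample data; real input would come from client receipts
# and a query against the orders table.
receipts = [
    {"order_id": "ord-1", "acked_at": datetime(2024, 1, 1, 11, 33, 59, tzinfo=timezone.utc)},
    {"order_id": "ord-2", "acked_at": datetime(2024, 1, 1, 11, 34, 2, tzinfo=timezone.utc)},
    {"order_id": "ord-3", "acked_at": datetime(2024, 1, 1, 11, 34, 5, tzinfo=timezone.utc)},
]
stored_ids = ["ord-1", "ord-3"]

missing = find_missing_orders(receipts, stored_ids)
print([r["order_id"] for r in missing])
print(outside_window(missing))
```

Any missing receipt that lands outside the window would suggest the affected time range is wider than the dashboard reports, which matters for the pending client notification.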