SeniorEng - Learn Real Software Engineering

Orders Placed. Never Arriving

Axiom · order-service · Production

Data Loss · No Errors · Hard

Problem

The order-service is facing a critical data integrity incident. Clients possess receipts proving successful order confirmations, but the corresponding records cannot be found in the database. Application metrics appear completely healthy. Error rates remain near zero, latency is normal, and logs consistently show successful writes with no visible failures anywhere in the application stack.

Missing orders: 47 confirmations

Affected service: order-service

Application error rate: 0.1% (normal)

Client status: Escalated (has receipts)

Incident started: 12 minutes ago

Data loss window: 11:34:01–11:34:03 UTC

Service Details

Cloud: AWS us-east-1

Instances: 6 pods (EKS)

Runtime: Python 3.11

Database: PostgreSQL 14

Cache: Redis 7 (async replica)

Version: v3.2.1

Incident Workspace

Service Metrics

Environment

Production

Error Rate % (avg · 1h) [chart, 10:50–11:08]

P95 Latency ms: 48ms (avg · 1h) [chart, samples between 46ms and 49ms, 10:50–11:08]

Request Volume k/min: 13k (avg · 1h) [chart, samples between 12k and 14k, 10:50–11:08]

Success Rate %: 99% (avg · 1h) [chart, samples between 99% and 100%, 10:50–11:08]

order-service

All metrics nominal: no indication of a problem from the application layer

HEALTHY

Orders processed (1h)

2,841

Orders confirmed missing

Errors in last 1h

Incident Scope

Affected time window: 11:34:01–11:34:03 UTC

Duration of data loss risk: 2 seconds

Orders in window: 47 confirmed missing

Orders before window: 2,794 (all durable)

Orders after failover: Ongoing, durable (same risk)

Client notification: Pending (required)
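
The scope figures above can be cross-checked with a little arithmetic. The following is a minimal sketch using only the numbers shown on this page; the variable names are hypothetical. It confirms that the before-window and missing counts add up to the hourly total, and expresses the loss as a share of processed orders, which can then be compared against the 0.1% application error rate.

```python
# Figures copied from the panels on this page.
orders_processed_1h = 2841
orders_missing = 47
orders_before_window = 2794

# The before-window and missing counts should account for every processed order.
assert orders_before_window + orders_missing == orders_processed_1h

# Share of the hour's orders that were confirmed but cannot be found.
missing_share = orders_missing / orders_processed_1h
print(f"{missing_share:.2%} of processed orders are missing")  # prints 1.65% ...
```

The missing share is well above the reported error rate, which is consistent with the page's framing that the loss is not visible as application errors.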

Endpoint Breakdown — Last Hour

POST /confirm-order

2,841 requests

P95

48ms

Errors

GET /order/:id

14,220 requests

P95

12ms

Errors

GET /orders

3,104 requests

P95

34ms

Errors

DELETE /order/:id

218 requests

P95

22ms

Errors

The logs say it succeeded. The database says it never happened. Both are correct.
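
One way both statements can be true at once is an acknowledged write that is lost during a failover to an asynchronously replicated copy. The page lists an async replica and mentions a failover, but it does not state the root cause, so the sketch below is only a toy model of that hypothesis. `Node` and `AsyncReplicatedStore` are hypothetical classes, not part of order-service or any real database client.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A toy in-memory store standing in for one storage node."""
    data: dict = field(default_factory=dict)


@dataclass
class AsyncReplicatedStore:
    """Toy primary/replica pair with asynchronous replication (hypothetical)."""
    primary: Node = field(default_factory=Node)
    replica: Node = field(default_factory=Node)
    pending: list = field(default_factory=list)  # acknowledged, not yet replicated

    def write(self, key: str, value: str) -> str:
        # The primary acknowledges as soon as it holds the write locally.
        self.primary.data[key] = value
        self.pending.append((key, value))
        return "OK"

    def replicate(self) -> None:
        # Replication runs later, independently of the acknowledgement.
        for key, value in self.pending:
            self.replica.data[key] = value
        self.pending.clear()

    def failover(self) -> None:
        # The primary is lost; the replica is promoted with whatever it has.
        self.primary = self.replica
        self.replica = Node()
        self.pending.clear()


store = AsyncReplicatedStore()
log = []

log.append(store.write("order-1", "confirmed"))  # replicated before the failure
store.replicate()
log.append(store.write("order-2", "confirmed"))  # acknowledged, never replicated
store.failover()

surviving = sorted(store.primary.data)
print(log)        # ['OK', 'OK']: every write was acknowledged
print(surviving)  # ['order-1']: only the replicated write survived
```

In this model the application log is accurate (both writes returned "OK") and the promoted node is accurate too (it never received the second write), which mirrors the contradiction described above.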

The Invisible Writes

Incident Commander Update

A hedge fund client reports that 47 confirmed orders are missing, even though each request received a successful HTTP 200 response from the platform.

You are the primary on-call engineer. Investigate the available telemetry, determine how successful writes disappeared without generating errors, identify the true root cause, and prevent further data loss before contractual customer impact escalates.
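
A practical first step in such an investigation is to reconcile the client's receipts against what the database actually holds for the affected window. The following is a minimal sketch of that comparison. The function name, the record shape (`order_id`, `confirmed_at`), the sample IDs, and the calendar date are all assumptions made for illustration, since the page gives only the time of day.

```python
from datetime import datetime, timezone


def find_missing_orders(receipts, stored_ids, window_start, window_end):
    """Return receipts inside the window (inclusive) whose order ID is not stored."""
    return [
        r for r in receipts
        if window_start <= r["confirmed_at"] <= window_end
        and r["order_id"] not in stored_ids
    ]


def at(hour, minute, second):
    # Hypothetical date; only the UTC time of day comes from the page.
    return datetime(2024, 1, 1, hour, minute, second, tzinfo=timezone.utc)


# Hypothetical sample data: receipts from the client, IDs found in the database.
receipts = [
    {"order_id": "A-1", "confirmed_at": at(11, 33, 59)},
    {"order_id": "A-2", "confirmed_at": at(11, 34, 2)},
    {"order_id": "A-3", "confirmed_at": at(11, 34, 2)},
]
stored_ids = {"A-1", "A-3"}

missing = find_missing_orders(receipts, stored_ids, at(11, 34, 1), at(11, 34, 3))
print([r["order_id"] for r in missing])  # ['A-2']
```

Running the same comparison with the window widened on both sides is a cheap way to check whether the loss is really confined to 11:34:01–11:34:03 UTC.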
