SeniorEng - Learn Real Software Engineering

Order Service Failure

Problem

The order-service is experiencing a major production incident. The error rate has climbed to 31%, P95 latency has risen from about 45 ms to 8.4 s, and order creation requests are frequently timing out. Customers attempting to place orders are encountering failures during checkout. Autoscaling has already been triggered, but service performance continues to degrade despite healthy infrastructure metrics.

Current error rate: 31%

Latest deployment: order-service v4.1.2

Incident started: 9 minutes ago

Service Details

Cloud: GCP us-central1

Instances: 4 pods (GKE)

Runtime: Python 3.10

Database: PostgreSQL 14

Cache: Redis 6

Version: v4.1.2

Incident Workspace

Service Metrics

Environment: Production

Error Rate %: 10% (1h average). Readings, 09:30 to 09:55: 14%, 22%, 28%, 31%.

P95 Latency: 2.1s (1h average). Readings, 09:30 to 09:55: 45ms, 50ms, 80ms, 180ms, 420ms, 1.1s, 2.8s, 5.4s, 7.2s, 8.4s.

Request Volume (k/min): 12k (1h average). Readings, 09:30 to 09:55: 11k, 12k, 11k, 12k, 11k, 12k, 13k, 12k, 13k, 14k.

Success Rate %: 90% (1h average). Readings, 09:30 to 09:55: 99%, 98%, 97%, 95%, 92%, 86%, 78%, 72%, 68%.
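One way to read these panels together is Little's Law, which estimates how many requests are in flight from the arrival rate and the time each request spends in the service. The sketch below is illustrative only. It assumes the i-th latency reading lines up with the i-th volume reading, and it uses P95 latency as a stand-in for mean latency, so it overstates the true concurrency. The function name `in_flight` is hypothetical.

```python
# Readings copied from the Service Metrics panels, in the order listed.
# Assumption: the i-th latency reading pairs with the i-th volume reading.
P95_LATENCY_MS = [45, 50, 80, 180, 420, 1100, 2800, 5400, 7200, 8400]
REQUESTS_PER_MIN = [11_000, 12_000, 11_000, 12_000, 11_000,
                    12_000, 13_000, 12_000, 13_000, 14_000]
PODS = 4  # from Service Details; ignores pods added by autoscaling since


def in_flight(requests_per_min: float, latency_ms: float) -> float:
    """Little's Law: L = arrival rate (req/s) * time in system (s)."""
    return (requests_per_min / 60.0) * (latency_ms / 1000.0)


# P95 is used in place of mean latency, so these are upper-bound estimates.
estimates = [in_flight(r, l) for r, l in zip(REQUESTS_PER_MIN, P95_LATENCY_MS)]

for rate, latency, est in zip(REQUESTS_PER_MIN, P95_LATENCY_MS, estimates):
    print(f"{rate:>6} req/min at {latency:>5} ms -> "
          f"~{est:7.1f} in flight (~{est / PODS:6.1f} per pod)")
```

Under these assumptions the estimate grows from roughly 8 requests in flight at the first reading to roughly 1,960 at the last, while traffic rises only from 11k to 14k per minute. That pattern suggests requests are queuing on something shared rather than on pod capacity, which would be consistent with autoscaling not helping, but the telemetry here does not say what that shared resource is.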

order-service status: CRITICAL. Incident started 9 minutes ago. (Error Rate and P95 Latency charts, last 15 minutes.)


Incident Commander Update

Checkout requests are failing at an increasing rate and customer orders are no longer being processed reliably.


You are the primary on-call engineer. Examine the available telemetry, determine what changed in the latest deployment, identify the underlying bottleneck, and restore normal order processing before revenue impact grows further.
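A first pass over the telemetry might look like the sketch below. It uses the readings shown in Service Metrics to compare how much latency grew against how much traffic grew, to find the first reading where latency left its baseline, and to estimate the current volume of failing requests. The helper names `growth` and `first_breach`, the two-reading baseline, and the 3x threshold are all assumptions chosen for illustration, not part of the incident tooling.

```python
# Readings copied from the Service Metrics panels, in the order listed.
P95_LATENCY_MS = [45, 50, 80, 180, 420, 1100, 2800, 5400, 7200, 8400]
REQUESTS_PER_MIN = [11_000, 12_000, 11_000, 12_000, 11_000,
                    12_000, 13_000, 12_000, 13_000, 14_000]
CURRENT_ERROR_RATE = 0.31  # "Current error rate: 31%"


def growth(series):
    """Ratio of the last reading to the first."""
    return series[-1] / series[0]


def first_breach(series, baseline_points=2, factor=3.0):
    """Index of the first reading above factor * baseline, or None.

    The baseline is the mean of the first `baseline_points` readings
    (an illustrative choice, not a tuned alerting rule).
    """
    baseline = sum(series[:baseline_points]) / baseline_points
    for i in range(baseline_points, len(series)):
        if series[i] > factor * baseline:
            return i
    return None


latency_growth = growth(P95_LATENCY_MS)      # about 187x
volume_growth = growth(REQUESTS_PER_MIN)     # about 1.27x
breach_index = first_breach(P95_LATENCY_MS)  # 3, the 180 ms reading
failing_per_min = CURRENT_ERROR_RATE * REQUESTS_PER_MIN[-1]  # about 4,340

print(f"latency grew {latency_growth:.0f}x, traffic grew {volume_growth:.2f}x")
print(f"latency first breached baseline at reading #{breach_index}")
print(f"about {failing_per_min:.0f} failing requests per minute right now")
```

Latency growing by two orders of magnitude while traffic grows by about a quarter points away from a simple load spike. The breach index gives a rough point in the window to line up against the rollout time of v4.1.2, which the page does not state.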
