SeniorEng - Learn Real Software Engineering

Cache Errors. 18 Services Down.

Problem

18 services are simultaneously reporting cache errors and elevated latency. Error rates across all of them jumped from near-zero to 65–75% within minutes. No single service has a recent deployment.

Services affected

18 (all critical)

Avg error rate

~69%

Incident started

8 minutes ago

Infrastructure

Cache

Redis cluster

Nodes

12 (3 primaries)

Region

us-east-1

Clients

18 services

Cache version

v7.2.4

Uptime

47 days

Incident Workspace

Service Overview

Environment

Production

18 services simultaneously degraded

All 18 services share the same cache cluster. Error messages are consistent across all of them: cache: connection refused, cache: no primary node available, cache: slot assignment error. No individual service has a deployment in the last 6 hours.

delivery-service

P99: 4.2s

71%

errors

search-service

P99: 3.8s

68%

errors

surge-pricing

P99: 5.1s

74%

errors

fulfillment-service

P99: 4.0s

69%

errors

notification-service

P99: 3.6s

66%

errors

payment-router

P99: 4.4s

72%

errors

order-tracking

P99: 3.9s

67%

errors

geo-service

P99: 3.5s

65%

errors

driver-dispatch

P99: 4.7s

73%

errors

marketplace-api

P99: 4.1s

70%

errors

catalog-service

P99: 3.4s

64%

errors

analytics-writer

P99: 3.3s

63%

errors

rate-limiter

P99: 5.3s

75%

errors

recommendation-engine

P99: 3.7s

66%

errors

route-optimizer

P99: 4.0s

69%

errors

session-service

P99: 4.3s

71%

errors

event-bus

P99: 3.2s

62%

errors

fraud-detection

P99: 3.8s

68%

errors

Error pattern — consistent across all 18 services

[ERROR] cache: no primary node available for slot 4821

[ERROR] cache: connection refused — node 10.0.1.4:6379

[ERROR] cache: slot assignment error — cluster topology changed mid-request

[ERROR] cache: no primary node available for slot 9102

[ERROR] cache: MOVED redirect to node that is no longer primary

[ERROR] cache: connection refused — node 10.0.1.7:6379

Cache Errors. 18 Services Down.

PagerDuty · 02:34 UTC

ALERT: cache-cluster errors elevated across all dependent services. 18 teams paged. Error rates 65–75% platform-wide.

18 services have simultaneously started failing with cache errors. Error rates jumped from near-zero to ~70% within minutes. The errors are consistent across all services: connection failures and slot assignment errors.

No individual service has a recent deployment. The cache cluster has been running for 47 days without incident.

You are the on-call engineer. Investigate the available telemetry, identify the root cause, and restore service.

Real incidents need a real screen.

Service Overview

Cache Errors. 18 Services Down.