Incident Workspace

Service Overview

Environment: Production

18 services simultaneously degraded

All 18 services share the same cache cluster, and all of them report the same error messages: cache: connection refused, cache: no primary node available, cache: slot assignment error. No individual service has had a deployment in the last 6 hours.

delivery-service · P99: 4.2s · 71% errors

search-service · P99: 3.8s · 68% errors

surge-pricing · P99: 5.1s · 74% errors

fulfillment-service · P99: 4.0s · 69% errors

notification-service · P99: 3.6s · 66% errors

payment-router · P99: 4.4s · 72% errors

order-tracking · P99: 3.9s · 67% errors

geo-service · P99: 3.5s · 65% errors

driver-dispatch · P99: 4.7s · 73% errors

marketplace-api · P99: 4.1s · 70% errors

catalog-service · P99: 3.4s · 64% errors

analytics-writer · P99: 3.3s · 63% errors

rate-limiter · P99: 5.3s · 75% errors

recommendation-engine · P99: 3.7s · 66% errors

route-optimizer · P99: 4.0s · 69% errors

session-service · P99: 4.3s · 71% errors

event-bus · P99: 3.2s · 62% errors

fraud-detection · P99: 3.8s · 68% errors
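One way to check whether the service figures above point at a shared dependency rather than 18 separate faults is to look at how tightly the numbers cluster. The sketch below is illustrative only: the values are copied from the service overview, while the names `TILES` and `summarize` are hypothetical and not part of any tool in the workspace.

```python
from statistics import mean

# (P99 latency in seconds, error rate in percent), copied from the service overview.
TILES = {
    "delivery-service": (4.2, 71),
    "search-service": (3.8, 68),
    "surge-pricing": (5.1, 74),
    "fulfillment-service": (4.0, 69),
    "notification-service": (3.6, 66),
    "payment-router": (4.4, 72),
    "order-tracking": (3.9, 67),
    "geo-service": (3.5, 65),
    "driver-dispatch": (4.7, 73),
    "marketplace-api": (4.1, 70),
    "catalog-service": (3.4, 64),
    "analytics-writer": (3.3, 63),
    "rate-limiter": (5.3, 75),
    "recommendation-engine": (3.7, 66),
    "route-optimizer": (4.0, 69),
    "session-service": (4.3, 71),
    "event-bus": (3.2, 62),
    "fraud-detection": (3.8, 68),
}


def summarize(tiles):
    """Return the count, range, and mean of error rates and P99 latencies."""
    p99s = [p99 for p99, _ in tiles.values()]
    errors = [err for _, err in tiles.values()]
    return {
        "count": len(tiles),
        "error_min": min(errors),
        "error_max": max(errors),
        "error_mean": mean(errors),
        "p99_min": min(p99s),
        "p99_max": max(p99s),
    }


if __name__ == "__main__":
    for key, value in summarize(TILES).items():
        print(f"{key}: {value}")
```

Every service sits between 62% and 75% errors, with a mean of 68.5%. A spread that narrow across unrelated services fits a single shared cause better than independent failures, though it does not by itself say what that cause is.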

Error pattern — consistent across all 18 services

[ERROR] cache: no primary node available for slot 4821

[ERROR] cache: connection refused — node 10.0.1.4:6379

[ERROR] cache: slot assignment error — cluster topology changed mid-request

[ERROR] cache: no primary node available for slot 9102

[ERROR] cache: MOVED redirect to node that is no longer primary

[ERROR] cache: connection refused — node 10.0.1.7:6379
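The sample lines above can be grouped by error kind to see which nodes and slots are named. The following is a minimal sketch using only the six lines shown; the category labels and the helper names `classify` and `summarize_logs` are assumptions made for illustration. Port 6379, numbered slots, and MOVED redirects resemble Redis Cluster conventions, but the scenario does not name the cache technology, so treat that as an unconfirmed guess.

```python
import re
from collections import Counter

# The six sample log lines, copied verbatim from the error pattern above.
LOG_LINES = [
    "[ERROR] cache: no primary node available for slot 4821",
    "[ERROR] cache: connection refused — node 10.0.1.4:6379",
    "[ERROR] cache: slot assignment error — cluster topology changed mid-request",
    "[ERROR] cache: no primary node available for slot 9102",
    "[ERROR] cache: MOVED redirect to node that is no longer primary",
    "[ERROR] cache: connection refused — node 10.0.1.7:6379",
]

NODE_RE = re.compile(r"node (\d+\.\d+\.\d+\.\d+:\d+)")  # host:port after "node"
SLOT_RE = re.compile(r"slot (\d+)")                      # numeric slot id only


def classify(line):
    """Map a log line to a coarse, hypothetical error category."""
    if "connection refused" in line:
        return "connection_refused"
    if "no primary node" in line:
        return "no_primary"
    if "slot assignment error" in line:
        return "slot_assignment"
    if "MOVED" in line:
        return "stale_redirect"
    return "other"


def summarize_logs(lines):
    """Count error categories and collect the distinct nodes and slots mentioned."""
    kinds = Counter(classify(line) for line in lines)
    nodes = sorted({m.group(1) for line in lines for m in NODE_RE.finditer(line)})
    slots = sorted({int(m.group(1)) for line in lines for m in SLOT_RE.finditer(line)})
    return kinds, nodes, slots


if __name__ == "__main__":
    kinds, nodes, slots = summarize_logs(LOG_LINES)
    print(dict(kinds))
    print(nodes)
    print(slots)
```

In this sample, two distinct nodes refuse connections and two distinct slots have no primary. Both the node failures and the routing errors are candidates for investigation in the cluster telemetry; six lines are too few to rank them.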

P1 · Production Incident

Cache Errors. 18 Services Down.

PagerDuty · 02:34 UTC

ALERT: cache-cluster errors elevated across all dependent services. 18 teams paged. Error rates 65–75% platform-wide.

18 services have simultaneously started failing with cache errors. Error rates jumped from near-zero to ~70% within minutes. The errors are consistent across all services: connection failures and slot assignment errors.

No individual service has a recent deployment. The cache cluster has been running for 47 days without incident.

You are the on-call engineer. Investigate the available telemetry, identify the root cause, and restore service.