AI is making it easier than ever to write code.

The real skill now is debugging, reviewing code, and preventing disasters in production.

SeniorEng puts you inside real production incidents and pull requests.

Investigate systems, click through logs, read the diffs and find the bug.

senioreng.dev/incidents/payment-service
Live Incident

Payment Service: Error rate 94% · 6 min ago

P0 · Medium
09:14:02 payment-svc-7d9f: pod started, connecting to DB
09:14:05 payment-svc-7d9f: health check OK
09:14:31 payment-svc-7d9f: received 142 requests
09:14:38 ERROR: connection pool exhausted (pool=20/20)
09:14:38 ERROR: timeout acquiring DB connection after 5000ms
09:14:39 ERROR: 503 - downstream DB unresponsive
09:14:41 WARN: retrying connection (attempt 2/3)
09:14:46 FATAL: all retries exhausted - rejecting request

Your Move

Hint available · Score tracked
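
The log excerpt in the sample incident above can be triaged with a few lines of code before reading it line by line. The following is a minimal sketch, not part of SeniorEng itself: the helper names `parse` and `first_failure` are hypothetical, and treating lines without a `WARN:`/`ERROR:`/`FATAL:` prefix as `INFO` is an assumption about the log format.

```python
import re
from collections import Counter

# Log lines copied from the sample incident card.
LOG = """\
09:14:02 payment-svc-7d9f: pod started, connecting to DB
09:14:05 payment-svc-7d9f: health check OK
09:14:31 payment-svc-7d9f: received 142 requests
09:14:38 ERROR: connection pool exhausted (pool=20/20)
09:14:38 ERROR: timeout acquiring DB connection after 5000ms
09:14:39 ERROR: 503 - downstream DB unresponsive
09:14:41 WARN: retrying connection (attempt 2/3)
09:14:46 FATAL: all retries exhausted - rejecting request
"""

LEVELS = ("WARN", "ERROR", "FATAL")
# HH:MM:SS timestamp, optional whitespace, then the message.
LINE_RE = re.compile(r"^(\d{2}:\d{2}:\d{2})\s*(.*)$")


def parse(log):
    """Return (timestamp, level, message) tuples for each log line."""
    events = []
    for raw in log.splitlines():
        match = LINE_RE.match(raw.strip())
        if not match:
            continue  # skip blank or unrecognized lines
        ts, msg = match.groups()
        # Assumption: lines with no explicit level prefix are informational.
        level = next((lv for lv in LEVELS if msg.startswith(lv + ":")), "INFO")
        events.append((ts, level, msg))
    return events


def first_failure(events):
    """Return the earliest non-INFO event, or None if the log is clean."""
    for event in events:
        if event[1] != "INFO":
            return event
    return None


events = parse(LOG)
counts = Counter(level for _, level, _ in events)
print(counts)
print(first_failure(events))
```

Counting by level and locating the first non-informational line only points at where to start reading. It does not identify the root cause.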

PR #4471: Add pagination to /orders endpoint

Medium · 15 min
  SELECT * FROM orders WHERE user_id = ?
+ LIMIT ? OFFSET ?
  -- returns paginated results
Is this safe to merge?
Approve · Block
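
One thing a reviewer might check in a diff like the one above is whether the paginated query has a deterministic order, since SQL does not guarantee row order without an `ORDER BY`. The sketch below illustrates that single check and is not necessarily the issue the exercise is looking for. It assumes an in-memory SQLite database and a hypothetical `orders` schema with `id` and `user_id` columns; the PR's real schema is not shown. `fetch_page` is an illustrative helper name.

```python
import sqlite3

# Hypothetical schema for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER)")
conn.executemany(
    "INSERT INTO orders (id, user_id) VALUES (?, ?)",
    [(i, 1) for i in range(1, 8)],  # seven orders for user 1
)


def fetch_page(conn, user_id, limit, offset):
    """Fetch one page of order ids with an explicit, stable sort key."""
    rows = conn.execute(
        "SELECT id FROM orders WHERE user_id = ? "
        "ORDER BY id LIMIT ? OFFSET ?",
        (user_id, limit, offset),
    ).fetchall()
    return [row[0] for row in rows]


pages = [fetch_page(conn, 1, 3, offset) for offset in (0, 3, 6)]
print(pages)
```

With the explicit sort key, consecutive pages are disjoint and together cover every row for the user.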

Problems

Train real-world engineering judgment.

Debugging Problems

How it works

1. You're paged in

An alert fires. Read the incident brief. What's down, what's burning.

2. Investigate on the left sidebar

Logs, metrics, traces, Slack threads, all on the left sidebar. The cause is buried in there. So are the red herrings.

3. Mitigate on the right panel

Take action on the right panel. Stuck? Use a hint. Your score tracks how fast you solve the incident and whether you reach for the right lever.

Hard

Hard
Knight Capital · 20–30 mins

$440M in 45 Minutes

Market opened 17 minutes ago. P&L is down $218M at $11M/minute. No errors. All 8 trading servers appear healthy. Inspired by a real incident.

Hard
20–30 mins

Auth Service Failure

Investigate a production outage in the Auth Service.

Hard
Meta · 20–30 mins

All Products Down. Zero External Traffic

All three products dropped to zero external traffic simultaneously. Internal health is green. Inspired by a real incident.

Hard
20–30 mins

Orders Placed. Never Arriving

Orders are being placed but never appearing. Investigate.

Hard
20–30 mins

Error Rate Climbing. Service Degraded

Error rate spiked after a recent deploy. Investigation is ongoing.

Hard
AWS · 20–30 mins

Object Storage Down. 100% 503s

All object storage operations in us-east-1 returning 503. Downstream services cascading. Inspired by a real incident.

Hard
20–30 mins

Error Rate Rising to 100%

Error rate was 28%. A fix was attempted. Now it's 100%.

♛ Premium

Hard
20–30 mins

OOM Kills Every 18 Hours

Service crashes with OOM every 18 hours. Memory climbs at 180MB/hour. CPU is normal.

♛ Premium

Hard
GitHub · 20–30 mins

Database Inconsistent Writes

Database writes returning inconsistent results across nodes. Inspired by a real incident.

♛ Premium

Hard
20–30 mins

Queue Depth Incident

One merchant's payments are 45 minutes delayed. 14 other merchants: instant. Error rate: 0%.

♛ Premium

Hard
Cloudflare · 20–30 mins

19 Datacenters Down. Simultaneously.

A single network config change took 19 Cloudflare PoPs offline at once. 80% of requests failing globally. Inspired by a real incident.

♛ Premium

Hard
20–30 mins

Checkout, Cart, Billing. 100% Errors

Three services failed at the same time. The service they depend on recovered 4 minutes ago.

♛ Premium

Hard
CrowdStrike · 20–30 mins

Every Windows Host. Simultaneously.

8.5 million Windows machines BSOD'd simultaneously. No code deploy in 72 hours. macOS and Linux hosts are completely fine. Inspired by a real incident.

♛ Premium

Hard
20–30 mins

Cache Errors. 18 Services Down.

18 unrelated services are all failing with cache errors. Memory is at 42%. Replication is fine.

Code Review Problems

How it works

1. Read the PR

A pull request just landed for review. Go through the changes and understand what's changing and why.

2. Spot the issues

Find the bugs, security holes, and performance traps hiding in the code. The left sidebar has context and Slack threads. Some issues are obvious. Most aren't.

3. Make your verdict

Approve, request changes, or block on the right panel. Hints available if you need them. Your score tracks what you caught and what you missed.

Hard

Hard
20–25 mins

Added Lock to Prevent Duplication

Review a distributed Redis lock PR.

Hard
20–25 mins

The Retry Fix

Review an idempotent payment retry PR for a payments startup.

Hard
Cloudflare · 20–25 mins

WAF Rule: SQL Injection Detection

Review a Lua WAF rule that runs on every HTTP request at the edge. Inspired by a real incident.

Hard
20–25 mins

Link Preview Cards

Review a link preview feature PR for a B2B messaging app on AWS.

♛ Premium

Hard
20–25 mins

Remove legacy_plan_id Migration

Review a schema cleanup PR that removes a deprecated column.

♛ Premium