Learn

Curated reading for production engineers.

Junior

GitLab Database Deletion Incident

An engineer responding to a stressful replication incident accidentally ran rm -rf on the primary production database instead of the secondary, losing about six hours of data.

about.gitlab.com·Postmortem
Read →
Junior

Post-Mortems Collection

A community-maintained list of 100+ public postmortems from major companies — the single best resource for studying real incidents.

github.com·Debugging
Read →
Junior

Postmortem Culture: Learning from Failure

Google's framework for writing blameless postmortems that drive systemic improvements rather than assigning blame to individuals.

sre.google·On-Call & SRE
Read →
Junior

PagerDuty Incident Response Guide

PagerDuty's open-source guide covering incident commander roles, communication templates, escalation paths, and postmortem practices.

response.pagerduty.com·On-Call & SRE
Read →
Junior

Google Engineering Practices: Code Review

Google's complete guide for both reviewers and authors — what to look for, how to communicate feedback, and how fast to respond.

google.github.io·Code Review
Read →
Junior

How to Make Good Code Reviews Better

Practical advice on what separates okay code reviews from ones that genuinely improve code quality and team culture.

stackoverflow.blog·Code Review
Read →
Mid-level

Cloudflare July 2, 2019 Outage

A single regex in Cloudflare's WAF caused a global 100% CPU spike, taking down 13.5 million websites for 27 minutes.

blog.cloudflare.com·Postmortem
Read →
Mid-level

Slack's Outage on January 4th, 2021

A traffic surge on the first working day of the new year overloaded Slack's AWS Transit Gateways, and the resulting packet loss cascaded into a full outage for millions of users.

slack.engineering·Postmortem
Read →
Mid-level

GitHub October 2018 Incident Analysis

A 43-second network partition caused MySQL replicas to diverge, leading to 24 hours of inconsistency and widespread service degradation.

github.blog·Postmortem
Read →
Mid-level

Principles of Chaos Engineering

The foundational manifesto for chaos engineering: deliberately injecting failures in production to build confidence that the system can withstand them before a real outage puts it to the test.

principlesofchaos.org·Debugging
Read →
Mid-level

Being On-Call

How Google structures on-call rotations and escalation policies, and how it manages the psychological impact of production responsibility.

sre.google·On-Call & SRE
Read →
Mid-level

On Call — Increment Issue 28

A collection of essays from engineers at Stripe, PagerDuty, and others on the human and technical sides of being on-call.

increment.com·On-Call & SRE
Read →
Mid-level

Timeouts, Retries, and Backoff with Jitter

How AWS engineers implement timeouts, retries, and exponential backoff with jitter to prevent retry storms in distributed systems.

aws.amazon.com·Reliability
Read →
Mid-level

How to Do Code Reviews Like a Human

Michael Lynch's guide to giving code review feedback that is kind, specific, and actionable — without sounding robotic or condescending.

mtlynch.io·Code Review
Read →
Senior

Facebook BGP Outage — October 2021

A BGP configuration change withdrew all Facebook routes from the internet and locked engineers out of the tools needed to fix it.

engineering.fb.com·Postmortem
Read →
Senior

Incident Response — The SRE Workbook

Google's practical guide to managing incidents — roles, escalation, communication, and what separates a well-run response from a chaotic one.

sre.google·Debugging
Read →
Senior

Avoiding Overload in Distributed Systems

How AWS uses load shedding, admission control, and graceful degradation to prevent cascading failures at global scale.

aws.amazon.com·Reliability
Read →
Senior

Leader Election in Distributed Systems

How distributed systems safely agree on a single leader, and why getting this wrong causes split-brain failures and data corruption.

aws.amazon.com·Reliability
Read →