Curated reading for production engineers.
GitLab Database Deletion Incident
During a stressful incident, an engineer accidentally ran rm -rf on the primary production database server instead of the secondary, losing about six hours of data.
Post-Mortems Collection
A community-maintained list of 100+ public postmortems from major companies — the single best resource for studying real incidents.
Postmortem Culture: Learning from Failure
Google's framework for writing blameless postmortems that drive systemic improvements rather than assigning blame to individuals.
PagerDuty Incident Response Guide
PagerDuty's open-source guide covering incident commander roles, communication templates, escalation paths, and postmortem practices.
Google Engineering Practices: Code Review
Google's complete guide for both reviewers and authors — what to look for, how to communicate feedback, and how fast to respond.
How to Make Good Code Reviews Better
Practical advice on what separates okay code reviews from ones that genuinely improve code quality and team culture.
Cloudflare July 2, 2019 Outage
A single regular expression in Cloudflare's WAF drove CPU usage to 100% globally, taking down 13.5 million websites for 27 minutes.
Slack's Outage on January 4th, 2021
On the first working day of the new year, a post-holiday traffic surge overloaded the AWS Transit Gateways in Slack's network, and the resulting saturation cascaded into a full outage for millions of users.
GitHub October 2018 Incident Analysis
A 43-second network partition caused MySQL replicas to diverge, leading to 24 hours of inconsistency and widespread service degradation.
Principles of Chaos Engineering
The foundational manifesto for chaos engineering: deliberately injecting failures in production to build confidence in a system before real failures test it for you.
Being On-Call
How Google structures on-call rotations and escalation policies, and how it manages the psychological impact of production responsibility.
On Call — Increment Issue 28
A collection of essays from engineers at Stripe, PagerDuty, and others on the human and technical sides of being on-call.
Timeouts, Retries, and Backoff with Jitter
How AWS engineers implement timeouts, retries, and exponential backoff with jitter to prevent retry storms in distributed systems.
How to Do Code Reviews Like a Human
Michael Lynch's guide to giving code review feedback that is kind, specific, and actionable — without sounding robotic or condescending.
Facebook BGP Outage — October 2021
A BGP configuration change withdrew all Facebook routes from the internet and locked engineers out of the tools needed to fix it.
Incident Response — The SRE Workbook
Google's practical guide to managing incidents — roles, escalation, communication, and what separates a well-run response from a chaotic one.
Avoiding Overload in Distributed Systems
How AWS uses load shedding, admission control, and graceful degradation to prevent cascading failures at global scale.
Leader Election in Distributed Systems
How distributed systems safely agree on a single leader, and why getting this wrong causes split-brain failures and data corruption.