Learn

sre.google·On-Call & SRE

Postmortem Culture: Learning from Failure

Google's framework for writing blameless postmortems that drive systemic improvements rather than assigning blame to individuals.

response.pagerduty.com·On-Call & SRE

PagerDuty Incident Response Guide

PagerDuty's open-source guide covering incident commander roles, communication templates, escalation paths, and postmortem practices.

google.github.io·Code Review

Google Engineering Practices: Code Review

Google's complete guide for both reviewers and authors — what to look for, how to communicate feedback, and how fast to respond.

stackoverflow.blog·Code Review

How to Make Good Code Reviews Better

Practical advice on what separates okay code reviews from ones that genuinely improve code quality and team culture.

blog.cloudflare.com·Postmortem

Cloudflare July 2, 2019 Outage

A single regular expression in Cloudflare's WAF drove CPU usage to 100% globally, taking down 13.5 million websites for 27 minutes.

slack.engineering·Postmortem

Slack's Outage on January 4th, 2021

On the first working day of the new year, a surge in traffic saturated Slack's AWS Transit Gateways, and the resulting packet loss cascaded into an outage for millions of users.

github.blog·Postmortem

GitHub October 2018 Incident Analysis

A 43-second network partition caused MySQL replicas to diverge, leading to 24 hours of inconsistency and widespread service degradation.

principlesofchaos.org·Debugging

Principles of Chaos Engineering

The foundational manifesto for chaos engineering: deliberately injecting failures in production to build confidence in a system's resilience before real failures expose its weaknesses.

sre.google·On-Call & SRE

Being On-Call

How Google structures on-call rotations, sets escalation policies, and manages the psychological impact of production responsibility.

increment.com·On-Call & SRE

On-Call: Increment Issue 1

A collection of essays from engineers at Stripe, PagerDuty, and others on the human and technical sides of being on-call.

aws.amazon.com·Reliability

Timeouts, Retries, and Backoff with Jitter

How AWS engineers implement timeouts, retries, and exponential backoff with jitter to prevent retry storms in distributed systems.
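As a rough illustration of the technique this article is about, the sketch below retries a failing call with capped exponential backoff and "full jitter" (each delay is drawn uniformly between zero and the current backoff ceiling). It is not AWS's implementation: the names `full_jitter_delay` and `call_with_retries` and the default values for `base`, `cap`, and `max_attempts` are assumptions chosen for the example, and the wrapped operation is assumed to enforce its own timeout.

```python
import random
import time


def full_jitter_delay(attempt, base=0.1, cap=5.0):
    """Return a random delay in [0, min(cap, base * 2**attempt)] seconds.

    The ceiling doubles with each attempt (exponential backoff) and is
    capped; the random draw spreads retrying clients out in time so they
    do not all retry at the same instant.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return random.uniform(0, ceiling)


def call_with_retries(operation, max_attempts=5, base=0.1, cap=5.0,
                      sleep=time.sleep):
    """Call operation() up to max_attempts times, sleeping between failures.

    `operation` is a hypothetical zero-argument callable that is expected
    to apply its own timeout and raise an exception on failure. Catching
    every Exception is for illustration only; real code would retry only
    on errors known to be transient.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(full_jitter_delay(attempt, base, cap))
```

Passing `sleep` as a parameter is only a convenience for testing without real delays. Bounding `max_attempts` matters as much as the jitter, since unbounded retries are one of the ways a retry storm sustains itself.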

mtlynch.io·Code Review

How to Do Code Reviews Like a Human

Michael Lynch's guide to giving code review feedback that is kind, specific, and actionable, without sounding robotic or condescending.

engineering.fb.com·Postmortem

Facebook BGP Outage — October 2021

A BGP configuration change withdrew all Facebook routes from the internet and locked engineers out of the tools needed to fix it.

sre.google·Debugging

Incident Response: The SRE Workbook

Google's practical guide to managing incidents: roles, escalation, communication, and what separates a well-run response from a chaotic one.

aws.amazon.com·Reliability

Avoiding Overload in Distributed Systems

How AWS uses load shedding, admission control, and graceful degradation to prevent cascading failures at global scale.
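To make the idea of load shedding concrete, here is a minimal sketch of one common form of admission control: a cap on in-flight requests, with excess requests rejected immediately instead of queued. It is an illustration under stated assumptions, not how AWS implements it; the `LoadShedder` class, the `handle` function, and the 503-style response tuple are all hypothetical names invented for the example.

```python
import threading


class LoadShedder:
    """Admit at most max_in_flight concurrent requests; reject the rest."""

    def __init__(self, max_in_flight):
        self._max = max_in_flight
        self._in_flight = 0
        self._lock = threading.Lock()

    def try_acquire(self):
        """Reserve a slot and return True, or return False if at capacity."""
        with self._lock:
            if self._in_flight >= self._max:
                return False
            self._in_flight += 1
            return True

    def release(self):
        """Free a slot previously reserved by try_acquire()."""
        with self._lock:
            self._in_flight -= 1


def handle(shedder, work):
    """Run work() if there is capacity, otherwise fail fast.

    Rejecting early keeps latency bounded for the requests that are
    admitted, instead of letting every request slow down together.
    """
    if not shedder.try_acquire():
        return 503, "overloaded, try again later"
    try:
        return 200, work()
    finally:
        shedder.release()
```

A fixed concurrency cap is the simplest possible policy. Shedding by request priority or by measured latency follows the same shape, with a different condition inside `try_acquire`.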