SeniorEng - Learn Real Software Engineering

Job Queue Degradation

Problem

The task-queue service is experiencing severe degradation. Queue depth has increased from approximately 1,200 jobs to over 22,800 jobs and continues to grow. Traffic is only 18% above normal month-end levels, yet completion rates have fallen sharply. No significant errors are being reported, but jobs are spending far longer in the system before completion.

Queue depth: 22,800 jobs (↑ rapidly)
Affected service: task-queue
Traffic condition: Month-end (+18%)
Incident started: 14 minutes ago

Service Details

Cloud: AWS us-east-1
Instances: 4 worker pods
Runtime: Python 3.9
Database: PostgreSQL 13
Queue: Redis Queue
Version: v2.1.0

Incident Workspace

Service Metrics

Environment: Production

Queue Depth (jobs), 1h average: 13k
Plotted values, -60m to now: 12k, 16k, 20k, 22k, 23k

Inbound Job Rate (jobs/min), 1h average: 106
Plotted values, -60m to now: 100, 101, 100, 101, 102, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 115, 117, 118

Job Completion Rate (%), 1h average: 65%
Plotted values, -60m to now: 98%, 97%, 96%, 94%, 92%, 89%, 85%, 80%, 74%, 67%, 60%, 54%, 50%, 47%, 45%, 44%, 43%, 42%
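
As a rough aid for reading these two charts together, the sketch below estimates how fast the backlog grows from the latest plotted inbound rate (118 jobs/min) and completion rate (42%). It rests on an assumption the page does not confirm: that the completion rate is completed jobs as a share of inbound jobs over the same interval. The function name is hypothetical.

```python
def backlog_growth_per_min(inbound_per_min: float, completion_pct: float) -> float:
    """Estimate jobs added to the backlog per minute.

    ASSUMPTION: completion_pct is completed jobs as a percentage of
    inbound jobs in the same interval. The dashboard does not define it.
    """
    if not 0.0 <= completion_pct <= 100.0:
        raise ValueError("completion_pct must be between 0 and 100")
    completed_per_min = inbound_per_min * completion_pct / 100.0
    return inbound_per_min - completed_per_min


# Latest plotted values: 118 jobs/min inbound, 42% completion.
growth = backlog_growth_per_min(118, 42)
print(f"backlog grows by about {growth:.1f} jobs/min")
```

Under the same assumption, the earliest plotted values (100 jobs/min at 98%) give a growth of about 2 jobs/min, so the estimate is most useful as a before/after comparison rather than an exact figure.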

Error Rate (%), 1h window: axis 0% to 5%, no values shown

Incident Commander Update

Queue depth is rising rapidly and approaching SLA breach thresholds. Jobs continue completing successfully, but throughput has dropped significantly.

You are the primary on-call engineer. Investigate the available telemetry, identify why queue throughput has collapsed, determine the true root cause, and restore service stability before customer SLAs are breached.
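
Restoring stability also means clearing the backlog that has already built up, so it can help to estimate how long that takes once throughput recovers. The sketch below is a minimal calculation assuming constant arrival and service rates. The backlog (22,800 jobs) and inbound rate (118 jobs/min) come from the incident data, while the function name and the 200 jobs/min post-fix throughput are hypothetical placeholders.

```python
import math


def minutes_to_drain(backlog: int, arrival_per_min: float, service_per_min: float) -> float:
    """Minutes until the backlog reaches zero at constant rates.

    Returns math.inf when the workers cannot outpace arrivals.
    """
    net_drain = service_per_min - arrival_per_min
    if net_drain <= 0:
        return math.inf
    return backlog / net_drain


# 22,800 queued jobs and 118 jobs/min inbound are taken from the incident data.
# 200 jobs/min is a HYPOTHETICAL post-fix throughput, not a measured figure.
eta = minutes_to_drain(22_800, 118, 200)
print(f"about {eta:.0f} minutes to clear the backlog")
```

If the service rate stays at or below the arrival rate, the function returns infinity, which matches the current situation: the queue cannot shrink until throughput exceeds inbound traffic.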
