
Post-mortem template (blameless)

A blameless template with a timeline, 5 whys, action items that each carry an owner and a deadline, and the "post-mortem theater" anti-pattern.

Act as a Staff SRE. Write a blameless post-mortem for incident {{incident_id}} (severity {{severity}}). The document should let someone new read it six months from now and understand what happened, why it happened, and what we changed.

Blameless principles

Attack systems, not people. "X deployed without review" → "the process allowed a deploy without review because the CI gate was disabled in the hotfix flow".
Describe the context in which the decision seemed right. "The on-call engineer rolled back manually because the automatic rollback did not start due to an expired token; the on-call engineer did not know about the token because it was not documented."
Hindsight bias is off limits. Do not write "we should have noticed"; write "there was no signal that anyone would have noticed".
The goal is to learn, not to punish. If the post-mortem contains the phrase "should have" or "should not have", rewrite it.
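The phrase check in the last principle can be partly automated before review. Below is a minimal sketch in Python, assuming the draft is available as plain text. The function name `find_blameful_phrases` and the pattern list are hypothetical; only the "should have" / "should not have" phrases come from the principle itself, and the other patterns are illustrative additions.

```python
import re

# Hypothetical pattern list. Only the "should have" family comes from the
# principle above; the remaining entries are illustrative assumptions.
BLAMEFUL_PATTERNS = [
    r"\bshould(?:n't| not)? have\b",
    r"\bfailed to\b",
    r"\bforgot to\b",
]


def find_blameful_phrases(text: str) -> list[tuple[int, str]]:
    """Return (line_number, matched_phrase) pairs for every blameful hit."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for pattern in BLAMEFUL_PATTERNS:
            for match in re.finditer(pattern, line, flags=re.IGNORECASE):
                hits.append((line_no, match.group(0)))
    return hits


draft = "Alice should have checked.\nThe CI gate was disabled in the hotfix flow."
for line_no, phrase in find_blameful_phrases(draft):
    print(f"line {line_no}: rewrite '{phrase}' as a systemic cause")
```

A check like this only flags wording. Whether the rewritten sentence really describes a systemic cause still needs a human reviewer.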

Document structure

1. Header

Title: Post-mortem: {service} — {short symptom}
Status: Draft / In review / Published
Severity: {{severity}}
Author: {IC name}
Reviewers: {manager, SRE lead, service owner}
Date of incident: {YYYY-MM-DD}
Date of post-mortem publish: {YYYY-MM-DD}

2. Summary (≤ 5 lines)

What: one sentence, "{service} was unavailable / degraded for {scope}".
When: "{start UTC} → {resolve UTC}, total {duration}".
Impact: "{N affected users} (~{Y%} of total) experienced {what}".
Cause (one-liner): "{root cause in plain English}".
Mitigation: "{what we did to stop}".

3. Impact (numbers, not judgments)

Users affected: an exact N (if known) or an upper bound with a justification.
Duration: TTD (time to detect), TTM (time to mitigate), TTR (time to resolve).
SLO budget burned: Y% of the monthly error budget.
Revenue impact: $X (estimated) if we can calculate it, otherwise "not quantified".
Customer support tickets: N opened during incident.
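The duration and error-budget figures above are simple arithmetic on timestamps. The sketch below shows one way to compute them in Python. It assumes all three durations are measured from the start of user impact (some teams measure TTM and TTR from detection instead) and an availability SLO over a 30-day window where every incident minute counts as fully bad. The function names and the sample timestamps are hypothetical.

```python
from datetime import datetime, timedelta


def incident_durations(start, detected, mitigated, resolved):
    """TTD, TTM and TTR, all measured from the start of user impact (assumption)."""
    return {
        "ttd": detected - start,
        "ttm": mitigated - start,
        "ttr": resolved - start,
    }


def error_budget_burned_pct(slo_target, bad_minutes, window_days=30):
    """Percent of the window's error budget consumed by `bad_minutes` of outage."""
    budget_minutes = (1.0 - slo_target) * window_days * 24 * 60
    return 100.0 * bad_minutes / budget_minutes


# Illustrative timestamps only, not taken from a real incident.
t0 = datetime(2026, 5, 17, 14, 0)
durations = incident_durations(
    start=t0,
    detected=t0 + timedelta(minutes=2),
    mitigated=t0 + timedelta(minutes=25),
    resolved=t0 + timedelta(minutes=40),
)
burned = error_budget_burned_pct(slo_target=0.999, bad_minutes=21.6)
print(durations["ttd"], durations["ttm"], durations["ttr"], round(burned, 1))
```

For a partial outage, scaling `bad_minutes` by the fraction of affected requests is a common adjustment. State whichever convention you use next to the number in the post-mortem.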

4. Timeline (UTC, real timestamps)

Record facts and decisions, not interpretations.

HH:MM — Deploy of service X v2.3.4 (deploy ID dep-12345)
HH:MM — Alert "high-error-rate" fired in #alerts-checkout
HH:MM — On-call @alice acked, looked at dashboard
HH:MM — Confirmed elevated 500s on checkout endpoint
HH:MM — Declared SEV2, IC assigned (@bob)
HH:MM — War room opened (#inc-2026-05-17-checkout)
HH:MM — Hypothesis: recent deploy. Verified: yes, dep-12345 30 min ago
HH:MM — Decision: rollback to v2.3.3
HH:MM — Rollback started
HH:MM — Rollback complete, metrics improving
HH:MM — Metrics stable at baseline for 15 min, resolved
HH:MM — Status page updated to resolved

Include rejected hypotheses; they are critical for future readers: "we considered X but ruled it out because Y".

5. Root cause analysis (at least 5 whys)

Do not stop at the first "because". Keep going until you reach the process and organizational level.

Why 1: Checkout returned 500. Why? — Service couldn't get DB connection.
Why 2: Couldn't get connection. Why? — Pool exhausted at 100 connections.
Why 3: Pool exhausted. Why? — New endpoint added in v2.3.4 had N+1 query pattern.
Why 4: N+1 not caught in review. Why? — PR reviewed by 1 engineer, no load test gate.
Why 5: No load test gate. Why? — Load tests run weekly, not per-PR. Decision was "expensive in CI time".

Categorize causes:

Technical: N+1 query, missing connection pool monitoring.
Process: no load test in PR pipeline, no review checklist for DB-heavy changes.
Organizational: team has no DB performance owner, no informal "DB review" culture.

6. What went well

Do not skip this section; it matters as much as "what went poorly".

Alert fired in 90 seconds (target: 2 min) ✓
On-call acked in 45 seconds ✓
Rollback runbook was up-to-date and worked first try ✓
War room collaboration was clean — no parallel debugging confusion ✓

7. What went poorly

IC assignment took 8 minutes (target: 3 min) — manager wasn't paged automatically on SEV2 declare
Status page updated only after 45 min (target: 15 min) — Comms role wasn't assigned until 30 min in
No one checked downstream dependencies during incident — could have widened the scope check

8. Action items

Every item = owner (a person, not a team) + deadline + tracker link + category.

| # | Category | Action | Owner | Deadline | Tracker |
|---|---|---|---|---|---|
| 1 | Prevention | Add load test gate to PR pipeline for endpoints touching DB | @alice (SRE) | 2026-06-15 | SRE-1234 |
| 2 | Detection | Add connection pool saturation alert (warn at 80%, page at 95%) | @bob (Service) | 2026-05-30 | SVC-567 |
| 3 | Mitigation | Auto-rollback on error rate > 10x baseline in 5 min post-deploy | @carol (CI/CD) | 2026-07-01 | CICD-89 |
| 4 | Process | Page manager auto on SEV2+ declare | @dave (IC tooling) | 2026-05-25 | IC-12 |
| 5 | Process | Comms role assignment in declare checklist | @eve (SRE) | 2026-05-20 | SRE-1235 |
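The rule that every action item needs a named owner, a deadline, a tracker link, and a category can be checked mechanically before publishing. The following is a minimal sketch in Python. The `ActionItem` class, the `validate_action_item` function, and the `TEAM_WORDS` heuristic are hypothetical names introduced for illustration; the category set mirrors the example table.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Categories as used in the example table above.
CATEGORIES = {"Prevention", "Detection", "Mitigation", "Process"}
# Hypothetical heuristic: words suggesting the owner is a group, not a person.
TEAM_WORDS = {"team", "squad", "everyone", "tbd"}


@dataclass
class ActionItem:
    category: str
    action: str
    owner: str
    deadline: Optional[date]
    tracker: str


def validate_action_item(item: ActionItem) -> list[str]:
    """Return a list of problems; an empty list means the item is well-formed."""
    problems = []
    owner_tokens = set(item.owner.lower().replace("@", " ").split())
    if not owner_tokens or owner_tokens & TEAM_WORDS:
        problems.append("owner must be a single named person")
    if item.deadline is None:
        problems.append("deadline is missing")
    if not item.tracker.strip():
        problems.append("tracker link is missing")
    if item.category not in CATEGORIES:
        problems.append(f"unknown category: {item.category}")
    return problems


good = ActionItem("Detection", "Add connection pool saturation alert",
                  "@bob", date(2026, 5, 30), "SVC-567")
bad = ActionItem("Detection", "Add monitoring", "Team", None, "")
print(validate_action_item(good))
print(validate_action_item(bad))
```

The owner check is only a word-list heuristic. A stricter variant could look the handle up in a staff directory, which is outside the scope of this sketch.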

9. Lessons (1-3 sentences for the weekly digest)

Lesson 1: DB-touching code without load testing is a recurring failure mode. We need a CI gate, not "remember to load test".
Lesson 2: SEV2 needs as much process rigor as SEV1 for declare/roles. The lighter process bites us on TTR.

Anti-pattern: post-mortem theater

The post-mortem is written, sits in Confluence, and nobody ever comes back to it. Three months later, the same incident happens again.

How to avoid it:

File action items in the tracker on the same day the post-mortem is published. Not "in my TODO", but in a real tracker with an owner and a deadline.
30/60/90 day check: an automatic ping to the action item's owner at each of these intervals. If the ticket is still open after 90 days, escalate to the owner's manager.
Post-mortem review in the quarterly retro: which action items are closed, which are not, and why. Team-level accountability.
Cross-reference: for every new incident, search the existing post-mortems for a similar root cause. If you find one, the incident is a regression, which is a separate and more serious category.
Do not publish a post-mortem without action items. If there is nothing to do, it is an incident log, not a post-mortem.
One owner per action item. "The team will do it" means nobody will.
Do not write a post-mortem just to tick a box. If you catch yourself thinking "nobody will read this" while writing, reconsider the audience; a 1-pager may serve better than 8 pages.
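The 30/60/90 day check from the list above can be reduced to a small decision function that a daily scheduled job calls for each open action item. The sketch below is written under two assumptions the text does not spell out: days are counted from the post-mortem publish date, and the day-90 checkpoint is itself the escalation point. The function name and return values are hypothetical.

```python
from datetime import date, timedelta

PING_DAYS = (30, 60)      # checkpoints that ping the owner
ESCALATE_AFTER_DAYS = 90  # assumption: day 90 itself triggers escalation


def follow_up_action(published: date, today: date, closed: bool) -> str:
    """Return 'none', 'ping_owner' or 'escalate_to_manager' for today's run."""
    if closed:
        return "none"
    age_days = (today - published).days
    if age_days >= ESCALATE_AFTER_DAYS:
        return "escalate_to_manager"
    if age_days in PING_DAYS:
        return "ping_owner"
    return "none"


published = date(2026, 6, 1)  # illustrative publish date
for offset in (30, 45, 60, 90):
    print(offset, follow_up_action(published, published + timedelta(days=offset), closed=False))
```

As written, an item that stays open past day 90 escalates on every daily run. Recording the last escalation date would be a natural extension if that is too noisy.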

Anti-patterns

❌ Blameful tone: "Alice deployed without checking". Rewrite it as a systemic cause.
❌ Action items without an owner: "Team should add monitoring". Who? When?
❌ Post-mortem written three weeks later: details are forgotten and the timeline is unreliable.
❌ Only technical causes, with no process or organizational ones: the incident will recur within a month for a different reason.
❌ "We were unlucky" / "freak accident": never. There is always a systemic cause; keep digging.
❌ Post-mortem not published widely: other teams do not learn from it and repeat the same mistakes.

Output format

Markdown following the structure above. Write a draft first, then review it with the reviewers (a 60-minute meeting, no more than 10 people, with everyone reading the draft asynchronously before the meeting).

Principle: a good post-mortem is a document that a new person will read a year from now, understand what happened, and not repeat it. If they do repeat it, the post-mortem was theater.
