[BreachExchange] IT Postmortems: How to continuously improve by learning from failure and success
Destry Winant
destry at riskbasedsecurity.com
Fri Aug 24 09:50:54 EDT 2018
http://www.bmc.com/blogs/it-postmortems/
The worst thing that can happen to an IT team is a production outage –
critical systems, services, or data are unavailable. No matter what,
you immediately go from a normal day to feeling stressed, angry,
frustrated, and pressured to get it fixed ASAP.
Once you have the problem fixed and major systems restored, you
probably want to forget the whole thing ever happened. Don’t.
Instead, reflect back on what went wrong to determine a way to
minimize the chances this exact outage will happen again. The good
news is you don’t have to create a process for this reflection from
scratch. The even better news is you can do it in just a couple of hours, if
you’re really focused. The process is called an IT postmortem and you
can follow some very specific templates, too.
What is a postmortem?
Performing a postmortem may sound a little dark and depressing, but
it’s actually meant to shed light on a significant problem. A
postmortem process comes at the end of a project and helps you both
determine and analyze successes, non-successes, and failures. The
outcome of this process is usually a report that aims to inform best
practices and mitigate risks in the future. You may know it by other
names like lessons learned.
In IT, postmortems tend to be very focused: they follow a severe
problem, such as an event that has an immediate impact on users. This
could be an outage, downtime, or data loss.
The problem with IT postmortems
The idea of an IT postmortem probably isn’t foreign to you. In fact,
maybe you’ve been involved in one but decided to scrap it for more
“important” work. Or, maybe you filed the report but now that it’s
hidden away somewhere, the recommendations therein haven’t been
adopted.
These are the two biggest problems with creating IT postmortems:
people dismiss them as non-essential, so the reports aren’t always
read, let alone adopted, by the people who can effect change. Because
of this, many people immediately see postmortems as an unworthy
investment of time and resources.
Depending on the workplace, you may think that it’s just a blame game:
determining who did what incorrectly at a moment of significance. Or
you may just think your memory is better than it actually is, that
you’ll remember what to do or not do the next time this arises.
For a postmortem to be useful, it must provide specific
recommendations for changes, such as policy or processes. If it’s just
documenting for documenting’s sake, it’s a waste of everyone’s time.
Creating a good IT postmortem
The responsibility to research, write, and publish a postmortem report
lies with the project manager or the person most responsible for a
particular outage or data loss. (By responsible for, we mean the
person who immediately begins fixing it, not the person who caused it
– as many times, these outages occur without human interference.)
An IT postmortem report does not need to be complicated. In fact, its
simplicity encourages completion. A good report should list a lot of
information, but most of it is readily known or quickly determined
upon addressing the problem. Reports should include the following
information:
- Report details: title, date, authors, status, and summary
- Problem details: size and time of event, software used, impact and
objects, detection
- Resolution details: triggering event(s), root cause(s), and who worked on it
- Recommendations: lessons learned and action items to effect change
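As a rough illustration only, the four groups of fields above could be
captured in a small Python record so that every report gets filled in
the same way. The class name, field names, and the missing_sections
helper are hypothetical, chosen to mirror the list; they do not come
from any particular template.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PostmortemReport:
    """Hypothetical record mirroring the four groups of report fields."""

    # Report details
    title: str
    date: str
    authors: List[str]
    status: str
    summary: str
    # Problem details
    event_size_and_time: str = ""
    software_used: List[str] = field(default_factory=list)
    impact: str = ""
    detection: str = ""
    # Resolution details
    triggering_events: List[str] = field(default_factory=list)
    root_causes: List[str] = field(default_factory=list)
    responders: List[str] = field(default_factory=list)
    # Recommendations
    lessons_learned: List[str] = field(default_factory=list)
    action_items: List[str] = field(default_factory=list)

    def missing_sections(self) -> List[str]:
        """Return the names of fields that are still empty."""
        return [name for name, value in vars(self).items() if not value]
```

Calling missing_sections() on a draft gives a quick list of what still
needs to be written before the report is published.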
For a little more structure, make sure your IT postmortem answers
these questions:
What happened?
Why did it happen? This can include:
- Identifying major events
- Isolating root causes, if possible
- Looking at technical pieces: Were design, process, or poor maintenance
the underlying cause or the trigger that led to a technical failure?
- Looking at non-technical pieces: How did organization, management,
and team environment ease or worsen the problem and its
resolution?
- What about the effect of things like culture, time crunches, and
budget pressures?
How did the team respond?
- Include each attempt to fix something, whether it resulted in a fix or not.
What steps will prevent this from occurring again?
- This is the crucial step, so it might feel the hardest.
- Create an action plan that continues to implement the successes and
begins to address what didn’t work.
- Be bold in identifying big sweeping changes that need to occur, but
might be beyond your authority or budget.
- Identify smaller changes that take no time or money to implement,
perhaps just a process change or an added step that can verify
something.
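One possible way to keep the two kinds of change apart, sketched here
in Python under stated assumptions: each action item carries a flag
saying whether it is beyond your own authority or budget, and the plan
is split into items to do now and items to escalate. The ActionItem
class, the flag name, and the split_action_plan function are all
hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ActionItem:
    description: str
    # Hypothetical flag: True for big changes beyond your authority or budget.
    needs_budget_or_approval: bool


def split_action_plan(items: List[ActionItem]) -> Dict[str, List[str]]:
    """Group action items into quick changes and changes to escalate."""
    plan: Dict[str, List[str]] = {"do_now": [], "escalate": []}
    for item in items:
        key = "escalate" if item.needs_budget_or_approval else "do_now"
        plan[key].append(item.description)
    return plan
```

The "escalate" list is what you would hand to decision makers, while
the "do_now" list holds the process changes and added verification
steps the team can adopt right away.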
Tips for conducting IT postmortems
- Do it right away! The time for a postmortem is immediately after
you’ve wrapped the project or as soon after the triggering incident as
possible, especially if it had an immediate impact on users, such as
an outage, downtime, or data loss. The postmortem process should be
built into your scheduling. If not, you lose precious recall around
exactly what happened and how good or bad something was. We tend to
remember really bad things, gloss over other things, and forget our
successes.
- Do it quickly. Do not spend a lot of time on this – as the project
manager, you should have the answers to most tracked items. It can
take a quick half-day or even just 1-2 hours to provide impactful
information.
- Use a tried and true template. You’re not writing award-winning
stuff here; it’s the content and recommendations that matter. A good
template should track a lot of things, worrying little about how well
written it is. A strong template also helps you get a postmortem
completed in an afternoon, so that it doesn’t have to take a long
time. (A quick online search turns up dozens of templates – experiment
to find what works best for your team.)
- Involve more parties. Different people have different insights, and
involving the whole team prevents scapegoating. This can be as simple
as asking each person who was involved to send the thing they thought
went well when dealing with the problem and the thing that went the
worst.
- Ignore punishment. The point of a postmortem is not to find fault,
place blame, or punish. The point is to improve, so encourage honesty.
- Track positives and negatives. Not all postmortems have to be gloom
and doom – some can highlight positives in a process that you may not
have been aware of. In that case, perhaps your recommendation is to
roll out these positives more widely.
- Publish the report. Postmortems don’t have to lurk in a basement
storage area, among old files. In fact, you don’t even have to print
it out – simply share the findings with the team, the department, or
the company and decision makers as a whole, whatever makes sense for
your work environment. A bonus: publishing will help you keep things
short and concise, too!
The outcome of (and attitude around) IT postmortems won’t improve if
you continue to minimize their importance. Next time
you create a postmortem, consider following a reliable template and
commit to implementing the changes, as much as you have the authority
to do so.