[BreachExchange] IT outages are inevitable, here’s how to effectively manage your next one

Thu Apr 25 09:03:28 EDT 2019

https://www.datacenterdynamics.com/opinions/it-outages-are-inevitable-heres-how-to-effectively-manage-your-next-one/

In the last few months, we’ve seen some major IT failures: a daylong
Wells Fargo outage that prevented customers from accessing their
accounts, an Amtrak failure that left 60,000 Chicago passengers
stranded, and a global outage of Gmail and Google Docs that prevented
people from using those products.

And then there was the VFEmail.net hack in February, which resulted in
complete loss of all client data – including backups.

These and similar IT problems offer us two important takeaways:

1. IT outages can happen to anyone (and will eventually happen to everyone).
2. The extent of damage your next IT outage causes depends on how well
you prepare for it right now.

It’s also important to note that over 60 percent of IT outages or
“disaster events” are caused by human error. So how can you minimize
the damage that your next IT outage causes to your revenue,
reputation, and customers?

First, make sure you have a business continuity plan (BCP) that
includes both a disaster recovery plan (which outlines how you’ll
handle your IT) and a plan for keeping the rest of the business going
(e.g., communicating if key channels are down, making sure key people
know what’s going on, establishing a meeting place, defining a chain
of command, etc.).

Here, I’ll outline four crucial steps for being effective on the IT side.

Define potential disaster scenarios

For most companies, there are two major IT disaster scenarios:

- System outage, in which some key part of your network or application
malfunctions and you or your services are “offline” for a period of
time. This is usually relatively easy to recover from: you come back
online with few transactions affected by the downtime.
- Data loss, in which you lose information, content, or data (either
your own or your clients’). It’s not always possible to recover from a
data loss, as in the VFEmail.net hack, in which all copies of backups
were deleted.

The first step to ensuring you’re ready for a disaster is
understanding your risk profile for these common types of outages:
what capabilities will be affected by a system outage? How crucial are
those capabilities to running your business? Will an outage cause data
loss? What other events might trigger data loss? Etc.

And again, remember that human error will be the most prevalent cause
of both types of disasters (as in the Amtrak incident, when a worker
fell on a circuit board during a server update).

Assess the potential damage to your business

This is a job for IT and other leaders to do together. The goal is to
understand how your business, as a whole, will be affected if its
individual pieces are down or if various types of data are lost.

In these conversations, aim to understand dependencies among
business-critical apps (e.g., you know you need the payment processing
app to be live, but does it depend on the inventory app to function?),
clarify the effect on users that outages will have, and assess the
financial impact of each minute of downtime for your business.
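
The dependency and cost questions above can be sketched in a few lines of
Python. This is only an illustration under stated assumptions: the app
names, the per-minute revenue figures, and the helper names
(affected_apps, downtime_cost) are hypothetical and do not come from any
real system.

```python
from dataclasses import dataclass, field


@dataclass
class App:
    name: str
    # Hypothetical estimate of revenue lost for each minute this app is down.
    revenue_per_minute: float
    # Names of the apps this one needs in order to function.
    depends_on: list[str] = field(default_factory=list)


def affected_apps(apps: dict[str, App], failed: str) -> set[str]:
    """Return the failed app plus every app that depends on it, directly or indirectly."""
    affected = {failed}
    changed = True
    while changed:  # repeat until no new dependents are found
        changed = False
        for app in apps.values():
            if app.name not in affected and any(d in affected for d in app.depends_on):
                affected.add(app.name)
                changed = True
    return affected


def downtime_cost(apps: dict[str, App], failed: str, minutes: float) -> float:
    """Estimated revenue lost if `failed` stays down for `minutes`."""
    per_minute = sum(apps[name].revenue_per_minute for name in affected_apps(apps, failed))
    return per_minute * minutes


# Illustrative figures only, not taken from the article.
apps = {
    "inventory": App("inventory", revenue_per_minute=0.0),
    "payments": App("payments", revenue_per_minute=50.0, depends_on=["inventory"]),
    "newsletter": App("newsletter", revenue_per_minute=1.0),
}
```

In this made-up example the inventory app earns nothing on its own, but
an inventory outage still costs money because payments depends on it.
That is the kind of hidden dependency these conversations are meant to
surface.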

Review your current disaster recovery plan

Once you know what kind of downtime your business can reasonably
afford, take a look at your current DR plan. If you’re like most
businesses, you have one but haven’t been diligent about updating it
or testing it regularly. Now’s the time to change that.

As you review your DR plan, consider the following:

- Does it reflect the realities of your business today, including
plans for business-critical apps as articulated in your earlier
conversations? If not, hop down to the next section, because you’ll
need to update it.
- Is it right-sized? IT teams are excellent at coming up with creative
ways to do DR. This is in part because these systems are their babies
and they’re very attuned to all the ways things can go wrong. But
elaborate DR is often more than a company needs – and more expensive
than the company can afford. If you’ve determined that you can afford
three days of downtime and your current DR plan has you back online in
six hours, it’s time to make some changes. Check the plan against your
recovery time objective (RTO) and recovery point objective (RPO) here.
- Have you tested it? I get it. Many DR plans are developed to check a
box or meet a regulatory requirement. But if you don’t test your plan,
it’s worthless to you in a real disaster. You have no way of knowing
whether it will actually prevent the kind of revenue loss and
reputational damage that unexpected outages and data loss can cause.
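
The right-sizing check in the list above comes down to comparing two
durations: the downtime the business says it can tolerate and the
recovery time the current plan delivers. A minimal sketch follows. The
function name and the overbuilt_factor threshold of 4 are assumptions
chosen for illustration, not an industry rule.

```python
from datetime import timedelta


def assess_recovery_target(
    tolerable_downtime: timedelta,
    planned_recovery: timedelta,
    overbuilt_factor: float = 4.0,  # hypothetical threshold; tune to your own cost picture
) -> str:
    """Classify a DR plan by comparing planned recovery time with tolerable downtime."""
    if planned_recovery > tolerable_downtime:
        # The plan recovers too slowly for what the business can afford.
        return "under-provisioned"
    if planned_recovery * overbuilt_factor <= tolerable_downtime:
        # The plan recovers far faster than needed and may cost more than necessary.
        return "possibly over-provisioned"
    return "right-sized"


# The example from the list: three days are affordable, the plan recovers in six hours.
verdict = assess_recovery_target(timedelta(days=3), timedelta(hours=6))
```

With the figures from the list, the verdict is "possibly
over-provisioned", which is the cue to revisit the plan's cost.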

Update and test your DR plan

I work with a lot of businesses. Most of them don’t regularly update
and test their DR plans. It’s a nice-to-have project in a world of
must-have projects. That’s a big problem because an outdated DR plan
is more or less worthless in the event of a real disaster.

Take these steps as you make changes:

- Assign someone to be in charge of DR and testing. This means someone
will be accountable if it goes wrong, which significantly increases
the chances that testing gets done.
- Make sure the C-suite is aligned on the importance of having a DR
plan and conducting regular stress tests. This is crucial for getting
the participation you’ll need from non-IT colleagues.
- Include a definition of “disaster.” Know when and how you’ll launch
your DR plan – after an hour of downtime? A day? Define, too, who
makes this call and who makes the call if that person is out.
- Put disaster-prevention rules in place. The Amtrak disaster I cited
earlier happened in part because the company did a server update
during peak usage hours. That is an incredibly preventable error: if
the worker had fallen on the circuit board in the middle of the night,
very few travelers would have been affected and the story may not have
made the news.
- Include a communication plan. Being transparent with stakeholders
during a disaster (“here’s what’s happening”) and after (“here’s what
happened and what we’re doing to improve performance in the future”)
will go a long way toward mitigating any reputational damage a
disaster may cause.
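
Several of these steps are easier to follow when they are written down
as explicit rules rather than left to judgment in the moment. The sketch
below shows one way to encode a disaster definition, a chain of decision
makers, and a change-window rule. Every value in it is hypothetical: the
one-hour threshold, the 06:00 to 22:00 peak window, and the role names
are placeholders to replace with your own policy.

```python
from datetime import datetime, time, timedelta
from typing import Optional

# Hypothetical policy values; none of these come from the article.
DECLARE_AFTER = timedelta(hours=1)   # downtime after which the DR plan is launched
PEAK_START = time(6, 0)              # start of peak usage hours
PEAK_END = time(22, 0)               # end of peak usage hours
DECISION_MAKERS = ["primary on-call lead", "backup on-call lead"]  # in order of precedence


def should_declare_disaster(outage_started: datetime, now: datetime) -> bool:
    """True once the outage has lasted at least DECLARE_AFTER."""
    return now - outage_started >= DECLARE_AFTER


def who_decides(available: set[str]) -> Optional[str]:
    """Return the first decision maker in the chain who is available, if any."""
    for person in DECISION_MAKERS:
        if person in available:
            return person
    return None


def change_allowed(at: datetime) -> bool:
    """Allow risky changes, such as server updates, only outside peak hours."""
    return not (PEAK_START <= at.time() < PEAK_END)
```

A rule like change_allowed would have flagged a server update scheduled
for mid-morning, and who_decides makes the fallback explicit when the
first person in the chain is out.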

Effective DR is all about details

While it’s true that every business should have and test a DR plan,
it’s also true that no two businesses are alike in what they need or
how they should respond to disasters. For any business, DR should be
based on two things: its risk profile and its ability to recover
from an event – large or small.

To make sure your next IT outage causes as little damage as possible
to your customers, your revenue, and your reputation, spend time
understanding the specifics of what can go wrong and how those
problems will affect your customers – and build a DR plan to minimize
that impact.