standards for both power and cooling.
This will ensure a level of maintainability
as well as resiliency in the event of a loss
of utility power or other unexpected
event.
Additionally, be sure that facilities are
designed to operate independently of the
electrical grid and “hardened” to withstand
outside influences such as earthquakes or
terrorist acts. Actions as simple as positioning bollards in front of an entrance
to deter a threat or placing a cement wall
between transformers to guard against a
catastrophic ground fault can minimize
risk. By performing an onsite inspection
and discussing different scenarios with
your staff, you can gain a solid understanding of what single points of failure might
exist within your infrastructure and what
plans may need to be in place to prevent
unauthorized intrusion, weather-related
threats, and other outside influences that
could impact operations.
When it comes to the data center network,
the same concepts apply. Are there
any points of commonality in your IP
network between your edge devices and
your egress point onto the Internet?
Don’t overlook seemingly small details.
If you have redundant network devices
but they connect to the same patch panel,
UPS (uninterruptible power supply)
system, breaker, or other infrastructure,
then all of that built-in redundancy at
the network device level is for naught.
Regardless of how good the MTBF (mean
time between failures) is for your hardware, it is inevitable that those components will eventually fail. The same can
be said for routing protocols. Eventually
human error, device failure, or software
bugs will impact routing. So, can you
fail over seamlessly between all devices
that might experience a failure? Is that
failover automatic, or would it require
manual changes?
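One way to make the commonality check above concrete is to record, for each
device in a redundant group, the infrastructure it depends on, and then look
for overlap. The following is a minimal sketch, not an inventory tool: the
device names, the dependency labels, and the `shared_dependencies` function
are all hypothetical.

```python
from itertools import combinations

def shared_dependencies(devices):
    """Return {(dev_a, dev_b): sorted list of dependencies common to both}.

    `devices` maps a device name to the set of infrastructure items
    (patch panel, UPS, breaker, ...) that it depends on. Only pairs with
    at least one item in common appear in the result.
    """
    overlaps = {}
    for a, b in combinations(sorted(devices), 2):
        common = devices[a] & devices[b]
        if common:
            overlaps[(a, b)] = sorted(common)
    return overlaps

# Hypothetical inventory for a "redundant" router pair.
redundant_pair = {
    "edge-router-1": {"patch-panel-A", "ups-1", "breaker-12"},
    "edge-router-2": {"patch-panel-A", "ups-2", "breaker-14"},
}

overlaps = shared_dependencies(redundant_pair)
for (a, b), common in overlaps.items():
    print(f"{a} and {b} share: {', '.join(common)}")
```

Any pair that shows up in the result still has a single point of failure,
no matter how redundant the devices themselves are.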
Take a serious look at sparing. Do you
have “hot” spares of every part onsite, or
would it require a service call to a vendor?
Do you have an assets depot where you
stock parts? How quickly can you get
parts onsite, should something fail? Once
the part is onsite, how quickly can you get
someone to assist with swapping it out? In
some cases, your vendor may be willing to
provide on-site inventory as part of their
contract (e.g., UPS parts kit, HVAC
compressors). These are all key questions to
ensure that your infrastructure and design
are adequate for disaster scenarios.
Documented Emergency and Response Plans
It is critical that you have well-documented emergency preparedness and
disaster response plans. While similar,
both plans should be specific to the geographic location and type of facility. These
plans identify actions that will prepare the
data center operations team in case of an
emergency, including the necessary steps
that must be taken before, during, and after
an event.
For example, a prototypical inclement
weather preparedness plan will
specifically address the risks associated with
severe weather, including tornadoes,
thunderstorms, hurricanes, and floods. The
preparedness plan should include specific
tasks the operations team should perform
at predetermined times leading up to the
event, such as arranging contractor
and supplier support, changing
staffing levels, and making hotel
reservations if an extended event is expected.
These tasks should be repeated at
regular intervals, with a final plan in place a
minimum of 12 to 24 hours before the event.
Customers expect and appreciate regular
communications during an event, even if
the update is as simple as “engineers are
on-site and monitoring the situation.”
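The idea of tasks tied to predetermined times before an event can be sketched
as a simple countdown checklist. This is only an illustration under stated
assumptions: the 72- and 48-hour offsets, the task wording, and the `schedule`
and `overdue` helpers are hypothetical, and the 24-hour deadline is one choice
within the 12-to-24-hour minimum described above.

```python
from datetime import datetime, timedelta

# (hours before the expected start of the event, task).
# The 72- and 48-hour offsets are illustrative assumptions.
CHECKLIST = [
    (72, "Confirm contractor and supplier support"),
    (48, "Adjust staffing levels and book hotel rooms"),
    (24, "Final plan in place"),
]

def schedule(event_start, checklist=CHECKLIST):
    """Return (due_time, task) pairs, earliest deadline first."""
    return sorted((event_start - timedelta(hours=h), task) for h, task in checklist)

def overdue(event_start, done, now, checklist=CHECKLIST):
    """Return tasks whose deadline has passed and that are not in `done`."""
    return [task for due, task in schedule(event_start, checklist)
            if due <= now and task not in done]

# Hypothetical storm expected at 18:00 on 10 September 2024.
event_start = datetime(2024, 9, 10, 18, 0)
plan = schedule(event_start)
for due, task in plan:
    print(due.isoformat(sep=" ", timespec="minutes"), "-", task)

late = overdue(
    event_start,
    done={"Confirm contractor and supplier support"},
    now=datetime(2024, 9, 8, 20, 0),
)
print("Overdue:", late)
```

A real plan would carry far more tasks and site-specific timings; the point is
only that each task has an explicit deadline that can be checked.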
For data center providers, documentation
should be easily accessible to customers
upon request in both soft and hard copy.
It should contain critical contact
information, including the provider’s
management team, as well as escalation
procedures to ensure that command and
control are maintained throughout the
event.
When it comes to your IP network, it
is imperative that you know how to react
in a disaster scenario, whether a problem
is caused by a hardware or network
failure. Do you have escalation procedures?
An on-call rotation? Who can assist, and
how quickly? Unlike data center facility
or system issues, where the cause of a
problem is often more obvious, network
failures often require deeper inspection
and detection before troubleshooting can
begin. Do you have automated tools that
monitor network health and routing
stability? If so, are response measures taken
automatically, or do they require human
intervention? Are those response
measures documented, and is everyone aware
of them?
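A documented escalation procedure can be as simple as a table of who is
notified after an alert has gone unacknowledged for a given time. The sketch
below assumes a hypothetical three-step policy; the roles, the 15- and
30-minute thresholds, and the `contacts_to_notify` function are illustrative
and not drawn from any particular monitoring product.

```python
# (minutes an alert has gone unacknowledged, who is notified at that point).
# Roles and thresholds are hypothetical.
ESCALATION_POLICY = [
    (0, "on-call network engineer"),
    (15, "secondary on-call engineer"),
    (30, "network operations manager"),
]

def contacts_to_notify(minutes_unacknowledged, policy=ESCALATION_POLICY):
    """Return everyone who should have been notified by now, in order."""
    return [who for threshold, who in policy
            if minutes_unacknowledged >= threshold]

# Example: an alert that nobody has acknowledged for 20 minutes.
notified = contacts_to_notify(20)
print(notified)
```

Writing the policy down as data makes it easy to review and to check that
everyone listed actually knows they are on it.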
Mock Disaster Drills
Site disaster preparedness plans may
look great on paper, but only through
testing and conducting drills will you truly
be prepared for an event. You should test
these plans at least two times a year. By
conducting simulations of an event, you
will be able to verify whether your
operations personnel know their
responsibilities and whether the infrastructure
equipment performs as intended. Aim
to perform quarterly or similar
simulations of equipment failures, power
outages, and other critical equipment
events that may occur as a result of a disaster.
The findings of these mock drills should
be documented on an ongoing basis, and
the training program should be modified
as needed to ensure all on-site personnel
are well trained in case of an event. In
essence, a data center operator’s disaster
preparedness philosophy should be, “If
we’re not finding problems when we test
our plans and equipment, we’re not testing
thoroughly enough!”
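The testing cadence described above (plan tests at least twice a year,
equipment simulations roughly quarterly) can be tracked with a small check
against the drill log. This is a sketch under stated assumptions: the drill
names, the 183- and 92-day limits used to approximate "twice a year" and
"quarterly", and the `overdue_drills` function are hypothetical.

```python
from datetime import date

# Maximum allowed days between runs of each drill type (assumed values).
MAX_GAP_DAYS = {
    "full plan test": 183,                # roughly twice a year
    "equipment failure simulation": 92,   # roughly quarterly
}

def overdue_drills(last_run, today, max_gap=MAX_GAP_DAYS):
    """Return sorted names of drills never run or run too long ago.

    `last_run` maps a drill name to the date it was last performed.
    """
    return sorted(
        name for name, gap in max_gap.items()
        if name not in last_run or (today - last_run[name]).days > gap
    )

# Hypothetical drill log.
drill_log = {
    "full plan test": date(2024, 1, 15),
    "equipment failure simulation": date(2024, 5, 1),
}

print(overdue_drills(drill_log, today=date(2024, 6, 1)))
print(overdue_drills(drill_log, today=date(2024, 9, 1)))
```

A drill that has never been logged is treated as overdue, which errs on the
side of testing more rather than less.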
If you are a provider, you should also
be an integral part of your customers’ IP
network disaster testing. They can run
their own network disaster readiness tests,
but that, in and of itself, is not enough.
As they run tests on their end, you’ll need
to be able to answer questions including:
What do you see when they fail over?
Do you have to take corrective action on
your side? Who should they contact, work
with, and escalate to? Have you made any
changes to your infrastructure since their
last disaster drill that might have changed
how their drill needs to be run? Be
sure to communicate any relevant network
changes immediately to ensure that
customers’ disaster preparedness isn’t negatively
impacted. If your facility is part of a
multi-tenant building, you may want to plan
your testing in conjunction with
established building test schedules.