DATA CENTER DESIGN
10 Steps for Avoiding
Data Center Disaster
Due to Human Error
involved is important. When mistakes happen, the human error
should be investigated, a root cause analysis completed, and the
findings communicated and acted upon. For organizations with
multiple sites, that communication can contribute to business
agility. By sharing lessons learned across sites, data center managers become more aware of potential issues and are better able
to address them in a timely manner. After all, average downtime
cost, based on the whitepaper mentioned at the beginning of this
article, is more than $90 a second, so every second counts.
Implement Training
Ongoing training and precautionary policy development
are also essential to preventing human error. Since data centers
themselves are highly complex, interconnected systems, training programs and exercises among the different IT groups that
emphasize a holistic approach to data center management could
help address the problem. Training should also encompass safety
best practices, as well as compliance with the standards of the
Occupational Safety and Health Administration (OSHA) and the
National Fire Protection Association (NFPA). At the very least,
data center managers should ensure that all individuals with
access to UPS equipment or other systems have received basic
training and can easily obtain procedures for operating the systems properly in order to avoid costly mistakes.
Control Documentation with Revision Approval Process
Ensuring consistent system operation through documentation is not always a given. Sometimes data center managers
get too comfortable with operating the systems, do not follow
procedures, forget or skip steps, or perform the procedure from
memory and inadvertently shut down the wrong equipment. It is
critical to keep all operational procedures up to date and follow
those instructions to operate the system.
This is why a documented method of procedure (MOP) that
clearly defines the tasks that take place during a maintenance
window is crucial. A standard MOP can be the answer to many
unforeseen human errors. This step-by-step, task-oriented procedure mitigates or eliminates the risk associated with performing maintenance. Ensure back-out plans are included in case of
unanticipated events. Not only is a documented method crucial,
but having it audited regularly for accuracy is important. There
should be a pre-defined approval process for each revision in
order to ensure all parties involved have the latest information.
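One way to picture the checks described above is a small pre-maintenance readiness test. This is a minimal sketch, not a standard MOP format: the `Step` and `MOP` classes, the `readiness_problems` function, and the approver roles are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Step:
    action: str          # what the technician does
    back_out: str = ""   # how to undo it if something unexpected happens


@dataclass
class MOP:
    title: str
    revision: int
    approved_by: List[str] = field(default_factory=list)
    steps: List[Step] = field(default_factory=list)


def readiness_problems(mop: MOP, required_approvers: List[str]) -> List[str]:
    """Return reasons this MOP revision should not be used yet (empty if none)."""
    problems = []
    # Every required party must have signed off on this specific revision.
    missing = [a for a in required_approvers if a not in mop.approved_by]
    if missing:
        problems.append(
            f"revision {mop.revision} lacks approval from: {', '.join(missing)}"
        )
    if not mop.steps:
        problems.append("no steps defined")
    # Every step needs a documented back-out plan.
    for number, step in enumerate(mop.steps, start=1):
        if not step.back_out.strip():
            problems.append(f"step {number} has no back-out plan")
    return problems


# Hypothetical example: one approval missing, one step without a back-out plan.
example = MOP(
    title="UPS battery replacement",
    revision=3,
    approved_by=["facilities manager"],
    steps=[
        Step("Transfer load to bypass", back_out="Return load to inverter"),
        Step("Open battery breaker"),
    ],
)
for problem in readiness_problems(example, ["facilities manager", "IT manager"]):
    print(problem)
```

An empty result means the revision is fully approved and every step has a back-out plan. It does not replace the regular accuracy audit of the procedure itself.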
Update One-Line Diagrams & Post Equipment Map
Updated one-line diagrams should be standard in every data
center. More often than not, when you enter a large Fortune 500
data center you’ll find that an accurate one-line diagram is not
available. With design upgrades or equipment removal taking
place all the time, it’s common to have multiple drawings, but
it’s difficult to determine which is the most accurate. If the wrong
diagram is used when making adjustments to the data center,
human error is inevitable. Any time you make changes to the data
center, updating one-line diagrams is vital in order to know a data
center’s power and cooling capabilities and infrastructure. At the
minimum, an annual review of one-line diagrams and procedures
By AHMAD MOSHIRI
According to Confucius, the Chinese philosopher, the cautious seldom err. However, when it comes to data centers, human error is often the cause of unplanned downtime despite what I’m sure are the actions of well-intentioned individuals.
As most of you know, downtime is simply not an option for
many organizations such as banks, telecommunications companies,
Internet service providers, and cloud/co-location facilities. These
businesses rely heavily on the availability of their data centers.
Because of this reliance, the cost of downtime can be catastrophic.
According to a power vendor whitepaper that analyzed the
financial impact on infrastructure vulnerability, the average cost
of data center downtime was about $5,600 per minute. Based
on an average incident length of 90 minutes, a single downtime
event cost about $505,500.
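As a rough check on these figures, the arithmetic can be sketched in a few lines of Python. The constant and function names are illustrative and do not come from the whitepaper. With the rounded rate of $5,600 per minute, a 90-minute incident works out to about $504,000, slightly below the quoted $505,500, presumably because the per-minute figure is rounded.

```python
# Figures quoted in the article (rounded).
COST_PER_MINUTE_USD = 5_600
AVG_INCIDENT_MINUTES = 90


def cost_per_second(cost_per_minute: float) -> float:
    """Convert a per-minute downtime cost to a per-second cost."""
    return cost_per_minute / 60


def incident_cost(cost_per_minute: float, minutes: float) -> float:
    """Estimated cost of one downtime event of the given length."""
    return cost_per_minute * minutes


per_second = cost_per_second(COST_PER_MINUTE_USD)
per_incident = incident_cost(COST_PER_MINUTE_USD, AVG_INCIDENT_MINUTES)
print(f"Cost per second: ${per_second:,.2f}")
print(f"Cost per {AVG_INCIDENT_MINUTES}-minute incident: ${per_incident:,.0f}")
```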
More than half of respondents in the Ponemon Institute’s
“National Survey on Data Center Outages” cited accidental emergency power off (EPO) or human error as the root cause of their
outages over the previous two years. More than half of the 450
data center professionals polled also agreed that the majority of downtime events could have been prevented. Below are 10 best practices for mitigating human error in the data center.
Limit Access According to Secure Policies
There are many types of professionals linked to the data center
such as IT, emergency, security, and facility personnel, as well
as external vendors. All have varying technical backgrounds and
access levels, so limiting access to the data center can reduce your
risk of human-caused downtime.
Additionally, having a sign-in policy is critical. Organizations
without data center sign-in policies run the risk of security
breaches. Having a sign-in policy that requires an escort for visitors, such as vendors, will enable data center managers to know
who is entering and exiting the facility at all times. Once established, these access protocols should be documented and followed. Data center managers should also conduct periodic audits of
those accessing the data center to have a clear picture of who has
access and why. This audit can eliminate any unnecessary access.
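A periodic access audit of this kind could be sketched as follows. This is only an illustration under assumed conventions: the `AccessRecord` fields, the `flag_for_review` function, and the 90-day idle threshold are hypothetical and should be replaced by whatever the site's own access policy defines.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional, Tuple


@dataclass
class AccessRecord:
    name: str
    role: str                    # e.g. IT, facilities, security, vendor
    justification: str           # documented reason access was granted
    last_entry: Optional[date]   # most recent sign-in, None if never used


def flag_for_review(
    records: List[AccessRecord], today: date, max_idle_days: int = 90
) -> List[Tuple[str, str]]:
    """Return (name, reason) pairs for access that may be unnecessary."""
    flagged = []
    for record in records:
        if not record.justification.strip():
            flagged.append((record.name, "no documented justification"))
        elif record.last_entry is None:
            flagged.append((record.name, "access granted but never used"))
        elif (today - record.last_entry).days > max_idle_days:
            idle = (today - record.last_entry).days
            flagged.append((record.name, f"no entry in {idle} days"))
    return flagged


# Hypothetical access list for illustration.
records = [
    AccessRecord("Vendor A", "vendor", "", date(2024, 1, 1)),
    AccessRecord("Tech B", "facilities", "UPS maintenance", date(2024, 5, 1)),
    AccessRecord("Admin C", "IT", "rack installs", date(2023, 12, 1)),
    AccessRecord("Guard D", "security", "patrol", None),
]
for name, reason in flag_for_review(records, today=date(2024, 6, 1)):
    print(f"{name}: {reason}")
```

Flagged entries are candidates for review, not automatic revocation; the manager still decides whether the access is warranted.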
Regularly Communicate to Stakeholders
With the sheer number of people at all levels of the organization who are concerned about or responsible for some aspect
of data center operations, providing regular updates to all those