CASE STUDY
Managing Downtime
in a Hyper Connected
Web 2.0 World
By JACQUES GREYLING
On June 29, Rackspace experienced a localized outage in its Dallas-Fort Worth, Texas (DFW) data center, its first outage in more than
two years. The following week, the same
data center experienced another outage,
this time shorter in duration and affecting fewer customers. What
resulted was an important learning opportunity for the team at Rackspace, from
technical preparations to communication
best practices. During the outages, the
number one priority was putting our customers first and, as a result, maintaining
Rackspace’s integrity and customer relationships while also developing an ambitious plan to reinvest in our infrastructure
to prevent future outages.
What Rackspace is Doing Now
Rackspace has learned from this
experience and has developed a plan of
action:
1. Put the best people on it, and bring in the
experts. A team of the best Rackspace talent
from the US and the UK has been brought
together to focus on the issues. The team will
be joined by top talent from our vendors, as
well as knowledgeable outside consultants
to ensure any and all known and unknown
issues are considered and resolved.
2. Assess the status of the infrastructure. The
Rackspace team is combing through the
data center and assessing every link in the
chain.
3. Improve standard operating procedures.
Increase the frequency of testing, monitoring
and measurement programs within the data
center. Maintenance schedules will change
and the level of detail reviewed internally and
shared externally will increase.
4. Invest. Continue to invest in the data center
infrastructure. Investment in additional
information systems as appropriate will also
be made to support new measuring and
management procedures.
What Happened
Rackspace operates nine data center
facilities across the world, including the
one in DFW. Over a period of ten days,
Rackspace experienced two power disruption incidents in one of the three phases of that
data center, which could not be prevented
by the redundant design of Rackspace’s
systems. While limited to a single phase of
this data center, the disruptions did impact
some of our customer base served by that
phase of our data center.
How Rackspace Responded
As is always the case, Rackspace’s first
priority was getting its customers back up
and running. Customer uptime is a principle at the heart of Rackspace Fanatical
Support. Rackspace and its team of
Rackers (Rackspace’s term for its employees) took an “all hands on deck” approach
to remedying the situation and minimizing
the impact of the outages. For Rackspace,
this meant calling in teams who were not
scheduled to work, and dedicating extra
hours to make sure that customer issues
were addressed in a timely manner. When
the main phone lines were busy, Rackers
turned to Twitter as a vehicle for customer
communication, supplying mobile phone
numbers to ensure customers had as many
points of contact into Rackspace as possible.
While the team in the DFW data center
worked tirelessly to identify the source
of the disruption as well as to resolve it,
Rackers on the support teams turned to
social media channels, like Twitter and
Rackspace’s corporate blog (www.rackspace.com/blog), to keep customers up to date on the progress made in returning
to normal operations. Transparency and
regular communication were important,
and Rackers on Twitter responded directly
to customers and made regular updates via
the @Rackspace handle. After Rackspace
was able to obtain details and a root cause
analysis regarding these disruptions, CEO
Lanham Napier, who strongly believes that
forthcoming and honest communication is
the best way to maintain customer trust
and satisfaction, even posted a video blog
to the Rackspace blog explaining the cause
of the outages. Another central tenet of
Fanatical Support is taking responsibility.
Regardless of the source of these disruptions, Rackspace did not make excuses
or point fingers. Instead, Rackspace took
responsibility and fully honored its service level agreements (SLAs) for its customers. Rackspace has the best SLAs in
the industry and will not hesitate to make
it right with its customers when there is a
disruption.
Conclusion
While no data center is risk-free, managing and mitigating the risk to acceptable
levels is paramount at Rackspace. In the
case of the Rackspace DFW data center,
the power infrastructure has been stabilized, although Rackspace will continue
to be hyper-vigilant in monitoring and
responding to any irregularity.
Rackspace’s goal is to use this unfortunate experience to grow to be better
and stronger, in both technical innovation
and customer relationships. Based on
initial feedback from customers and the
industry, everyone at Rackspace is proud
of the way the incidents were handled and how the
company communicated about the outages. Even with a temporary setback, it
is more important to Rackspace to maintain long-term credibility and trust as a
world-class hosting organization that will
continue to evolve and learn from these
experiences.
Jacques Greyling, operations director at
Rackspace, is responsible for a variety of
groups including network security, managed backup, SAN infrastructure, DC engineering, business services and operations.
Greyling is an eight-year veteran of Rackspace. He previously held engineering roles at organizations including
Nissan and Datacentrix.
32 DISASTER RECOVERY JOURNAL FALL 2009