Increasing levels of resilience within organizations reinforce the need to apply the concept of maximum allowable downtime (MAD) effectively alongside recovery time objective (RTO). In the “classic” recovery scenario of a single production site and a single recovery site, RTO normally closely approximates MAD. This article explores the relationship between these concepts in the particular context of the highly resilient capabilities that many organizations possess. In those situations, RTO durations can be significantly longer than MAD without increasing risk or inviting other negative consequences.
Concepts
A capability of an organization can be a business function
or a technology application or system. An organization normally possesses numerous capabilities, in proportion to its size
and complexity. This portfolio of capabilities often possesses
varying degrees of resilience to disaster events. The level of
resilience derives from a number of factors, some of which an
organization may have purposefully crafted and some of which
may have arisen by chance.
Maximum allowable downtime is “the absolute maximum time that the system [or capability] can be unavailable
without direct or indirect ramifications to the organization”
(www.bcmpedia.org). MAD focuses entirely on the perspective of the capability’s customers (internal or external), answering the question, “How long can they manage without the
capability?” The capability’s importance and urgency determine the duration
of MAD.
The Business Continuity Institute’s (BCI) “Dictionary of
BCM Terms” identifies a number of related terms for MAD,
such as maximum allowable outage, maximum tolerable downtime, and maximum tolerable period of disruption. This article
utilizes maximum allowable downtime solely due to the useful
peculiarity of its acronym.
Recovery time objective is “the period of time within which
systems, applications, or functions must be recovered after an
outage. RTO includes the time required for: assessment, execution and verification” (DRJ Glossary of Terms).
RTO focuses on how quickly the business continuity team
can re-establish the affected capability in the recovery environment, and includes process-related transitional recovery
tasks when multiple sites are in production. RTO’s duration
emphasizes the physical recovery of people, technology and
facilities.
A capability is a “highly resilient capability” (HRC) when it is
dispersed across production sites in multiple locations that
are sufficiently separated (“geo-dispersed”) to be unlikely to
become inoperative due to the same disaster event. High resilience means an outage in one location does not significantly
degrade the HRC’s overall production capacity. A single-site
outage would be largely transparent to most, if not all, of the
capability’s users.
A “capability resilience level” (CRL) assigns a numeric
value to a capability’s resilience. CRL describes the relative degree to
which a capability can be impacted by a single disaster event.
Organizations may define increasing CRLs with criteria of
their choosing. In this article, the level equals the number of
geo-dispersed production sites. CRL1 corresponds to
one production site. As the lowest resilience level, CRL1
does not qualify as highly resilient. CRL2 equates to two
geo-dispersed production sites, and so on. Increases in the CRL correlate with increases in the capability’s resilience. Each organization can define the minimum CRL that corresponds to high
resilience.
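The CRL definition used in this article can be illustrated with a minimal Python sketch. The class names, the `hazard_zone` label, and the default threshold of 2 are assumptions made for illustration only; the article does not prescribe how an organization should decide whether two sites are geo-dispersed, nor which minimum CRL counts as high resilience.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ProductionSite:
    name: str
    # Hypothetical label: sites sharing a zone are assumed to be
    # vulnerable to the same disaster event.
    hazard_zone: str


@dataclass
class Capability:
    name: str
    sites: list[ProductionSite] = field(default_factory=list)

    def crl(self) -> int:
        """Return the number of geo-dispersed production sites.

        Assumption: sites in the same hazard zone count only once,
        because a single event could make them all inoperative.
        """
        return len({site.hazard_zone for site in self.sites})

    def is_highly_resilient(self, min_crl: int = 2) -> bool:
        # min_crl stands in for the organization-defined minimum value;
        # 2 is a placeholder, not a recommendation.
        return self.crl() >= min_crl


# Illustrative capabilities with made-up site names.
single_site = Capability("payroll", [ProductionSite("Site A", "zone-1")])
dispersed = Capability(
    "claims processing",
    [
        ProductionSite("Site A", "zone-1"),
        ProductionSite("Site B", "zone-2"),
        ProductionSite("Site C", "zone-3"),
    ],
)
co_located = Capability(
    "help desk",
    [
        ProductionSite("Building 1", "zone-1"),
        ProductionSite("Building 2", "zone-1"),
    ],
)
```

In this sketch, `single_site` is CRL1 and `dispersed` is CRL3. The `co_located` capability also comes out as CRL1, because its two buildings share a hazard zone and so are not geo-dispersed in the sense described above.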
A highly resilient capability may still have a relatively long
MAD, particularly if its resilience developed by chance rather than by design.
For example, the people performing the function may have gradually
dispersed to different buildings over time. The duration of the