• Server redirection, in which the nodes
monitor each other via heartbeats
and command the movement of client
systems from a failed node to surviving
nodes.
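The server-redirection approach can be sketched in a few lines. This is a minimal illustration, not a description of any particular product: the class `HeartbeatMonitor`, the function `redirect_clients`, the timeout value, and the round-robin reassignment policy are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    """Tracks the time of the last heartbeat received from each peer node."""
    timeout: float  # seconds of silence after which a node is presumed failed
    last_seen: dict = field(default_factory=dict)

    def record(self, node: str, now: float) -> None:
        """Note that a heartbeat arrived from `node` at time `now`."""
        self.last_seen[node] = now

    def failed_nodes(self, now: float) -> list:
        """Return the nodes whose last heartbeat is older than the timeout."""
        return sorted(n for n, t in self.last_seen.items()
                      if now - t > self.timeout)


def redirect_clients(assignments: dict, failed: list, survivors: list) -> dict:
    """Move clients of failed nodes to surviving nodes, round-robin.

    `assignments` maps client name -> node name. Clients on healthy
    nodes are left where they are.
    """
    if not survivors:
        raise RuntimeError("no surviving nodes to redirect to")
    result = dict(assignments)
    i = 0
    for client in sorted(assignments):
        if assignments[client] in failed:
            result[client] = survivors[i % len(survivors)]
            i += 1
    return result
```

In practice the timeout must be long enough to ride out brief network delays, since declaring a healthy node failed triggers an unnecessary redirection.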
Split-Brain Mode
The reliability of the replication network is paramount. Should the replication
link between two nodes fail and no further
action be taken, each node will continue to
process those transactions routed to it but
will be unable to replicate those changes to
the other nodes. This is called split-brain
mode. The database contents will diverge,
and transactions will be executed against
stale data. When the replication link is
restored, the databases can be synchronized by draining the changes that have
built up in the change queues. However,
data collisions are bound to happen during
the restoration process.
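To see why collisions arise during restoration, consider a deliberately simplified sketch of draining two change queues. The function name `drain_queues`, the representation of a change as a `(key, new_value)` pair, and the policy of leaving colliding keys untouched are all assumptions for illustration; a real replication engine carries far more information per change and applies its own resolution rules.

```python
def drain_queues(db_a: dict, db_b: dict, queue_a: list, queue_b: list) -> list:
    """Apply each node's queued changes to the other node's database.

    `queue_a` holds (key, new_value) changes made on node A during the
    outage, and `queue_b` those made on node B. A key changed on both
    nodes is a data collision: it is NOT applied here and is returned
    so that a resolution policy can deal with it separately.
    """
    keys_a = {key for key, _ in queue_a}
    keys_b = {key for key, _ in queue_b}
    collisions = keys_a & keys_b

    for key, value in queue_a:   # replay A's changes on B
        if key not in collisions:
            db_b[key] = value
    for key, value in queue_b:   # replay B's changes on A
        if key not in collisions:
            db_a[key] = value
    return sorted(collisions)
```

The longer the split-brain period lasts, the more keys both nodes are likely to have touched, and the longer the list of collisions that must be resolved.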
In some applications, split-brain operation may be acceptable. If it is not, one
of the nodes must be taken out of service
until replication can be restored. For
this reason, the replication links should
always be redundant and independent so
that no single failure can prevent replication.
Other Advantages of Active/Active Networks
In addition to providing continuous
availability via fast and reliable failover,
active/active networks provide many other
advantages:
• Planned downtime is eliminated since
upgrades can be rolled through the
application network one node at a time.
• It is easy to add capacity by simply
adding nodes. Bigger systems do not
have to be purchased.
• Load can be easily redistributed to
balance the network by reassigning users
to nodes.
• Performance can be improved by
stationing processing nodes close to
communities of users.
• Failover can be easily tested since it
is fast and reliable (a major reason for
failover faults is that failover testing with
active/backup systems is expensive and
risky and therefore is often not done).
• Lights-out operations can be easily
supported since the pressure to recover
a failed node is significantly reduced.
• All purchased capacity is usable. There
is no idle backup system sitting around
waiting to take over. However, there must
be enough capacity to handle the load
if a node fails. In a two-node network,
this means that both nodes must be able
to handle the full capacity, similar to an
active/backup system (though the load
will be split between the two nodes, thus
improving performance and providing
additional peak processing capacity
if needed). However, if there are four
nodes in the network, each must be
able to carry only one-third of the traffic
in order to survive a single node failure.
Thus, only 4/3 of the required capacity
must be purchased rather than twice the
capacity.
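The capacity arithmetic in the last point generalizes to any number of nodes. The sketch below assumes identical nodes and an evenly spread load; the function names and the `tolerated_failures` parameter are introduced here for illustration.

```python
from fractions import Fraction


def per_node_share(nodes: int, tolerated_failures: int = 1) -> Fraction:
    """Fraction of the total load each node must be able to carry so that
    the survivors can absorb everything after the tolerated failures."""
    survivors = nodes - tolerated_failures
    if survivors < 1:
        raise ValueError("at least one node must survive")
    return Fraction(1, survivors)


def total_capacity(nodes: int, tolerated_failures: int = 1) -> Fraction:
    """Total capacity to purchase, as a multiple of the required capacity."""
    return nodes * per_node_share(nodes, tolerated_failures)
```

For two nodes, `total_capacity(2)` is 2, matching an active/backup pair. For four nodes, `per_node_share(4)` is 1/3 and `total_capacity(4)` is 4/3, as stated above.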
Other Costs of Active/Active Networks
The hardware, staffing, and site
requirements necessary to build an
active/active network are substantially the same
as those required to build an active/backup
system. However, there are additional
costs that must be considered.
• Redundant and independent replication
links must be provided to connect the
nodes in the application network.
• A replication engine must be licensed.
• Additional licensing costs may be
incurred since duplicate hardware and
software are being used simultaneously
rather than one set being used only for
standby purposes.
• Applications may have to be modified to
make them active/active ready.
• Distributed system management tools
may have to be licensed and staff trained
in their use.
• The new active/active environment must
be thoroughly tested and operation
procedures thoroughly documented and
practiced before putting the new system
into production.
However, these additional costs must
be evaluated in light of the downtime
costs that will be saved by going to a
continuously available environment. For
instance, if downtime costs a large company $100,000 per hour, and if the
active/active network can save eight hours of
downtime per year, savings in downtime
costs add up to $800,000 per year. This
can cover a major move to active/active.
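The same comparison can be run with an organization's own figures. In the sketch below, the function names are illustrative, and the `added_cost` argument (the total of the additional costs listed above) is a hypothetical input that must be supplied by the reader; no such figure is given here.

```python
def annual_downtime_savings(cost_per_hour: float,
                            hours_avoided_per_year: float) -> float:
    """Downtime cost avoided each year."""
    return cost_per_hour * hours_avoided_per_year


def payback_years(added_cost: float, annual_savings: float) -> float:
    """Years of savings needed to recover the added active/active cost.

    `added_cost` is a hypothetical, reader-supplied figure.
    """
    if annual_savings <= 0:
        raise ValueError("annual savings must be positive")
    return added_cost / annual_savings
```

With the figures in the example, `annual_downtime_savings(100_000, 8)` gives the $800,000 per year quoted above.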
Relationship to BC/DR
Once we get our active/active network
up and running and believe that we now
have truly continuous availability, can we
forget about business continuity and disaster recovery planning? Absolutely not!
For one thing, IT is only one element
in the BC/DR plan. The plan must cover
all business processes. For another, continuous availability is relative. It really
means that the probability of system
failure is extremely small – perhaps not
even measurable. But it is not zero. There
is some very small chance that an event
beyond our comprehension can take down
the entire system. Though this might not
happen for hundreds of years, it could
happen tomorrow.
Case in point – the recent botched antivirus
update by McAfee on April 21, 2010, took
down thousands of systems all over the
world. If we weren’t smart enough to roll
upgrades such as this through our system
one node at a time, we could have lost all
of the nodes in our active/active network.
BC/DR planning is as important as it
ever was.
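Rolling an upgrade through the network one node at a time, and stopping at the first sign of trouble, can be sketched as follows. The function `rolling_upgrade` and its two callbacks, `upgrade` and `is_healthy`, are hypothetical names standing in for whatever upgrade and health-check procedures a site actually uses.

```python
def rolling_upgrade(nodes, upgrade, is_healthy):
    """Upgrade nodes one at a time, halting at the first unhealthy node.

    `upgrade(node)` applies the change to a single node, and
    `is_healthy(node)` checks that node afterwards. Returns a tuple
    (upgraded_nodes, failed_node); failed_node is None if every node
    was upgraded successfully.
    """
    upgraded = []
    for node in nodes:
        upgrade(node)
        if not is_healthy(node):
            # Stop here: the remaining nodes still run the old version
            # and continue to carry the load.
            return upgraded, node
        upgraded.append(node)
    return upgraded, None
```

Had a faulty update been applied this way, only the first node would have been affected, and the other nodes would have kept the application available.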
Summary
The continuous availability of active/active
networks can bring many benefits to
an enterprise, from the savings of downtime cost to regulatory compliance, not to
mention avoiding the dreaded CNN moment when
the company makes international headlines following a major system failure.
There are many concerns that must be
evaluated such as the problems of data collisions, data loss, and split-brain mode; the
effort and risk associated with modifying
old legacy applications; and the additional
costs incurred. These concerns must be
balanced against the many advantages of
active/active networks such as the elimination of the cost of downtime, improved
user experience, and regulatory compliance. The bottom line is that many companies have found that their investment
in active/active technology has paid off
many-fold.
Dr. Bill Highleyman is the managing editor
of the Availability Digest (www.availabilitydigest.com),
which focuses on the technologies of continuous availability. He is
also co-author of the three-volume series
on active/active networks titled “Breaking the Availability
Barrier.”