[Zope] Static fail-over

Derek Simkowiak dereks@realloc.net
Mon, 21 Jan 2002 17:41:46 -0800 (PST)


-> This might be hard to do with a load-balancer setup alone, because
-> the balancer would need to know when to fail over (and not just in
-> the case of Zope stopping to respond);

	Please, let's get our terms straight: a "load-balancer" would
direct web traffic to nodes according to some heuristic.

	A fail-over system is one where the service "fails over" to the
backup when a problem is detected.  It does not balance anything, least of
all load.

	It's worth noting that every "load-balanced" cluster [I've ever
seen] had some kind of heartbeat test that, when failed, would remove a
node from the cluster (so it doesn't get any more traffic).  If you only
have two nodes, this would be similar to a fail-over system, but *not* the
same: a particular node would be "failed out", but no "fail over" would
occur, because there would be no backup system taking "over" the
responsibilities of a primary system.  Instead, you'd just have a failed
node in your cluster.
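	The "fail a node out after missed heartbeats" idea can be sketched
roughly like this (a toy illustration only -- node names and thresholds
are invented, and this is not code from Heartbeat or Failsafe):

```python
# Toy sketch: track consecutive missed heartbeats per node and drop a
# node from the traffic rotation once it misses too many in a row.
# The threshold and node names are illustrative assumptions.

def record_heartbeat(failures, node, healthy, max_misses=3):
    """Update the consecutive-miss count for `node`; return True
    while the node should stay in the traffic rotation."""
    if healthy:
        failures[node] = 0
    else:
        failures[node] = failures.get(node, 0) + 1
    return failures[node] < max_misses

failures = {}
record_heartbeat(failures, "node1", healthy=True)   # passing: stays in
record_heartbeat(failures, "node1", healthy=False)  # one miss: still in
```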


-> lines of IP takeover software like Heartbeat (linuxha.org) or Failsafe
-> (oss.sgi.com) that makes the "safe" hot-backup Zope box takeover the
-> identity of the "corrupted" primary Zope box, assuming that you have a
-> monitoring setup on the backup to audit the integrity of your primary Zope
-> service/data.

	I'm highly interested in any real-world, in-production
load-balanced or fail-over systems for Zope (esp. using Open Source
software).


-> I would suggest a combination of Squid in front and heartbeat or Linux
-> Failsafe on Zope boxes (either independent ODB or with ZEO) nodes in this
-> case.

	Is there a way to use URL-rewriting rules in Apache (with
mod_rewrite) to test whether a particular box is alive, and only if so,
direct traffic there?  Maybe have it check whether a particular file
exists (or some such)?
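	One way this could look (a sketch only -- the file name, addresses,
and ports are invented): have an external monitor touch a flag file on the
proxy box while the primary answers its heartbeat, and use RewriteCond's
-f file test to pick the target:

```apache
# Sketch: a monitor script touches /var/run/zope-primary.up while the
# primary passes its heartbeat, and removes it when the check fails.
# File name, addresses, and ports below are examples only.
RewriteEngine On

# Primary is alive: proxy everything to it.
RewriteCond /var/run/zope-primary.up -f
RewriteRule ^/(.*)$ http://192.168.0.1:8080/$1 [P,L]

# Otherwise fall through to the backup box.
RewriteRule ^/(.*)$ http://192.168.0.2:8080/$1 [P,L]
```

	(The [P] flag hands the request to mod_proxy, per the note below.)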

	(Also note that Apache will do much of what Squid will do using
mod_proxy.)


-> You could use ZEO+ExternalMount or ZSyncer to copy content to your second
-> Zope if you had 2 autonomous Zopes.  Squid (or any other reverse-proxy or
-> load-balancer) would then obey any IP address takeover from a hot-backup
-> node happening via gratuitous/unsolicited ARP.

	I.P. address takeovers are dangerous.  What if the Zope processes
die, but the O.S. is okay?  You'll have an I.P. address conflict unless
you can run a script on the primary box that tells it to shut down its
network interface.  And what if the hardware locks up, loses all its
resources, or gets into a loop of some kind?  The NIC will still respond
to its I.P. address, but you can't run the script to disable it.  Bad
situation--pray you have a watchdog card for those Zope processes.

	MAC address takeovers are somewhat dangerous, because the switch
you are connected to (such as at a Data Center) may not recognize the
MAC address takeover if the NIC on the primary box is still responding
(as above).

	I prefer solutions that keep all nodes (primary, backup, or any
peer nodes) behind a NAT.  Each node gets its own 192.168.0.x I.P.
address, and the NAT box does all failover.  You've now moved the I.P.
takeover problem to the NAT box (with its backup), but since NAT is in the
kernel (under Linux, at least) you'd be hard-pressed to find a NAT box
that could respond to an ICMP or serial-port ping but not do NAT.  If the
kernel is running, it's running, and if it's not, it's not.
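	For what it's worth, that NAT-side failover could be sketched with
iptables (addresses and ports are invented; this is a sketch, not a tested
config):

```shell
# Sketch only: public VIP 1.2.3.4 is DNATed to whichever node is live.
# Addresses and ports are examples.  Runs on the NAT box.

# Normal operation: send port-80 traffic to the primary node.
iptables -t nat -A PREROUTING -d 1.2.3.4 -p tcp --dport 80 \
    -j DNAT --to-destination 192.168.0.1:8080

# On failover, the monitor replaces rule #1 to point at the backup:
#   iptables -t nat -R PREROUTING 1 -d 1.2.3.4 -p tcp --dport 80 \
#       -j DNAT --to-destination 192.168.0.2:8080

# Masquerade outbound traffic from the private nodes for the return path.
iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
```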



--Derek