[Zope] system down - how to prevent?

Roel Van den Bergh roel@planetinterior.com
Wed, 5 Mar 2003 13:34:02 +0100


Hi all.

We have been working with Zope for over a year now and we like it.
It has been a long time since I had to call upon you people to help me out.

Yesterday, our production server went down the hard way.
While someone was editing some objects in the ZMI, everything crashed: not
only Zope, but the entire machine.

We are running Red Hat 7.3, Apache and Zope 2.5.1 + CMF 1.3 on a Dell
PowerApp 120 (dual PIII 1 GHz, 1 GB RAM, RAID 5), so we thought we were
rather safe, especially since our hosting provider was supposed to back up
Data.fs.

After the crash the server would not start up again, reporting 'memory
failure'. After several retries nothing works at all anymore.
The machine is still under warranty, so the hardware itself is no real
problem.

Now comes the funny part.
There doesn't seem to be any backup of Data.fs.
Checking our system logs from before the incident revealed this:

Feb 14 02:54:08 piwebserver Retrospect[27997]: FSGetNodeInfo: lstat failed on "/home/zope/2-5-1/var/Data.fs", error 75
Feb 14 02:54:08 piwebserver Retrospect[27997]: FSGetNodeInfo: lstat failed on "/home/zope/2-5-1/var/Data.fs.old", error 75
Feb 14 02:54:12 piwebserver Retrospect[27995]: connTcpConnection: invalid code found: 111
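
One way to read the numbers in those log lines is to look them up as OS error codes. The following is only a minimal sketch: it assumes that Retrospect reports plain errno values (the log does not say so), and the helper name describe_errno is made up for illustration. The mapping is platform-specific, so it should be run on the Linux server itself.

```python
import errno
import os


def describe_errno(code):
    """Return (symbolic name, message) for an OS error number on this platform."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


# Assumption: 75 and 111 in the Retrospect log lines are plain errno values.
for code in (75, 111):
    name, message = describe_errno(code)
    print(code, name, message)
```

On Linux, 75 is normally EOVERFLOW and 111 is ECONNREFUSED, but whether that is what the backup client actually means is a guess.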

Now they tell us their backup program can connect to our server, but the
Data.fs file cannot be backed up because it is locked / in use.

Our firewall is open to the backup program:
$IPT -A tcp_inbound -p TCP -s 111.111.111.111 --destination-port 497 -j ACCEPT
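
To confirm that the rule above really lets traffic through, the port can be probed from the backup host. This is a rough sketch in present-day Python, not something from the original setup: the function name port_is_open is hypothetical, and it only checks that a TCP connection can be opened, not that the backup client behind the port is healthy.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable host and timeouts alike.
        return False
```

Run from the machine at the allowed source address, a call such as port_is_open("piwebserver", 497) would show whether the port from the firewall rule is reachable; the host name is taken from the log lines and may need to be replaced by the server's real address.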

How come? We can manually create a copy of the file.
Has anyone had these problems, and how did you solve them?
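
Since a manual copy works, one possible workaround is to copy Data.fs to a separate directory on a schedule and let the backup program pick up the copy instead. The sketch below is written in present-day Python and rests on assumptions: the function name snapshot and the destination directory are hypothetical, and it does not coordinate with Zope in any way, so a copy taken during a write may end in an incomplete transaction.

```python
import os
import shutil
import tempfile


def snapshot(src, dest_dir):
    """Copy src into dest_dir under its own name, replacing any older copy."""
    os.makedirs(dest_dir, exist_ok=True)
    final = os.path.join(dest_dir, os.path.basename(src))
    # Copy to a temporary file first so the backup never sees a half-written copy.
    fd, tmp = tempfile.mkstemp(dir=dest_dir, prefix=".snapshot-")
    os.close(fd)
    try:
        shutil.copy2(src, tmp)
        os.replace(tmp, final)  # atomic when tmp and final share a filesystem
    except BaseException:
        if os.path.exists(tmp):
            os.remove(tmp)
        raise
    return final
```

For example, snapshot("/home/zope/2-5-1/var/Data.fs", "/home/zope/backup") could be run from cron shortly before the nightly backup; the source path is the one from the log, while /home/zope/backup is only a placeholder.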

Secondly, we are investigating how to prevent downtime of the production
server in the future.
I had a quick peek at ZEO but I'm a bit lost there.
What is the minimum setup needed to keep a production site alive (not
necessarily with the same specs)?
As far as I can tell you need at least three machines to keep your site
alive:
a 'load balancer', a 'client' and a 'server'. Could this be narrowed down
to two machines?

And what if the actual 'ZEO server' goes down?

TIA, WKR, Roel