[Zope] ZODB data loss - help required

Thu Dec 5 06:42:07 CET 2013

I've just had a very disturbing and odd experience with data loss in the
ZODB. I've been running Zope servers for 15 years and had never seen this
until two previous incidents a few months ago. The sequence of events for
those incidents was vague, so I put it down to user error, but now that it's
happened a third time on a different site I'm getting worried.

Plone 4.2.0.1 (4206)
CMF 2.2.6
Zope 2.13.15
Python 2.7.3 (default, Feb 6 2013, 01:00:51) [GCC 4.4.6 20120305 (Red Hat
4.4.6-4)]
PIL 2.0.0 (Pillow)

The site had been working perfectly for 4 months, with many updates and
no performance problems. I'm now faced with the horrible job of
explaining this to the client, and I don't quite understand it myself.

*Sequence of events*
The Plone site was launched to production in September and has been running
ever since. It has been restarted several times since then with minor
product updates.

Today I needed to do a yum update and then I restarted Zope using an init
script that stops Zope & ZEO and restarts them.

When Zope came up the Plone site was now showing data from August 5.
Somehow everything between August 5 and December 5 has been lost.

A scan of the transactions via the Undo Log or using Data.fs tools shows
transactions up to August 5, then nothing until December 5.

From the logs I can see that the server itself was rebooted on August 5 and
there was also a copy of the Data.fs made on that date.

We have not packed the database since that date as it was not large or
growing fast.

All backups also show the August 5 data - which means the file system
copies of the Data.fs were simply copying the old version of the database.

I've cloned the file system and kept the original untouched, although in the
first hour or so after the restart I did several file copies and moves, as I
was unaware of the gravity of the situation.

I've tried extundelete and debugfs, and neither of them detects any deleted
Data.fs file that can be recovered.

*Questions*
My main aim is to work out what happened. My best guess is that Zope was
somehow connecting to a stale or outdated file pointer and updating that
file all along while the Data.fs was pointing to an August 5 copy. But how
could this situation eventuate and persist for so long?
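
To make that guess concrete, here is a minimal sketch of the mechanism I
have in mind. It is not a diagnosis of what happened on this server, and it
uses a throwaway temp file named Data.fs as a stand-in rather than a real
ZODB storage. The function name simulate_stale_handle is made up for
illustration. It assumes a POSIX file system, where a process that already
has a file open keeps writing to the original inode even after the path has
been replaced by a copy.

```python
import os
import shutil
import tempfile


def simulate_stale_handle():
    """Show how writes can land in a file that the path no longer names."""
    workdir = tempfile.mkdtemp()
    path = os.path.join(workdir, "Data.fs")       # stand-in, not a real ZODB file
    backup = os.path.join(workdir, "Data.fs.copy")
    try:
        with open(path, "wb") as f:
            f.write(b"august\n")
        shutil.copy(path, backup)                 # the "August 5" copy

        handle = open(path, "ab")                 # long-running server process
        os.rename(backup, path)                   # path now names a different inode
        handle.write(b"september-to-december\n")  # goes to the old, unlinked inode
        handle.flush()

        # Compare the inode behind the open handle with the inode behind the path.
        same_inode = os.fstat(handle.fileno()).st_ino == os.stat(path).st_ino
        links_left = os.fstat(handle.fileno()).st_nlink
        handle.close()                            # last reference to the old inode

        with open(path, "rb") as f:
            visible = f.read()                    # what a restart or backup sees
        return same_inode, links_left, visible
    finally:
        shutil.rmtree(workdir)
```

In this sketch the path only ever shows the "august" bytes, so anything that
reads the path (a backup copy, or the server after a restart) would see the
old data, while the newer writes exist only as long as the original handle
stays open.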

The odd thing is that we had 2 very similar incidents on 2 different Zope
servers a few months ago but both resulted in almost no data loss as the
timeframes were shorter and I dismissed them as some odd user error.

We have recently moved most of our Zope servers to Linode. Could it be
their file system? Or could it be the new way we set up the buildouts and
init scripts?
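
One check that might help narrow this down on the other servers while they
are still running is to look for open files that have already been unlinked.
The following is only a sketch, assuming Linux, where /proc/<pid>/fd lists a
process's open descriptors and the kernel appends " (deleted)" to the link
target of an unlinked file. The helper name deleted_open_files is my own,
and reading another user's process normally needs matching permissions or
root.

```python
import os


def deleted_open_files(pid):
    """Return link targets of descriptors of `pid` whose file was unlinked.

    Linux-only: relies on /proc/<pid>/fd, where the kernel appends
    " (deleted)" to the target of an unlinked but still open file.
    """
    fd_dir = "/proc/%d/fd" % pid
    found = []
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # descriptor was closed while we were scanning
        if target.endswith(" (deleted)"):
            found.append(target)
    return found
```

Calling it with the pid of the ZEO or Zope process and seeing a Data.fs
entry in the result would suggest that the process is writing to a file the
path no longer points at.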

Any clues at all would be welcomed.

-- 
*Tom Cameron*
Technical Director

*Mooball IT*