[Zope] Non-responsive objects reprise

Sun Nov 13 12:20:13 EST 2005

Hello everyone, this is from an older thread which I'm resurrecting
with more information.

Despite Dieter's helpful pointers I'm no closer to solving this
problem but do have more information about it in case anyone can lend
a hand.

To quickly recap: periodically, when visiting our Zope site, certain
objects appear not to respond. It's consistently the same objects,
ranging from a Page Template in one folder to an image somewhere
else. The site is running on Zope 2.8.1 and Python 2.3.5, sitting
behind a VHM and Apache 2.0.46 using the usual rewrite rules. The
problem started suddenly several months ago, after the site had been
running smoothly for many months. This is all on Red Hat Enterprise
Linux ES release 3 (Taroon Update 2). The server is a dual-processor
machine with 1GB RAM and 300GB of hard disk space, hosted by
Rackspace. The site is relatively large and reasonably active. Its
content is largely made up of Page Templates with a few supporting
Python scripts and Script (Python) objects. There are also a few
ZClass-based objects that offer no unique functionality other than
providing an interface for the admins to create "News" or "Feature"
items. The site also uses a MySQL database.

I've noticed the following things about this problem:

=================
1) DeadlockDebugger shows no problems when one of the objects appears
not to be responding.  Everything appears normal.

2) I can ALWAYS successfully get to the non-responsive objects by
bypassing Apache and directly viewing the Zope server's equivalent
:8080 address.

3) While tailing the trace.log when an object is seizing up through
Apache, I can see the request come to Zope and go right back out with
no problem. I think that's what this is illustrating:

B -1348776468 2005-11-13T10:46:37 GET /VirtualHostBase/http/www.domain.org:80/portal/html/VirtualHostRoot/resources/contact
I -1348776468 2005-11-13T10:46:37 0
A -1348776468 2005-11-13T10:46:38 200 14938
E -1348776468 2005-11-13T10:46:38

4) Turning on debugging output for Apache shows the following proxy
errors when trying to access an offending object. I've searched for
related information about this proxy error and found only one hit, on
the ZODB-DEV list from 2004, with no responses. The errors:

[Sat Nov 12 00:33:33 2005] [error] [client xx.xx.xx.xx] proxy: error reading status line from remote server localhost
[Sat Nov 12 00:33:33 2005] [error] [client xx.xx.xx.xx] proxy: Error reading from remote server returned by /contact
[Sat Nov 12 00:34:02 2005] [error] [client xx.xx.xx.xx] proxy: error reading status line from remote server localhost
[Sat Nov 12 00:34:02 2005] [error] [client xx.xx.xx.xx] proxy: Error reading from remote server returned by /resources/index_html

I removed the client IP.  Keep #2 and #3 in mind in the context of this problem.

5) In case there was something in one of the templates that was
screwing things up, I methodically removed portions of a page (or its
inherited template).  When the page suddenly started responding
through Apache I thought I hit paydirt, but then I noticed in one
instance that all I removed was a block of plain HTML (no METAL/TALES
statements) and that put me back at square one.  I think #2 and #3
make this point irrelevant, and certain images will get hung up, too.

6) The server is also running Mailman (using the same Python as Zope).
It uses a separate virtual host container in Apache to expose its
administrative interface. One of my co-workers swears that when he
experiences the seizing, he soon afterwards gets several emails from
one of the Mailman lists, which is supposed to be a once-a-day
broadcast-only list. I think this is more of a coincidence, though,
and I haven't gotten a big enough sample of occurrences to rely on
this report.

7) Restarting Zope *usually* corrects the problem, though on Friday
restarting it several times didn't help.

8) Restarting Apache sometimes corrects the problem without needing to
restart Zope.

9) On one occasion, killing Mailman suddenly made one of the offending
objects respond for a little while, then stop again.

10) On rare occasions we have had to physically reboot the server
(as on Friday).

11) After the server was rebooted on Friday, memory usage for Zope
went from about 3% to 20+% as reported by 'top' over a period of about
12 hours.  I don't know whether that is indicative of a leak or just
general memory consumption. Restarting Zope appears to return that
memory back to the OS.  This memory usage is what we normally see for
this site.

12) Upgrading from Zope 2.7.6 to 2.8.1 appeared to help for a little
while, but the problem either came back or never left.

13) I briefly enabled mod_disk_cache in Apache for this site in case
Zope was getting too stressed out. It appeared to work wonders, but
some file objects, like PDFs, would periodically be reported as
corrupted by Acrobat after being downloaded. I assume this was a
failure to configure mod_disk_cache appropriately, and we've since
disabled it (at which point Acrobat stopped complaining about
corrupted PDFs). The seizing problem looked as though it disappeared
while mod_disk_cache was enabled. Indeed, watching the Apache and
Zope logs showed requests more often being fulfilled by Apache alone
than by Zope. Perhaps the proxy problems in #4 are indicative of a
loaded Zope that needs caching. We are not running ZEO or anything
like that. Perhaps we should.
=================
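
For anyone who wants to repeat the check in #3 over a whole trace.log
rather than by eye, here is a minimal sketch. It assumes each entry
sits on one line in the form shown in #3 (a code letter B/I/A/E, a
request id, a timestamp, then optional details), and that the first
field after "A" is the HTTP status. The function name scan_trace and
the second, unfinished request in the sample are made up for
illustration. It is written for a current Python 3, not the Python
2.3.5 on the server.

```python
from datetime import datetime

STAMP = "%Y-%m-%dT%H:%M:%S"  # timestamp layout seen in the trace.log excerpt


def scan_trace(lines):
    """Pair B (begin) and E (end) entries by request id.

    Returns (finished, pending): finished is a list of records for
    requests that reached E; pending maps request id to records for
    requests that began but never reached E.
    """
    pending = {}
    finished = []
    for line in lines:
        parts = line.split(None, 3)
        if len(parts) < 3 or parts[0] not in ("B", "I", "A", "E"):
            continue  # blank or unrecognized line
        code, req_id, stamp = parts[:3]
        rest = parts[3].strip() if len(parts) > 3 else ""
        when = datetime.strptime(stamp, STAMP)
        if code == "B":
            pending[req_id] = {"request": rest, "begin": when, "status": None}
        elif req_id in pending:
            if code == "A":
                # Assumption: first field after the timestamp is the status.
                pending[req_id]["status"] = rest.split()[0] if rest else None
            elif code == "E":
                record = pending.pop(req_id)
                record["seconds"] = (when - record["begin"]).total_seconds()
                finished.append(record)
    return finished, pending


# First four lines are from the excerpt in #3; the last line is a
# hypothetical request that never finishes, added for illustration.
SAMPLE = """\
B -1348776468 2005-11-13T10:46:37 GET /VirtualHostBase/http/www.domain.org:80/portal/html/VirtualHostRoot/resources/contact
I -1348776468 2005-11-13T10:46:37 0
A -1348776468 2005-11-13T10:46:38 200 14938
E -1348776468 2005-11-13T10:46:38
B -1348776000 2005-11-13T10:46:40 GET /hypothetical/unfinished
"""

finished, pending = scan_trace(SAMPLE.splitlines())
for record in finished:
    print(record["status"], record["seconds"], record["request"])
for req_id, record in pending.items():
    print("never finished:", req_id, record["request"])
```

If every request for an offending object shows up under "finished"
with a 200 status, that would support the reading in #3 that Zope is
answering and the response is being lost somewhere between Zope and
Apache.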

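To back up the claim that it is consistently the same objects, the
proxy errors from #4 could be tallied per path. This is only a
sketch: it assumes each Apache error_log entry is a single line in
the format quoted in #4, and the names PROXY_ERROR and
tally_proxy_errors are invented here. It is written for a current
Python 3.

```python
import re
from collections import Counter

# Matches the "returned by <path>" entries quoted in #4. The companion
# "error reading status line" entries carry no path and are skipped.
PROXY_ERROR = re.compile(
    r"^\[(?P<when>[^\]]+)\] \[error\] \[client [^\]]+\] "
    r"proxy: Error reading from remote server returned by (?P<path>\S+)"
)


def tally_proxy_errors(lines):
    """Count proxy read errors per requested path."""
    counts = Counter()
    for line in lines:
        match = PROXY_ERROR.match(line)
        if match:
            counts[match.group("path")] += 1
    return counts


# The four entries from #4, each joined onto one line.
ERROR_SAMPLE = [
    "[Sat Nov 12 00:33:33 2005] [error] [client xx.xx.xx.xx] proxy: error reading status line from remote server localhost",
    "[Sat Nov 12 00:33:33 2005] [error] [client xx.xx.xx.xx] proxy: Error reading from remote server returned by /contact",
    "[Sat Nov 12 00:34:02 2005] [error] [client xx.xx.xx.xx] proxy: error reading status line from remote server localhost",
    "[Sat Nov 12 00:34:02 2005] [error] [client xx.xx.xx.xx] proxy: Error reading from remote server returned by /resources/index_html",
]

for path, count in tally_proxy_errors(ERROR_SAMPLE).most_common():
    print(count, path)
```

Run against the full error_log, a short list of paths with high
counts would confirm that the failures cluster on particular objects
rather than being spread evenly across the site.
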
Apologies for the long email but I have no idea what's going on... if
ANYONE has ANY suggestions or ideas on what else I could investigate
it would be GREATLY appreciated!

Thank you!

Garth

On 8/17/05, Dieter Maurer <dieter at handshake.de> wrote:
> Garth B. wrote at 2005-8-16 19:21 -0400:
> > ...
> >When I hit an offending
> >folder, I simply get no response and the browser just waits and waits.
>
> Visit "Control_Panel --> Debug Information" and check whether
> your request indeed does not finish.
> You see the active requests at the bottom of this page.
>
> If this page shows you, that the request was finished by Zope,
> then you hit a wide spread browser bug:
>
>   In some cases, Zope returns an 204 (no content) response.
>   For unknown reasons (and against the HTTP spec),
>   most browsers treat this as "continue to wait".
>
> If the request is not finished, you can use
> Florent's "DeadlockDebugger" to find out where your
> request is spinning.
>
> --
> Dieter
>