[Zope-dev] Re: [Zope] highly available Zope thread; our hanging problem

Brian Takashi Hooper brian@garage.co.jp
Wed, 07 Jun 2000 00:20:31 +0900


On Tue, 6 Jun 2000 15:19:29 +0200 
Marcus Collins <mcollins@sunesi.com> wrote:

> Hi,
> 
> I'd like to comment on this, and summarise some references below. Much of
> this discussion took place on the zope-dev list (see references), so I'm
> cc'ing the zope-dev list. You might also wish to add to the Wiki:
> http://www.zope.org/Members/tseaver/Projects/HighlyAvailableZope/.
OK, that's a good suggestion!

> 
> > -----Original Message-----
> > From: Brian Takashi Hooper [mailto:brian@garage.co.jp]
> > Sent: 06 June 2000 12:11
> > To: zope@zope.org
> > Subject: [Zope] highly available Zope thread; our hanging problem
> > 
> > Hi all -
> > 
> > I was looking at the discussion from April that was posted on the
> > HighlyAvailableZope Wiki about problems with Zope hanging; we had a
> > similar situation here at Digital Garage which seemed to be alleviated
> > by changing the zombie_timeout to be really short (like, 1 minute). 
> > Before changing the zombie_timeout, the server would periodically hang
> > and not give any responses to requests, sometimes recovering after a
> > short time.
> 
> Some questions at this point:
> 1. Were you running with multiple threads, and if so, how many?
Yes; Zope is set to run with 16 threads (-t 16), and we've increased the
pool_size parameter in ZODB/DB.py to 16 also (guess this is all
right... :-P )
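
Just to illustrate why we keep -t and pool_size in step (this is only a
toy sketch in plain Python, not Zope or ZODB code, and the names are made
up for the example): if there are more worker threads than pooled
connections, the extra threads simply block waiting for a connection to
come back to the pool.

# Illustrative only -- not Zope or ZODB code.  With more worker threads
# than pooled "connections", the extra threads block in pool.get().
import queue
import threading
import time

POOL_SIZE = 4      # stands in for pool_size in ZODB/DB.py
NUM_THREADS = 16   # stands in for the -t value passed to Zope

pool = queue.Queue()
for i in range(POOL_SIZE):
    pool.put("connection-%d" % i)          # placeholder connections

def handle_request(request_id):
    conn = pool.get()                      # blocks if the pool is empty
    try:
        time.sleep(0.1)                    # pretend to publish an object
        print("request %d served over %s" % (request_id, conn))
    finally:
        pool.put(conn)                     # always return the connection

threads = [threading.Thread(target=handle_request, args=(n,))
           for n in range(NUM_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

With pool_size matching -t, each worker thread can check out its own
connection instead of waiting on the pool.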

> 
> 2. If you were using multiple threads, would *all* the threads periodically
> hang, or was the hanging isolated to a single thread at a time?
All the threads hang.  One interesting thing: we looked at vmstat, and
whenever the system is having trouble the number of system calls drops
dramatically.  When the server is doing well it's normally up in the
1000s per second, but when it's in trouble there are only 20-30 system
calls per second, and they're all either lwp_* or poll calls.
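
For what it's worth, here is roughly the kind of thing one could run to
timestamp the vmstat output so the drops line up with the hang periods
afterwards.  This is only a sketch, not what we actually ran, and it
deliberately doesn't try to parse vmstat's columns since the layout
varies between systems:

import subprocess
import sys
import time

# Run "vmstat 5" indefinitely and prefix every output line with a
# timestamp, so low system-call intervals can be matched against hangs.
proc = subprocess.Popen(["vmstat", "5"], stdout=subprocess.PIPE, text=True)
try:
    for line in proc.stdout:
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        sys.stdout.write("%s  %s" % (stamp, line))
        sys.stdout.flush()
except KeyboardInterrupt:
    proc.terminate()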

> 
> 3. Could you possibly comment on the operating system used?
Solaris 2.6, on Netras.  Our Zope is still version 2.1.4.

> 
> 4. Which zombie_timeout did you twiddle -- the one in the zhttp_channel in
> ZServer.py, or that in http_channel in medusa/http_server.py?
The one in zhttp_channel.  As far as I can tell, since zhttp_channels
are actually used instead of http_channels, the value in zhttp_channel
is the one that matters.  The kill_zombies method, and the code that
calls it, are inherited from the medusa code... kill_zombies looks at
the timeout value of all the channels in the select list, and since all
of those instances happen to be zhttp_channels in the case of Zope, they
all use the zhttp_channel timeout.
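
To make that concrete, here is a rough paraphrase (from memory, for
illustration only; the attribute names may not match the 2.1.4 source
exactly) of the medusa-style zombie check that zhttp_channel inherits:

import asyncore
import time

class sketch_channel(asyncore.dispatcher):
    # In medusa this lives on http_channel; zhttp_channel overrides the
    # class attribute, which is why its value is the one that matters.
    zombie_timeout = 19 * 60          # seconds

    def __init__(self, sock=None):
        asyncore.dispatcher.__init__(self, sock)
        self.creation_time = int(time.time())

    def kill_zombies(self):
        # Walk every channel asyncore knows about and close any channel
        # of this class that has outlived its zombie_timeout.
        now = int(time.time())
        for channel in list(asyncore.socket_map.values()):
            if channel.__class__ is self.__class__:
                if (now - channel.creation_time) > channel.zombie_timeout:
                    channel.close()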

> 
> > At this point, I don't have anything more than just an empirical
> > observation - changing this parameter seemed to help our server.  Has
> > anyone else noticed anything similar, or can explain this observation?
> 
> Concerning the zombie_timeout suggestion, here are some references when I
> posed the question of whether reducing the value would be beneficial:
> 
> Amos Lattier wrote in
> http://lists.zope.org/pipermail/zope-dev/2000-April/004194.html:
> > The ZServer zombie stuff is to get rid of zombie client 
> > connections, not zombie publishing threads. These are quite 
> > different beasts.
> 
> Michel Pelletier wrote in 
> http://lists.zope.org/pipermail/zope-dev/2000-April/004229.html:
> > What the Zombie timeout means is that after a publishing thread gets
> > done answering a request, the socket may not go away.  This may be for a
> > number of reasons: the client 'hung' and is not 'putting down the phone
> > after the conversation is over' (so to speak), or network troubles may
> > prevent the connection from closing properly.  This means that there is
> > a 'zombie' connection laying around.  This zombie will probably end up
> > going away on its own, but if not, ZServer will kill it after a period
> > of time.
> > 
> > The only resource lying around during the life of a zombie is a tiny
> > little unused open socket; the Mack truck of a Zope thread that served
> > the request for the zombie socket does not 'hang' for that entire period
> > of time, but goes on after it has completed the request to serve other
> > requests.
> > 
> > Amos is correct in that these problems are almost always at the
> > Application level, and not at the ZServer level.  The fact that Pavlos
> > can prevent hanging by inserting a print statement in the asyncore loop[*]
> > is suspicious, but we do not have enough information yet to point
> > fingers anywhere.
> 
> [* references http://lists.zope.org/pipermail/zope/2000-April/023697.html]
Yeah, I saw this... like I said, I haven't gathered enough information
yet to be able to say anything that sounds like an explanation; all I
have is a vague experimental observation.

I found out about the mpstat command on Solaris; I didn't know about it
before.  It gives you info on thread activity and multiprocessor
behavior, so maybe I can get some more info from that.

Hmm.

--Brian Hooper