[Zope] mystery of the server hang solved?!

Kyler B. Laird laird@ecn.purdue.edu
Mon, 09 Jul 2001 16:34:20 -0500


For quite a while, I've been having problems with
our Zope server just becoming unresponsive.
Usually I note that one Python process is
monopolizing one of the CPUs.  I turned on the
detailed logging and Apache's server status, but
I still couldn't pin down the problem.

Last week it got better.  I didn't have to
restart the system nearly as often.  This
coincided with the conclusion of a proposal
submission process that runs as part of a
system that I suspected was the culprit.

Things got so quiet that I decided to try the
upgrade to 2.4.0b3 again.  I succeeded enough
that I moved it into place over the weekend.

Today I ran into one of our support people who
I noticed was frequently on when the server
went bad.  He had even complained that he was
having a hard time getting to know Zope with it
going down all the time.  (Hint!)  I mentioned
that I had changed some things and it seemed
more stable, and invited him to give it another
try.

A few minutes later, I was working with someone
else when the server became unresponsive.  A
quick check of Apache's status showed that the
support person I'd just been talking with had
several processes waiting for a response.

Ah ha!

Well...I restarted and went digging.  He didn't
have much there.  There wasn't even a Python
Script.  But then I noticed a little 'H'...in
the icon for his standard_html_footer.

I had modified the PUT_factory so that when
text/html is uploaded, an HTMLDocument is
created.  Apparently standard_html_footer was
created in this way.  It was trying to wrap
itself! 

I thought that there were safeguards against
such recursion, but they didn't seem to catch
this.  It'd be nice if HTMLDocument would
verify that it's not calling itself.
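
The kind of check wished for above could look roughly like the
sketch below.  It is plain Python, not Zope or HTMLDocument code:
render(), the documents mapping and RecursiveIncludeError are all
hypothetical names made up for illustration.  The idea is only to
remember which documents are currently being rendered and to refuse
to re-enter one of them.

```python
class RecursiveIncludeError(Exception):
    """Raised when a document ends up including itself (hypothetical)."""


def render(name, documents, active=()):
    """Render `name`, expanding its includes, refusing to re-enter a document.

    `documents` is a hypothetical mapping: name -> (body, [included names]).
    `active` is the chain of documents currently being rendered.
    """
    if name in active:
        # We are already in the middle of rendering this document.
        chain = " -> ".join(active + (name,))
        raise RecursiveIncludeError(chain)
    body, includes = documents[name]
    parts = [body]
    for included in includes:
        parts.append(render(included, documents, active + (name,)))
    return "".join(parts)
```

With a guard like this, a footer that wraps itself fails right away
with an error naming the chain, instead of tying up a process.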

I'm thrilled to have found the problem.  We've
only got about 30 authors right now.  I suspect
it's going to get a lot tougher when we add a
few thousand.  I need to get ready.  (I'm
planning to kill requests that take too long to
complete.)
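
One way to cut off a request that runs too long is sketched below.
This is an assumption about how it might be done, not what was
actually deployed: it relies on the Unix-only SIGALRM timer and only
works in the main thread of a process, so a threaded server would
need a different mechanism.  run_with_timeout() and RequestTimeout
are hypothetical names.

```python
import signal


class RequestTimeout(Exception):
    """Raised when a handler exceeds its time budget (hypothetical)."""


def _on_alarm(signum, frame):
    # Runs in the main thread when the timer fires, interrupting the handler.
    raise RequestTimeout("request exceeded its time budget")


def run_with_timeout(handler, seconds):
    """Call handler(); raise RequestTimeout if it runs longer than `seconds`."""
    previous = signal.signal(signal.SIGALRM, _on_alarm)
    signal.setitimer(signal.ITIMER_REAL, seconds)
    try:
        return handler()
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)  # cancel any pending alarm
        signal.signal(signal.SIGALRM, previous)  # restore the old handler
```

A handler that finishes in time returns normally; one stuck in a
busy loop is interrupted once its time budget runs out.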

I appreciate all the help I've received here in
my attempt to track down this problem.

--kyler