[Zope] ZEO troubles on RedHat EL4 Linux

Willi Langenberger wlang at wu-wien.ac.at
Thu Aug 18 21:22:14 EDT 2005


According to Tim Peters:
> I don't know.  Dieter asked whether you ran the tests via "zopectl
> test", but I didn't see an answer to that.

OK, here are some data points...

  bender:~/Zope-2.7.7-final$ cat /proc/version 
  Linux version 2.6.9-11.ELsmp (bhcompile at decompose.build.redhat.com) (gcc version 3.4.3 20050227 (Red Hat 3.4.3-22)) #1 SMP Fri May 20 18:26:27 EDT 2005

  bender:~/Zope-2.7.7-final$ python2.3 
  Python 2.3.5 (#1, Apr 19 2005, 14:53:39) 
  [GCC 3.4.3 20041212 (Red Hat 3.4.3-9.EL4)] on linux2
  Type "help", "copyright", "credits" or "license" for more information.

Running one single test:

  bender:~/Zope-2.7.7-final$ python2.3 test.py testConnection checkNoVerificationOnServerRestart\$
  Running unit tests from /home/wlang/Zope-2.7.7-final/lib/python
  ======================================================================
  ERROR: checkNoVerificationOnServerRestart (ZEO.tests.testConnection.FileStorageReconnectionTests)
  ----------------------------------------------------------------------
  Traceback (most recent call last):
    File "/home/wlang/Zope-2.7.7-final/lib/python/ZEO/tests/ConnectionTests.py", line 121, in tearDown
      os.waitpid(pid, 0)
  OSError: [Errno 10] No child processes

  ----------------------------------------------------------------------
  Ran 1 test in 0.689s

  FAILED (errors=1)

After some retries, the same test passes:

  bender:~/Zope-2.7.7-final$ python2.3 test.py testConnection checkNoVerificationOnServerRestart\$
  Running unit tests from /home/wlang/Zope-2.7.7-final/lib/python
  ----------------------------------------------------------------------
  Ran 1 test in 0.691s

  OK

Interestingly, if I run the test under strace, I never see the test
fail (I tried at least 30 times):

  bender:~/Zope-2.7.7-final$ strace -e trace=signal -o /var/tmp/zeotest.trc python2.3 test.py testConnection checkNoVerificationOnServerRestart\$
  Running unit tests from /home/wlang/Zope-2.7.7-final/lib/python
  ----------------------------------------------------------------------
  Ran 1 test in 0.710s

  OK

(Obviously a Heisenberg effect -- the observation influences the
behaviour ;-)

If anyone is interested in the trace file -- it can be found at:

  http://slime.wu-wien.ac.at/misc/zeotest.trc

(However, it would be way more interesting to see the syscalls while
the test is failing...)

Also, I stepped through the whole test with the Python debugger.
Unfortunately (as with strace), I was not able to reproduce the failure
in the debugger.
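
To put a number on how often the test fails outside strace and the
debugger, something like the following could be used. It is only a
sketch: count_failures is a made-up helper, it needs a much newer
Python than the 2.3 that runs the tests (subprocess.run), and it
assumes a failing run can be recognised by the word "FAILED" in the
output, as in the transcript above.

```python
import subprocess
import sys


def count_failures(cmd, runs):
    """Run cmd `runs` times; count the runs whose output contains 'FAILED'."""
    failures = 0
    for _ in range(runs):
        # Merge stderr into stdout so the test summary is captured either way.
        proc = subprocess.run(cmd, stdout=subprocess.PIPE,
                              stderr=subprocess.STDOUT)
        if b"FAILED" in proc.stdout:
            failures += 1
    return failures
```

It would be called from the Zope directory, e.g.
count_failures(["python2.3", "test.py", "testConnection",
"checkNoVerificationOnServerRestart$"], 30).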

> the ZEO tests spawn processes directly via Python's
> os.spawnve(), and later waits for them to end, via the waitpid() code
> shown earlier.  It doesn't muck around with signals, forks, or
> anything else that should be platform-dependent (the same ZEO-test
> process code is used on both Linux and Windows, BTW -- for this
> reason, it can't rely on any fancy signal or process gimmicks;
> spawnve+waitpid is the entire story here).

Yes, it's as simple as that: ZEO is started, ZEO is stopped, and when
the parent calls waitpid, we get the "No child processes" error most of
the time :-(
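
For reference, a minimal sketch of the spawn-then-wait pattern (in
modern Python, not the 2.3 used above), together with one well-known
way to get ECHILD ("No child processes"): waiting a second time for a
child that has already been reaped. Whether anything comparable
happens in the ZEO tests is an open question, not something the output
above shows; spawn_and_wait_twice is a made-up name.

```python
import errno
import os
import sys


def spawn_and_wait_twice():
    """Spawn a short-lived child, reap it, then try to reap it again."""
    # P_NOWAIT makes spawnve return the child's pid immediately.
    pid = os.spawnve(os.P_NOWAIT, sys.executable,
                     [sys.executable, "-c", "pass"], os.environ)
    # The first wait reaps the child and returns its exit status.
    _, status = os.waitpid(pid, 0)
    try:
        # The second wait finds no such child any more.
        os.waitpid(pid, 0)
    except OSError as exc:
        return status, exc.errno
    return status, None


first_status, second_errno = spawn_and_wait_twice()
```

Here second_errno comes back as errno.ECHILD, the same error as in the
traceback, because the child no longer exists by the time of the
second wait.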

Any ideas what we can try to narrow this down?

> All the failures you showed were in test teardown.  If that's all the
> failures you got, then all the test bodies actually passed.  Of course
> you have to be wary that normal methods of detecting child-process
> termination aren't working as hoped on this box, because all the test
> failures you reported were exactly failures to detect child-process
> termination.

Sure -- we could just make this change:

  bender:.../ZEO/tests$ diff ConnectionTests.py.ori ConnectionTests.py
  121c121,124
  <                 os.waitpid(pid, 0)
  ---
  >                 try:
  >                     os.waitpid(pid, 0)
  >                 except OSError:
  >                     pass

then all tests will pass. But then we will not know why the zeo zombie
vanishes before the waitpid can reap the exit code ;-)
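
A slightly narrower variant of that workaround would swallow only the
"No child processes" case and let every other OSError through. This is
a sketch in modern Python, not a patch against ConnectionTests.py, and
wait_ignoring_missing_child is a made-up name:

```python
import errno
import os


def wait_ignoring_missing_child(pid):
    """Wait for pid; return its status, or None if it was already reaped."""
    try:
        return os.waitpid(pid, 0)[1]
    except OSError as exc:
        if exc.errno != errno.ECHILD:
            raise  # any other failure is still reported
        return None
```

This still hides the symptom rather than explaining it, but at least
an unrelated waitpid error would not be silenced as well.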


\wlang{}

PS: I'm afraid it will turn out to be a Python thread / signals / race
problem -- yuck!

-- 
Willi.Langenberger at wu-wien.ac.at                Fax: +43/1/31336/9207
Zentrum fuer Informatikdienste, Wirtschaftsuniversitaet Wien, Austria

