[Zope] Serious write-performance issue with multiple threads.

Arnar Lundesgaard arnar.lundesgaard@creuna.no
Mon, 17 Jun 2002 20:15:55 +0200


Hi,

For some time now we have had trouble with write performance on two ZOPE
sites we have in operation. The first one has been in production since
early April, while the second is being opened to the public in a few
days. They are both based on our internally developed TopicMap engine
(built on top of CMF). (To be OpenSourced when we have some available
time to finish it up.)

The problem has been scarily easy to reproduce, and manifests itself
whenever several people are working (writing articles etc.). As long as
only one person is working on the site, it has the expected performance:
not very fast, a few seconds per write operation, but acceptable.
However, as soon as more people log in and start publishing, we often
experience hangs that may last for 10-15 minutes on write operations
that usually take only a few seconds to perform. In this situation Zope
doesn't respond to any requests.

Fortunately, visitors to the first site are served by the cache (squid on
the same machine), and reads from ZOPE have ranged from very fast to
reasonably quick.

Although we do not have any other ZOPE sites with the same write-load as
these sites, the TopicMap engine has naturally been our main suspect.
(Load on the first site is about the same as 'zope.org' but with much
higher peaks.) It seemed obvious that something in the design of our
TopicMap engine was triggering this.

We have been reviewing our own code and have made many optimizations
that we expected to yield significant speedups. While they did make the
site faster on average, write operations still often resulted in long
hangs when multiple users were working on either of the sites.

We followed the procedure in
http://www.zope.org/Members/4am/debugspinningzope. The hang is indeed a
spinning process (thread) as it always uses 99.9% CPU when Zope stops
responding; attaching to the main process in a hang situation and
looking at the responsible thread invariably shows it to be in
chunk_free() in libc somewhere downstream from pickleCache. Zope will
usually "unhang" itself after about 10-15 minutes of spinning.

Mr. Kromer's comments on ZOPE and SMP in the 'system requirements' thread
a couple of weeks ago gave us a few clues, which we followed. We tried
binding the ZOPE instance to one CPU using the affinity patch for Linux
2.4, but that did not help either. We then tried disabling one CPU,
suspecting SMP trouble, but still no go.

So slowly ZOPE became the suspect. We tried different numbers of threads:
20, 10, 4, 2... Then we tried running with only one thread, and the
"hang" problems vanished!
What's more, we are finally getting the performance we expected. It
works very well now, even under relatively high load and a lot of write
activity.
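
For anyone who wants to compare write latency under different numbers of
concurrent writers, here is a minimal sketch of a client-side timing
harness. It is not what we ran; the names time_concurrent_writes and
fake_write are hypothetical, and fake_write is only a stand-in that would
have to be replaced by a real publish request against the site. The number
of ZOPE worker threads itself is set on the server and is not changed by
this script.

```python
import threading
import time


def time_concurrent_writes(write_op, n_threads, writes_per_thread):
    """Call write_op from n_threads threads; return each call's latency in seconds."""
    latencies = []
    lock = threading.Lock()

    def worker():
        for _ in range(writes_per_thread):
            start = time.perf_counter()
            write_op()
            elapsed = time.perf_counter() - start
            # Guard the shared list so concurrent appends stay consistent.
            with lock:
                latencies.append(elapsed)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return latencies


def fake_write():
    # Hypothetical stand-in for one write/publish request (assumption).
    time.sleep(0.01)


if __name__ == "__main__":
    # Compare the worst-case latency for 1, 2 and 4 simultaneous writers.
    for n in (1, 2, 4):
        results = time_concurrent_writes(fake_write, n, 5)
        print(n, "writers: max latency %.3f s" % max(results))
```

A hang of the kind described above should show up as a maximum latency
far out of proportion to the single-writer case.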

The hardware and software of the two sites:

Site 1:                               Site 2:
  A dual Xeon 1.2 GHz                   A dual Xeon 1.0 GHz
  2 GB RAM                              512 MB RAM
  RAID 1+5                              RAID 1+5
  CVS ca. ZOPE 2.5.0                    ZOPE 2.5.1
  CVS ca. CMF 1.2                       CMF 1.2
  Python 2.1.2 (PThreads)               Python 2.1.3
  Linux 2.4.14 (XFS)                    Linux 2.4.18 (ext3)

Python is compiled with thread support and large file support for both
sites. The database for site 1 is approx. 450 MB newly packed, ca.
220,000 objects; site 2 isn't live yet and is much smaller (~100 MB packed).

We're experiencing the same problems on both sites.

The second site will be going live in a few days. Due to its design and
requirements we are not able to cache as much content. This forces us to
look for a different solution; the one-thread option won't cut it; too
many requests will have to go outside the cache...

Currently we are running this on a single thread, but we expect that to
"kill us" once it is opened to the general public :(

A possible workaround is of course to run two different ZOPE instances
with a ZEO backend. One with multiple threads for reads and visits, the
other with a single thread for writing and publishing. This is perhaps
the ideal solution, but we are loath to make untested changes to the
production environment just before going live.
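
To make the two-instance idea concrete, here is a minimal sketch of the
routing rule a front end would need. Everything in it is an assumption:
the backend addresses are hypothetical, and it presumes that writes can
be recognised by HTTP method or by a management URL, which may not hold
for every write in a CMF site.

```python
# HTTP methods treated as writes in this sketch (assumption).
WRITE_METHODS = {"POST", "PUT", "DELETE"}

# Hypothetical addresses of the two ZOPE instances sharing one ZEO backend.
READ_BACKEND = "http://127.0.0.1:8080"   # multi-threaded, serves visitors
WRITE_BACKEND = "http://127.0.0.1:8081"  # single-threaded, serves editors


def pick_backend(method, path):
    """Return the backend URL that should handle this request."""
    if method.upper() in WRITE_METHODS or "/manage" in path:
        return WRITE_BACKEND
    return READ_BACKEND
```

In practice this decision would live in the proxy in front of ZOPE
rather than in Python; the function only illustrates the split.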

Secondly, this feels like a bug in ZOPE or Python, and if it is, we would
like to track it down.

What we are looking for is information on threading in Python/ZOPE/ZODB,
other people's experiences, workarounds, etc.


Regards,
  Arnar Lundesgaard


----------------------------------
phone: (+47) 982 38 036
mailto:arnar.lundesgaard(a)creuna.no
Creuna as
Bryggegata 3
NO-0250 Oslo
phone office: (+47) 23 23 88 00
fax: (+47) 23 23 88 50
http://www.creuna.no/