[ZWeb] ZCatalog Issues

Jim Fulton jim at zope.com
Tue Jul 13 08:39:29 EDT 2004


Shane Hathaway wrote:
> On Friday 09 July 2004 16:43 pm, Michael Bernstein wrote:
> 
>>2. Attempting to re-index the ZCatalog (by clicking the 'Update Catalog'
>>button in the 'Advanced' tab) causes a timeout or an error message.
>>There are over 179k objects, but it is my understanding that ZCatalog
>>should scale to that many objects fairly easily. I am given to
>>understand that Casey and Shane have been working on fixing this issue.

Note that:

- scalability is hard

- There are a number of dimensions on which to evaluate scalability

Reindexing time increases pretty dramatically with catalog size.
For better or worse, I don't think it has ever been a priority
to make reindexing an entire catalog happen without the browser
timing out. One reason is that this is generally a very
infrequent operation.  It ought to be possible to rebuild indexes
separately, but I don't know if we ever did this.
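
If someone wants to experiment, an external method along these lines
ought to rebuild a single index without touching the others.  This is
an untested sketch against the Zope 2.7 internals
(catalog._catalog.paths, Catalog.getIndex, and the PluggableIndex
index_object API); the catalog path and the index name are made-up
examples:

    # Untested sketch: rebuild one index in place.
    def rebuild_index(self, index_name='modified'):
        catalog = self.unrestrictedTraverse('Catalog')  # example path
        cat = catalog._catalog
        index = cat.getIndex(index_name)
        index.clear()
        n = 0
        for rid, path in cat.paths.items():
            obj = self.unrestrictedTraverse(path, None)
            if obj is None:
                continue  # stale catalog entry
            index.index_object(rid, obj)
            n = n + 1
            if n % 1000 == 0:
                # subtransaction commit keeps memory bounded
                get_transaction().commit(1)
        return '%d objects reindexed into %s' % (n, index_name)

The subtransaction commits are there to keep the modified-object set
from piling up in memory; the run still commits or aborts as a whole.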

> 
> Well, here is what I've learned so far from analyzing the zope.org catalog.  
> Maybe this will help Casey.  Maybe others can help, too.
> 
> I exported the zope.org catalog as a .zexp and wrote a utility that roughly 
> analyzes .zexp files.  It took several hours, but since an export operation 
> does not unpickle the objects, it finished successfully.
> 
> The total size of the .zexp is 340 MB and it contains 572,231 objects.  Here 
> is a breakdown of the sizes of the objects:
> 
> 214387 objects        0-63 bytes
> 115033 objects       64-255 bytes
> 202881 objects      256-1023 bytes
>  30591 objects     1024-4095 bytes
>   5700 objects     4096-16383 bytes
>   3434 objects    16384-65535 bytes
>    194 objects    65536-131071 bytes
>     11 objects   131072-1048575 bytes
>      0 objects  1048576-2147483647 bytes
> 
> I decided to first look at the largest objects in detail.  I was happy to see 
> there are no 1 MB objects, but there are two 500K objects and nine objects 
> between 128K and 200K in size.  Each of those 11 objects is either an 
> IOBucket, an IOBTree, or an IISet.  At least three of them unintentionally 
> contain large, fully-rendered HTML pages (presumably because some indexed 
> object generates HTML for the given attributes.)
> 
> Note that zope.org currently has its per-connection database cache size set to 
> 23,000 objects.  The catalog cannot fit in that space, and even if it did, 
> we'd run out of memory. 

Please explain why you think this matters.  In normal usage, the catalog should
not be loaded into memory in its entirety.  Are you concerned about indexing or
about searching?


> The box has 2 GB, and between two app servers, there
> are eight connections.  Each connection maintains its own copy of the 
> database.  340 MB is probably a low estimate of the catalog's full resident 
> unpickled size, but I'll use it anyway: keeping this catalog in memory would 
> take at least 340 MB * 8 = 2.7 GB.  That's also ignoring the size of other 
> objects loaded from the database connections.

Given that you are considering more than one database connection, I assume you
are worried about searching, not indexing ....

> So should we pile RAM into the box and boost the cache size to 600,000?  I 
> think that would be unwise.  I've seen evidence that the time spent managing 
> expiration in the ZODB cache rises exponentially with the number of objects 
> in the cache. 

Well, I suppose this is technically true, if

- We are loading lots of objects (and thus need to remove a lot of
   objects from the cache), and

- There are lots of objects in the LRU that are not, in fact,
   ghostifyable. IMO, it's a shame we have non-ghostifyable objects
   in the LRU.

> Flushing a cache containing 20,000 objects can take minutes,

Huh? This makes no sense.  Flushing objects just frees their
state.  This should not take minutes.  If this is reproducible,
we ought to do some profiling to figure out what the heck is
going on.
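
For instance, an untested external-method sketch like this would give
a first data point (self._p_jar is the request's ZODB connection, and
cache_non_ghost_count comes from the cPickleCache):

    import time

    def time_cache_flush(self):
        # Untested: time a full flush of this connection's cache.
        conn = self._p_jar
        count = conn._cache.cache_non_ghost_count  # objects with state
        start = time.time()
        conn.cacheMinimize()  # ghostify everything that can be
        return 'flushed %d objects in %.2f seconds' % (
            count, time.time() - start)

If that really takes minutes at 20,000 objects, it should then be
straightforward to narrow down where the time is going.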


> and flushing a cache containing 60,000 objects can take an hour. 

Ditto, but more so....

> Also, it's a bit difficult to work on this because it's all in C.

And because it's horribly overcomplicated.  I'll have more to
say on this one of these days on zodb-dev.

> It seems like this catalog contains simply too many objects.  A third of them 
> are very small (less than 64 bytes including the class name); I wonder if we 
> could combine some of these. 

Interesting.  I wonder what these are.
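
A cheap way to find out, without unpickling anything, would be to
scan the .zexp for pickle GLOBAL opcodes and histogram the class
names.  Here's a rough, untested sketch; it's a heuristic scan of the
raw bytes, not a real pickle parse, so expect a little noise:

    import re, sys

    # The 'c' (GLOBAL) opcode is followed by a module name and a
    # class name, each terminated by a newline.
    GLOBAL = re.compile(r'c([A-Za-z_][\w.]*)\n([A-Za-z_]\w*)\n')

    def class_histogram(path):
        counts = {}
        data = open(path, 'rb').read()  # fine for a few hundred MB
        for module, name in GLOBAL.findall(data):
            key = '%s.%s' % (module, name)
            counts[key] = counts.get(key, 0) + 1
        return counts

    if __name__ == '__main__':
        items = class_histogram(sys.argv[1]).items()
        items.sort(lambda a, b: cmp(b[1], a[1]))  # most frequent first
        for key, n in items[:20]:
            print '%8d %s' % (n, key)

My guess is that most of the tiny records are BTree buckets and sets
from the indexes, but a histogram like this would settle it.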

> I think I'll next try to find out how many of
> the objects are in text indexes and lexicons.

This is a tough and painstaking analysis. Good luck.

Some things I'd look for:

- sorting

   If we are doing lots of sorted searches, that could cause lots of
   meta-data to be loaded.  I suspect that sorting on application attributes,
   such as modification time, is the most common case of catalog abuse
   (see the sketch just after this list).

- Too much meta-data.

- Maybe too many indexes

   I think a common problem in Zope sites is that they have a single catalog
   that is used for a wide variety of independent searches.  I think that it
   would be more efficient in many cases to keep separate catalogs geared toward
   separate kinds of searches.
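
To illustrate the sorting point above, compare two ways of getting
"most recently modified documents".  Untested; 'Catalog' is an
example name, and this assumes the catalog has a sortable index on
bobobase_modification_time:

    # Good: sort inside the catalog, using an index it already has.
    # Only index data and meta-data records get loaded.
    results = self.Catalog(meta_type='Document',
                           sort_on='bobobase_modification_time',
                           sort_order='descending')

    # Bad: sort in application code on an attribute of the real
    # objects.  getObject() wakes one persistent object per hit.
    brains = self.Catalog(meta_type='Document')
    objs = [b.getObject() for b in brains]
    objs.sort(lambda a, b: cmp(b.bobobase_modification_time(),
                               a.bobobase_modification_time()))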

I think it would be interesting to analyze:

- What sorts of searches people are doing and how much time they take.
   Is there an option to turn on elapsed time in the regular hit log?  If not,
   there should be.  (A crude patch for this is sketched below.)

- For searches that take a lot of time, analyze how many and what sorts of
   objects are loaded to accomplish the searches.
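
Lacking a hit-log option, a monkey patch applied from a small product
would do for a start.  An untested sketch against the Zope 2.7
ZCatalog (the logger name is an example; note that query parameters
passed in via REQUEST rather than as keywords won't show up in kw):

    import time, logging
    from Products.ZCatalog.ZCatalog import ZCatalog

    _original = ZCatalog.searchResults
    log = logging.getLogger('catalog.timing')  # example name

    def searchResults(self, REQUEST=None, **kw):
        start = time.time()
        try:
            return _original(self, REQUEST, **kw)
        finally:
            log.info('%s %.3fs %r', '/'.join(self.getPhysicalPath()),
                     time.time() - start, kw)

    ZCatalog.searchResults = searchResults
    # __call__ was aliased to searchResults at class-creation time,
    # so patch it too (check that this holds in your version).
    ZCatalog.__call__ = searchResults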

In summary, if a catalog is being used *properly*, only a small fraction
(decreasing with increasing catalog size) of the catalog should be
loaded at any point in time.  I fear we make catalog abuse too easy though.

> There is a bit of good news: zope.org is not consuming gobs of RAM due to a 
> memory leak.  I wrote a small Python C extension that uses mallinfo() to 
> reveal how much heap a Python process is actually using for objects, which is 
> often much smaller than the process size as the operating system sees it.  
> Whenever I flush the caches in Zope, its heap usage shrinks to less than 10% 
> of its process size.  That means most of the memory is consumed by 
> reclaimable ZODB objects.  (I'll post the C extension on the web if anyone is 
> interested.)

But, over time, is the size it shrinks to constant?  Or is it increasing?
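
One way to check without the C extension: flush everything
periodically and log how many live Python objects remain.  If the
floor creeps upward across samples, something is holding references.
An untested sketch (db().cacheMinimize() flushes every connection's
cache; the log path is an example):

    import gc, time

    def sample_floor(self, log_path='/tmp/floor.log'):
        # Flush all ZODB caches, then record the number of live
        # Python objects as a crude proxy for the heap floor.
        self._p_jar.db().cacheMinimize()
        n = len(gc.get_objects())
        open(log_path, 'a').write('%f %d\n' % (time.time(), n))
        return n

Your mallinfo() extension would give the same curve in bytes, which
is better; this just needs no compiler.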

Jim

-- 
Jim Fulton           mailto:jim at zope.com       Python Powered!
CTO                  (540) 361-1714            http://www.python.org
Zope Corporation     http://www.zope.com       http://www.zope.org

