[Zope-CMF] Are there scaling issues with the Catalog?

sean.upton@uniontrib.com sean.upton@uniontrib.com
Sat, 24 Aug 2002 14:32:18 -0700


Keep in mind you may run into other limitations besides the catalog with
this many documents/objects.  That's not bad news, just a warning...  I can
say that I have a system that stores and indexes (using Catalog) ~350k
objects in a folder (these objects are proxies to data
records stored in an RDB).  I use mainly FieldIndexes but a few TextIndexes,
and the amount of text content for the TextIndexes is small.  With this high
number of documents and/or other content objects, you need to make sure that
your container(s) can scale.  For example, after a few thousand items,
ObjectManager-inherited methods like _getObject() and _setObject() get very
expensive; in this case, my workaround was to bypass the ObjectManager
methods (and thus other things like Zope security mechanisms) and use
BTreeFolder's _getOb() and _setOb() methods, which read and write objects
directly from the BTree.
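
To illustrate why the container matters, here is a minimal sketch in plain
Python.  It is not Zope code: ListContainer, MappingContainer, fill(),
set_ob() and get_ob() are hypothetical names, and a dict stands in for the
BTree.  The only point is that a container which scans a flat sequence does
more work per call as it grows, while a keyed mapping does not.

```python
class ListContainer:
    """Hypothetical container keeping (id, object) pairs in a flat list."""

    def __init__(self):
        self._items = []

    def set_ob(self, id, ob):
        # Linear scan to reject duplicate ids: cost grows with item count.
        for existing_id, _ in self._items:
            if existing_id == id:
                raise KeyError(id)
        self._items.append((id, ob))

    def get_ob(self, id):
        # Linear scan again on every lookup.
        for existing_id, ob in self._items:
            if existing_id == id:
                return ob
        raise KeyError(id)


class MappingContainer:
    """Hypothetical container using a keyed mapping (dict stands in for a BTree)."""

    def __init__(self):
        self._tree = {}

    def set_ob(self, id, ob):
        if id in self._tree:
            raise KeyError(id)
        self._tree[id] = ob

    def get_ob(self, id):
        return self._tree[id]


def fill(container, n):
    """Add n dummy objects named item0 .. item(n-1) and return the container."""
    for i in range(n):
        container.set_ob("item%d" % i, i)
    return container
```

Timing fill(ListContainer(), n) against fill(MappingContainer(), n) for a
few values of n (with timeit, say) should show the gap widening as n grows;
the actual numbers will depend on the machine.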

My hunch is that Catalog will scale, if the app is done right.  However, I
think the place that could get the most scary is a mass re-build of an
entire index with this many items, not because of Catalog, but because of
lack of mass-scalability of the methods that most people index.  For example,
say you were indexing a Word doc, and had to extract the text every time via
some method called docAsPlainText(), and that this called a Word-to-text
import filter like wvware; now imagine, if you will, calling that filtering
process thousands of times.  This is the kind of thing that the Catalog has
to go through.  Have lots of RAM (I mean LOTS!) if you are going to do
this, and be patient.  The other thing worth knowing is that the Catalog will
automatically use subtransactions every n*1000 objects (where n is a number
I can't remember at the moment) as you reindex, without you explicitly
requesting them (Catalog keeps a counter for each transaction, and
increments it every time you index an object).

Subtransactions swap out portions of the transaction to disk so they don't
chew through memory, aiding scalability, but dropping performance (search
Zope.org for subtransactions for more detail).  This is not all that
tweakable, though, since Catalog calls subtransactions for you even if you
don't want to use them.
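
The counter behaviour described above can be sketched roughly as follows.
This is an illustration, not the Catalog's actual code: CountingIndexer, its
threshold argument and the checkpoint callback are made-up names, and the
threshold value used below is arbitrary.

```python
class CountingIndexer:
    """Hypothetical indexer that checkpoints every `threshold` objects."""

    def __init__(self, threshold, checkpoint):
        self.threshold = threshold    # objects indexed between checkpoints
        self.checkpoint = checkpoint  # callable standing in for a subtransaction commit
        self._count = 0               # objects indexed since the last checkpoint
        self.indexed = []             # stand-in for real index structures

    def index_object(self, uid):
        self.indexed.append(uid)
        self._count += 1
        if self._count >= self.threshold:
            # In Zope 2 of this era the real call would be something like
            # get_transaction().commit(1); here it is just a callback.
            self.checkpoint()
            self._count = 0


checkpoints = []
indexer = CountingIndexer(1000, lambda: checkpoints.append(len(indexer.indexed)))
for uid in range(3500):
    indexer.index_object(uid)
```

With a threshold of 1000 and 3500 objects, the callback fires after the
1000th, 2000th and 3000th object; the remaining 500 would go out with the
final commit.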

Long story short: I think search/query to catalog will perform quite well
for you, but you may want to consider carefully how you will index/re-index
your documents as part of your design (a few at a time is likely not a
problem; I'm mainly speaking of a bulk reindex).  I think, in theory, if you
had some amazing magic, you could distribute the load of a bulk reindex with
parallel indexing via ZEO to scale this a bit.  If I understand correctly,
some indexes might auto-generate ids for objects in their BTrees, so if you
could coordinate your ZEO clients to be reindexing objects that likely land
in different buckets, you might be okay, though you would still have to
handle conflicts.  This, however, may or may not be possible, and even if it
is, it's way too tricky for a mere mortal like me: dealing gracefully with
massive amounts of conflict errors, understanding index design well enough
to coordinate two boxes to be statistically likely to write to different
buckets (assuming that is possible), network/ZEO socket latency, and lots
more bumps, I think.

Good luck.  I would suggest making sure any other scalability issues in your
app (like scaling containers/folders, if you need to) are out of the way
before taking on Catalog scalability.  Once you get to that
point, though, the best thing to do is try a bulk index of 100k+ objects and
see where you get.  I'm not sure you will get to where you want to be
without giving yourself some time and the patience to test and optimize a
bit.
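
A throwaway harness for that kind of test might look like the sketch below.
Everything in it is a stand-in: doc_as_plain_text() only collapses
whitespace where a real filter would do expensive conversion work, the
"index" is a plain dict of sets rather than a Catalog, and make_docs()
generates synthetic documents.  Swap in the real extraction method and real
objects to get meaningful numbers.

```python
import time


def doc_as_plain_text(doc):
    """Stand-in for an expensive conversion such as a Word-to-text filter."""
    return " ".join(doc["body"].split())


def bulk_index(docs, extract, index):
    """Index every doc into `index`; return (count, elapsed seconds)."""
    start = time.perf_counter()
    count = 0
    for doc in docs:
        # The extraction method runs once per object on every bulk reindex.
        for word in extract(doc).lower().split():
            index.setdefault(word, set()).add(doc["id"])
        count += 1
    return count, time.perf_counter() - start


def make_docs(n):
    """Build n small synthetic documents."""
    return [{"id": i, "body": "document number %d  about zope catalog" % i}
            for i in range(n)]


if __name__ == "__main__":
    index = {}
    count, elapsed = bulk_index(make_docs(100000), doc_as_plain_text, index)
    print("indexed %d docs in %.2fs" % (count, elapsed))
```

Watching memory use alongside the elapsed time is worthwhile, since RAM is
likely to be the first thing to run out.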

Good luck,

Sean

-----Original Message-----
From: J C Lawrence [mailto:claw@kanga.nu]
Sent: Saturday, August 24, 2002 1:59 PM
To: alan runyan
Cc: zope-cmf@zope.org
Subject: Re: [Zope-CMF] Are there scaling issues with the Catalog?


On Sat, 24 Aug 2002 10:58:32 -0500 
alan runyan <runyaga@runyaga.com> wrote:

> have you tried to simulate this?  

Not yet.  I wanted to ask first in case I was clearly heading into
trouble.

> and are you 'indexing' 2k of data per document? 

Yes.

> or just metadata associated with the document?  

The metadata will be relatively small in each case.

> I remember ken indexing hundreds of thousands of news posts and doing
> searches and being happy about his results.  

Got a pointer?  (That's pretty similar to my use)

> But if you go down this route you should probably do some due
> diligence.

Yup, that's why I'm asking up front.

> you will need tons of ram.  

Of course.

> use subtransactions on the initial loading.  

Mind explaining a bit?  (I haven't looked into the transactional
structure of the catalog yet).

> I would suggest ZEO and dshaw's Advance Site Setup howto.

Thanks.

-- 
J C Lawrence                
---------(*)                Satan, oscillate my metallic sonatas. 
claw@kanga.nu               He lived as a devil, eh?		  
http://www.kanga.nu/~claw/  Evil is a name of a foeman, as I live.

