[Zope-dev] Re: Zcatalog bloat problem (berkeleydb is a solution?)

abel deuring a.deuring@satzbau-gmbh.de
Tue, 26 Jun 2001 20:40:45 +0200


Hi all,

Giovanni Maruzzelli wrote:
> 
> We think that Abel is absolutely right:
> 
> if in the same almost empty folder we add and catalog an object with one
> word (and now we have optimized and reduced the number of indexes to 11) it
> makes a transaction of 73K, while if the object contains 300 words with the
> same other indexes or properties, the transaction is 224K, and if all is the
> same but the object contains 535 words the transaction is 331K.
> 
> And we are now using a catalog with only some 3000 documents indexed, with
> an average length of around 1K per document.

Well, Chris certainly knows more about the internals of ZCatalog than I
do, so we should not ignore his comments to my mail :)

Chris McDonough wrote:

> > If you now add a new document containing 5 of these frequent words, 5
> > larger BTrees will be updated. [Chris, let me know, if I'm now going to
> > tell nonsense...] I assume that the entire updated BTrees = 120000 bytes
> > will be appended to the ZODB (ignoring the less frequent words) -- even
> > if the document contains only 1 kB text.
> 
> Nah... I don't think so.  At least I hope not!  Each bucket in a BTree
> is a separate persistent object.  So only the sum of the data in the
> updated buckets will be appended to the ZODB.  So if you add an item to
> a BTree, you don't add 24000+ bytes for each update.  You just add the
> amount of space taken up by the bucket... unfortunately I don't know
> exactly how much this is, but I'd imagine it's pretty close to the
> datasize with only a little overhead.

OK, this made me curious, so I ran a test similar to Giovanni's.
I started with a ZCatalog containing 21616 records; the catalog has
only one text index, no keyword index, no field index. I copied one of
the indexed documents; the text is 2645 bytes long, and wc tells me that
it has 313 words. Next, I packed the database in order to have a "clean
starting point". After packing, Data.fs has a size of 233661963 bytes.

Then I cataloged the new object using my "lazy catalog". Since I have
only one new document, this is basically the same as using
CatalogAwareness. After indexing, the database has grown to 233851090
bytes -- an increase of 189127 bytes. Then I packed the database again,
resulting in a size of 233666237 bytes.

So the "net increase" is indeed only 233666237-233661963 = 4274 bytes, as
you expected, but the 189127 bytes appended before packing show that
quite a few more database records get updated along the way.
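
The arithmetic can be double-checked with a short Python sketch. The
variable names are mine and purely illustrative; the three sizes are the
Data.fs figures quoted above.

```python
# Data.fs sizes in bytes, taken from the measurements above.
size_packed_before = 233661963   # after the first pack
size_after_indexing = 233851090  # after cataloging the one new document
size_packed_after = 233666237    # after the second pack

# Bytes appended to Data.fs by cataloging the new document.
gross_increase = size_after_indexing - size_packed_before

# Bytes that remain once the second pack has run.
net_increase = size_packed_after - size_packed_before

# Bytes removed by the second pack.
reclaimed = size_after_indexing - size_packed_after

# How much larger the appended data is than what survives packing.
ratio = gross_increase / net_increase

print("gross increase:", gross_increase)
print("net increase:  ", net_increase)
print("reclaimed:     ", reclaimed)
print("gross/net:      %.1f" % ratio)
```

By this measure, cataloging the document appended roughly 44 times as
many bytes as survive packing.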

Abel