[Zope] ZCatalog Queries...

Andy Dawkins andyd@nipltd.com
Thu, 31 Aug 2000 16:49:56 +0100


>
> > > > It'd be nice if ZCatalog had a good general-purpose interface
> > > > and was a bit more robust (the BTree implementation, which has
> > > > been mentioned a few times, springs to mind here ;-)
> > >
> > > Can you be more specific?
> >
> > Andy can fill you in on the specifics.
>
> OK...

Specifics, hmmm OK

Machine 1: (2 GB free hard drive space, 64 MB memory, 128 MB swap, AMD Athlon
600, Red Hat Linux 6.1)

I created a new Zope instance and installed the ZMailIn product on it
(Catalog Aware).
I created a ZCatalog.
I sent 30,000 mail messages from zope@zope.org to the ZMailIn product.
These messages were catalog aware, so they indexed themselves one by one
into the catalog, indexing the entire message body.

I went home.  When I came back the next day the machine had successfully
completed 5,000 messages, eaten all the memory, eaten 99% of the swap and
was chugging away at 1 mail message per 5 seconds.
Additionally, the Data.fs was 1.5 GB.

I decided this was going to take forever and ended it all.
(FYI: After packing the database it shrank to 30 MB)
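
To put rough numbers on "forever", here is a back-of-the-envelope sketch.
It assumes the observed rate of 5 seconds per message would have stayed
constant for the remaining messages (it may well have got worse), and it
treats 1.5 GB as 1536 MB; both are assumptions, not measurements.

```python
# Rough estimate from the figures quoted above.
# Assumption: the rate of 5 s/message stays constant for the rest of the run.
TOTAL_MESSAGES = 30_000
DONE_MESSAGES = 5_000
SECONDS_PER_MESSAGE = 5

remaining = TOTAL_MESSAGES - DONE_MESSAGES
remaining_hours = remaining * SECONDS_PER_MESSAGE / 3600

# Assumption: 1.5 GB is taken as 1536 MB for the comparison.
UNPACKED_MB = 1.5 * 1024
PACKED_MB = 30
bloat_ratio = UNPACKED_MB / PACKED_MB

print(f"{remaining} messages left, about {remaining_hours:.1f} more hours")
print(f"unpacked Data.fs is about {bloat_ratio:.1f}x the packed size")
```

So at that rate the remaining 25,000 messages would have needed well over
another day, and the unpacked Data.fs was roughly fifty times its packed
size.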

I started again from nothing.
This time I didn't create a ZCatalog.
I ran the import routines and 30,000 documents were successfully imported
into the ZODB through ZMailIn in about 2 hours.
I created a catalog and manually added all items of type "ZMailMessage".
After about an hour of serious crunching, all memory and all swap had been
eaten, and it reported that we were out of hard drive space.
I tried again with various subtransaction sizes (i.e. 10000, 5000, 1000,
200), but nothing changed except the amount of time before it blew up.
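
A minimal sketch of what cataloging with a fixed subtransaction size amounts
to.  The names catalog_in_batches, catalog_one and commit are hypothetical
stand-ins for illustration only, as are the /archive/msgN paths; this is not
the ZCatalog or ZODB API, and the fake functions do no real indexing.

```python
def catalog_in_batches(items, catalog_one, commit, batch_size=1000):
    """Catalog (uid, obj) pairs, calling commit() after every batch_size items.

    catalog_one and commit are caller-supplied callables (hypothetical
    stand-ins for "index this object" and "commit a subtransaction").
    Returns the number of commits performed.
    """
    commits = 0
    pending = 0
    for uid, obj in items:
        catalog_one(obj, uid)
        pending += 1
        if pending >= batch_size:
            commit()
            commits += 1
            pending = 0
    if pending:
        # Flush the final, partially filled batch.
        commit()
        commits += 1
    return commits


# Stand-ins so the sketch runs on its own.
indexed = []
commit_log = []


def fake_catalog_one(obj, uid):
    indexed.append(uid)


def fake_commit():
    # Record how many objects had been indexed at each commit point.
    commit_log.append(len(indexed))


messages = ((f"/archive/msg{i}", {"body": f"message {i}"}) for i in range(30_000))
n_commits = catalog_in_batches(messages, fake_catalog_one, fake_commit, batch_size=1000)
print(n_commits, len(indexed))
```

A smaller batch size only limits how much uncommitted work is held at once.
It does not shrink the index being built, which may be why changing it only
moved the point of failure.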

After a couple of messages to zope@zope.org and zope-dev@zope.org we decided
that it might be worth another try without transactions, i.e. replacing the
ZODB3 with the Berkeley Database.

Machine 2: (4.5 GB free hard drive space, 256 MB memory, 256 MB swap, Dual
Pentium III 450, Red Hat Linux 6.2)

I created a new Zope instance and installed the ZMailIn product on it
(Catalog Aware).
I plugged in the BerkeleyDB.
I imported all 30,000 mail messages.
I selected all ZMailMessage instances for cataloging and let it go.
About half an hour later the thing crashed with the error "file too large"
and corrupted my Berkeley Database.
I tried several subtransaction sizes (10000, 5000, 1000, 200) to no avail.

The only time I ever got it to work was when I removed the body of the mail
message from the catalog's indexes, and the body is the most important thing
to index.

After tearing my hair out, I have put it aside and am working on something
else.

(I hope that is what Chris meant by Specifics)

>
> >
> > > What's insufficient about the current implementation?
> >
> > It doesn't scale well, especially for things where you have lots of new
> > data arriving (this is the BTree problem, I think...)
>
> Yes, we're still working on a "broadtree" implementation that may allay
> some of these problems, although I don't have an ETA.
>
> >
> > It has no published and well-defined query syntax (there are patches
> > here, bits there, but no definitive document on how to use it, how to
> > batch with it, how to perform complex and structured queries,
> > particularly with TextIndexes)
>
> Hopefully, the Zope book will make it more clear from a user
> perspective, and sometime in the very distant future I will be writing a
> chapter in the developer's guide about the catalog.  I agree that the
> ZCatalog wrapper should probably wrap more of the underlying catalog's
> methods, but these need to be rationalized, defined and then documented
> in the API docs.  This is something for dev.zope.org, probably.
>
> >
> > Don't get me wrong, it is very cool, but only kindof 70% there :S
> > (and I get the impression that doing the remaining 30% properly would
> > require a rewrite...)
> >
> > As an example, we've been trying to do Zope-based versions of the
> > mailing list archives for a coupla months now and the Catalog keeps
> > exploding in different ways (huge resource consumption, even for only
> > 30K messages or so, no matter what storage is used)
>
> Yes.  Tweaking and the broadtree stuff should make this a little better.
>
> >
> > Then there's the ubiquitous 'KeyError's and other associated weirdness,
> > all of which leaves me feeling a lot less than totally confident in the
> > Catalog ;-)
>
> These are independent of the coupling problem in the btree, and I'm
> trying to vanquish them now.  This is why the annoying logging code was
> added to the catalog.  It seems to be related to the TextIndex
> implementation, but I'm still trying to pin it down.
>