[Zope] ZCTextIndex - prefix wildcards not supported?

Andreas Jung lists at zopyx.com
Thu Jun 24 01:44:23 EDT 2004


TextIndexNG2 supports "*term" queries.
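
For readers wondering how a "*term" (left-truncated) query can be answered at all, here is a minimal sketch of one common technique: keep the lexicon sorted by reversed word, so a suffix match becomes a prefix range scan. This is only an illustration. It is not taken from TextIndexNG2, and the function names build_reversed_lexicon and match_left_truncated are hypothetical.

```python
import bisect


def build_reversed_lexicon(words):
    """Return the distinct words, each reversed, in sorted order.

    Reversing turns "ends with X" into "starts with reversed X",
    which a sorted list can answer with a range scan.
    """
    return sorted({word[::-1] for word in words})


def match_left_truncated(reversed_lexicon, pattern):
    """Return all words matching a pattern of the form '*term'."""
    if not pattern.startswith("*"):
        raise ValueError("expected a pattern of the form '*term'")
    key = pattern[1:][::-1]
    # Jump to the first reversed word that could start with the key.
    start = bisect.bisect_left(reversed_lexicon, key)
    matches = []
    for reversed_word in reversed_lexicon[start:]:
        if not reversed_word.startswith(key):
            break  # sorted order: no later entry can match
        matches.append(reversed_word[::-1])
    return sorted(matches)
```

With a lexicon built from "index", "reindex", "unindex", "indexing" and "catalog", the pattern "*index" would select the first three words. The cost of this approach is a second copy of the lexicon.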

-aj

--On Monday, 21 June 2004 9:49 AM -0400 Small Business Services 
<toolkit at magma.ca> wrote:

> Hi Casey,
>
> I am trying to implement your suggestion of accessing the '_docwords'
> structure in an attempt to eliminate duplicate storage of data in the
> ZCatalog.
>
> I have created a test external method to retrieve the _docwords entry for
> a specific object in an existing ZCatalog:
>
> def jtmp(self):
>    res = self.Catalog({'id' : '1086793690.85'})
>    for item in res:
>       rid = item.data_record_id_
>    return self.Catalog.getIndex('all_searchable_text').getEntryForObject(rid)
>
>
> Executing this external method gives me a Zope error:
>
> Traceback (innermost last):
>   Module ZPublisher.Publish, line 98, in publish
>   Module ZPublisher.mapply, line 88, in mapply
>   Module ZPublisher.Publish, line 39, in call_object
>   Module Products.ExternalMethod.ExternalMethod, line 224, in __call__
>    - __traceback_info__: ((<Folder instance at a063d58>,), {}, None)
>   Module /apps/zope/Extensions/jtmp.py, line 13, in jtmp
> AttributeError: getIndex
>
> I am confused (being a relative Python newbie) because 'getIndex' and
> 'getEntryForObject' are functions defined within the Catalog class, so
> shouldn't they be available?!
>
> Is there a better way to go about this?
>
> Thanks,
>
> Jonathan
>
>
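
A possible explanation for the AttributeError, offered as an assumption rather than a confirmed diagnosis: in Zope 2.6-era code the ZCatalog object is a wrapper, and 'getIndex' lives on the inner Catalog instance it keeps in its '_catalog' attribute. The sketch below assumes that layout. It also assumes the ZCTextIndex keeps its BaseIndex subclass in an 'index' attribute with a get_words(docid) method, and that the lexicon returned by getLexicon() has get_word(wid). Please check all of these names against your Zope version. The helper words_for_record is hypothetical.

```python
def words_for_record(zcatalog, index_name, rid):
    """Return the indexed words for one catalog record, or None."""
    # Assumption: the ZCatalog wrapper keeps the real Catalog in `_catalog`.
    text_index = zcatalog._catalog.getIndex(index_name)
    # Assumption: the ZCTextIndex stores its BaseIndex subclass as `index`
    # and hands out its lexicon through getLexicon().
    base_index = text_index.index
    lexicon = text_index.getLexicon()
    try:
        # Assumption: get_words() decodes the _docwords entry into word ids.
        word_ids = base_index.get_words(rid)
    except KeyError:
        return None  # this record id was never indexed
    return [lexicon.get_word(word_id) for word_id in word_ids]


def jtmp(self):
    """External method: look up one object and return its indexed words."""
    results = self.Catalog({'id': '1086793690.85'})
    if not results:
        return None  # avoids using an undefined rid when nothing matches
    rid = results[0].data_record_id_
    return words_for_record(self.Catalog, 'all_searchable_text', rid)
```

Reaching into underscore attributes ties the code to Zope internals, so it may break on upgrade.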
> ----- Original Message -----
> From: "Casey Duncan" <casey at zope.com>
> To: "Small Business Services" <toolkit at magma.ca>
> Sent: November 21, 2003 4:28 PM
> Subject: Re: [Zope] ZCTextIndex - prefix wildcards not supported?
>
>
>> On Fri, 21 Nov 2003 14:08:08 -0500
>> "Small Business Services" <toolkit at magma.ca> wrote:
>>
>> > The Zope Cache size is set at 10,000
>> >
>> > There are 1,985,183 objects in the 'database'
>>
>> Hmm, that's less than I would have thought.
>>
>> > Specifications for our update Linux box:
>> >
>> >    Zope 2.6.1
>> >    1 GHz PIII
>> >    1.25 GB RAM (PC133)
>> >    3 disks (IBM Ultrastar, SCSI, Ultra2 mode - 10,000 rpm, 4.5ms access)
>> >
>> > We are running the disks striped on a single controller, which gives us
>> > amazing read/write capacity.  We rarely run at full capacity on the
>> > disks.  We set the cache at the highest point possible (any higher and
>> > the machine swaps itself to death).
>>
>> I think you could definitely use more RAM. But that is a given pretty
>> much. How big is the Data.fs file when you're through indexing? How does
>> that compare to the size of the document corpus itself?
>>
>> Also I think you may want to try Zope 2.6.2. I made some changes to
>> ZCTextIndex in that version that could help performance. I would be
>> interested to hear if they help.
>>
>> [snip]
>> > We eventually came up with our current solution: at index time we
>> > compress the full-text and store it as binary data in the metadata
>> > table (getting this to work was a challenge in itself).  We then
>> > decompress and scan this data to locate the relevant 2-3 lines at
>> > retrieval time (it is far faster to decompress & scan metadata than
>> > to access the objects directly).
>>
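
The compress-at-index-time, scan-at-retrieval-time approach described above could look roughly like the sketch below. It is an illustration under assumptions, not the poster's code: it uses zlib and modern Python strings, and the names compress_text and matching_lines are hypothetical. How the binary blob is attached to the catalog metadata is left out.

```python
import zlib


def compress_text(text):
    """Compress the full text into bytes suitable for storing as metadata."""
    return zlib.compress(text.encode("utf-8"))


def matching_lines(blob, term, context=1):
    """Decompress a stored blob and return the first line containing
    `term` (case-insensitive) plus `context` lines on either side.

    Returns an empty list when the term does not occur.
    """
    lines = zlib.decompress(blob).decode("utf-8").splitlines()
    needle = term.lower()
    for position, line in enumerate(lines):
        if needle in line.lower():
            start = max(0, position - context)
            return lines[start:position + context + 1]
    return []
```

The scan only touches the stored blob, so no content object has to be loaded from the ZODB to build the excerpt.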
>> Using metadata tends to wake up far fewer objects, which can be a win.
>> Interestingly, ZCTextIndex actually stores a similar compressed word list
>> internally. The actual index object stored in ZCTextIndex has a _docwords
>> BTree which stores a compressed word list for each document. This is used
>> for unindexing and phrase matching. Look at the search_phrase method in
>> BaseIndex.py for more info.
>>
>> If you could use _docwords, you might be able to get rid of that
>> redundant data structure and the time it takes to build and store it.
>> Retrieval time should be on par with metadata.
>>
>> > Retrieval speeds for end users are excellent.  We have only been
>> > running into difficulties lately because of the size of the database.
>> > The update process now runs 24 hours per day for about 30 days
>> > (automating an update process that runs for 30 days was another
>> > exciting challenge!).  The fact that Zope can handle this volume of
>> > processing is a testament to its reliability and robustness!
>>
>> I'm concerned that it takes that long to index. 30 days is like a
>> millennium of processor time. I'm curious how big your transactions are
>> during index processing.
>>
>> I'm glad to see the retrieval speeds are good. What roughly is the
>> average document size?
>>
>> > We have been working with Zope for about 3 years and think that it is a
>> > FANTASTIC product!  We keep coming up with new things to use it for,
>> > it's great!
>> >
>> > Thanks in advance for any ideas you may have - we are open to any and
>> > all suggestions!
>>
>> Sounds like you have a very interesting application. I'd be very
>> interested to hear about it and possibly try to help make it faster if I
>> can.
>>
>> -Casey
>



Andreas Jung
zopyx.com - Software Development and Consulting Andreas Jung

