[Zope-dev] Shared lexicons for ZCTextIndex (was: Re: [Zope-Checkins] CVS: Zope/lib/python/Products/ZCTextIndex - ZCTextIndex.py:1.32)

Jim Fulton jim@zope.com
Thu, 15 Aug 2002 09:21:34 -0400


The original reason to share vocabularies was that multiple fields
often came from the same human "vocabularies". The idea was that vocabularies
would encompass a number of features including:

- Words (or n-grams) used

- Synonyms

- Stemming rules

- Stop words

- Splitting rules

There was, potentially, a lot of information to be shared, and it would
often be important, for consistency, to share the same rules for different
fields that contained the same sort of content. Sharing had as much
to do with using consistent rules as it did with optimization.

Unfortunately, the old text index never implemented a lot of these ideas. :(

The pipe-lining model used by ZCTextIndex moves some of this functionality
out of the lexicon and leaves some of these ideas unimplemented, as did
TextIndex.
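
To make the pipeline idea concrete, here is a minimal sketch of a lexicon
that runs text through a chain of elements (splitting, case normalization,
stop words) before assigning word ids. All class and method names below are
hypothetical and chosen for illustration; this is not the ZCTextIndex code.

```python
import re


class Splitter:
    """Hypothetical pipeline element: split each text chunk into words."""

    def process(self, items):
        words = []
        for text in items:
            words.extend(re.findall(r"\w+", text))
        return words


class CaseNormalizer:
    """Hypothetical pipeline element: lower-case every word."""

    def process(self, items):
        return [word.lower() for word in items]


class StopWordRemover:
    """Hypothetical pipeline element: drop words from a stop list."""

    def __init__(self, stop_words=("a", "the", "of", "and")):
        self.stop_words = frozenset(stop_words)

    def process(self, items):
        return [word for word in items if word not in self.stop_words]


class SketchLexicon:
    """Illustrative lexicon: maps words to integer ids after the pipeline runs."""

    def __init__(self, *pipeline):
        self.pipeline = pipeline
        self.word_ids = {}

    def source_to_word_ids(self, text):
        items = [text]
        for element in self.pipeline:
            items = element.process(items)
        # Each distinct word gets one id, no matter which index supplied it.
        return [self.word_ids.setdefault(word, len(self.word_ids) + 1)
                for word in items]


# Two fields (think: two indexes) sharing one lexicon and one set of rules.
shared = SketchLexicon(Splitter(), CaseNormalizer(), StopWordRemover())
title_ids = shared.source_to_word_ids("The cat and the Hat")
body_ids = shared.source_to_word_ids("hat of the cat")
```

Because both fields go through the same pipeline and the same word table,
"Hat" in one field and "hat" in the other end up with the same id, which is
the consistency argument for sharing.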

I think that there is at least potential value in sharing lexicons.
Of course, a downside is that it complicates setup.

On the subject of referencing lexicons by path rather than using direct
references, I'm inclined to agree that direct references are better for
simplicity and speed. It's easy enough to add a new index when you
want to change a lexicon. (Well, there are some complications having to do
with making sure that you get all the needed data into the new index...)
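
The difference between the two kinds of reference can be sketched as
follows. This is only an illustration under simplifying assumptions: the
object tree is a plain nested dict, and DirectIndex, PathIndex, and the
"/catalog/lexicon" path are hypothetical names, not part of ZCTextIndex.

```python
class DirectIndex:
    """Holds the lexicon object itself: simple and fast, but fixed."""

    def __init__(self, lexicon):
        self.lexicon = lexicon


class PathIndex:
    """Looks the lexicon up by path on every access (hypothetical sketch)."""

    def __init__(self, root, lexicon_path):
        self.root = root
        self.lexicon_path = lexicon_path

    @property
    def lexicon(self):
        obj = self.root
        # Walk the path one segment at a time; this costs a lookup per access.
        for name in self.lexicon_path.strip("/").split("/"):
            obj = obj[name]
        return obj


old_lexicon = {"cat": 1}
new_lexicon = {"dog": 1}
root = {"catalog": {"lexicon": old_lexicon}}

direct = DirectIndex(old_lexicon)
by_path = PathIndex(root, "/catalog/lexicon")

# Replace the lexicon in the container.
root["catalog"]["lexicon"] = new_lexicon

direct_still_old = direct.lexicon is old_lexicon
path_sees_new = by_path.lexicon is new_lexicon
```

After the replacement, the path-based index sees the new lexicon while its
stored word ids still refer to the old one, which is why the index has to be
cleared; the direct-reference index never gets into that state.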

Jim


Casey Duncan wrote:
> On Wednesday 14 August 2002 06:03 pm, Guido van Rossum wrote:
> 
>>>Fix for issue #505
>>>ZCTextIndex is now associated by path to its lexicon. After replacing a
>>>lexicon used by an index, clear the index to make it use the new lexicon.
> 
>>So the semantics are that when you replace the lexicon, the index is
>>reset to empty, right?  Why not create a new index instead?  Then the
>>lexicon could be internal to the index.  Sharing lexicons doesn't
>>sound like a probable use case, the more I think about it.
>>
>>--Guido van Rossum (home page: http://www.python.org/~guido/)
>>
>>
> 
> I don't disagree. This was a conceptual holdover from the previous generation 
> TextIndex. I'm switching this over to zope-dev for wider discussion:
> 
> The current implementation of ZCTextIndex is like the old TextIndex in that 
> you can create one Lexicon (the successor to Vocabularies) shared by multiple 
> ZCTextIndexes.
> 
> I imagine the thought was that there are only a finite number of words and 
> that sharing the lexicon would save space and possibly index time, since a 
> given word would only need to be inserted once into the lexicon regardless of 
> the number of indexes it occurred in. More significant might be the (cache) 
> memory savings of only having to keep one copy of the words in memory across 
> several indexes. Plus fewer loads and stores to the database overall by 
> sharing the word list.
> 
> On the other hand I think query speeds may be compromised since one large 
> lexicon would take longer to search for a given word (or words) than several 
> smaller ones. This would be especially true for small indexes sharing a 
> lexicon with a much larger one.
> 
> The other downside (as illustrated by issue #505) is the complication of 
> linking index to lexicon and making the link manageable so that you can tweak 
> the indexing system easily. My fix is not entirely complete because a hard 
> ref to the lexicon is still stored in the low-level index (to which the 
> ZCTextIndex class delegates). In order to fix this effectively without 
> introducing Zope dependencies at the low level (which we have looked to 
> avoid) I would need to create some sort of Lexicon proxy that can access the 
> correct lexicon on demand by a path efficiently. This proxy would be 
> referenced by the low level index in place of the actual lexicon.
> 
> Of course the other solution, which is much simpler, is to dispense with this 
> notion of sharing lexicons entirely and, as Guido suggests, just make the 
> lexicon part of the index.
> 
> Without hard use cases to the contrary, I lean toward that simpler design. 
> However, I would like to perform some additional testing on large corpora 
> with many indexes to assess the memory/performance differences between these 
> two approaches. We have already ascertained that with the new ZODB cache code 
> in 2.6, the cache setting can have a profound effect on query performance 
> (like a factor of 10), so I think testing would be helpful.
> 
> Anyone care to weigh in with use cases for shared lexicons?
> 
> -Casey
> 



-- 
Jim Fulton           mailto:jim@zope.com       Python Powered!
CTO                  (888) 344-4332            http://www.python.org
Zope Corporation     http://www.zope.com       http://www.zope.org