[Zope-dev] Shared lexicons for ZCTextIndex (was: Re: [Zope-Checkins] CVS: Zope/lib/python/Products/ZCTextIndex - ZCTextIndex.py:1.32)

Casey Duncan casey@zope.com
Wed, 14 Aug 2002 23:23:20 -0400


On Wednesday 14 August 2002 06:03 pm, Guido van Rossum wrote:
> > Fix for issue #505
> > ZCTextIndex is now associated by path to its lexicon. After replacing a
> > lexicon used by an index, clear the index to make it use the new lexicon.
>
> So the semantics are that when you replace the lexicon, the index is
> reset to empty, right?  Why not create a new index instead?  Then the
> lexicon could be internal to the index.  Sharing lexicons doesn't
> sound like a probable use case, the more I think about it.
>
> --Guido van Rossum (home page: http://www.python.org/~guido/)
>

I don't disagree. This was a conceptual holdover from the previous generation
TextIndex. I'm switching this over to zope-dev for wider discussion:

The current implementation of ZCTextIndex is like the old TextIndex in that
you can create one Lexicon (the successor to Vocabularies) shared by multiple
ZCTextIndexes.

I imagine the thought was that there are only a finite number of words and
that sharing the lexicon would save space and possibly index time, since a
given word would only need to be inserted once into the lexicon regardless of
the number of indexes it occurred in. More significant might be the (cache)
memory savings of only having to keep one copy of the words in memory across
several indexes. Plus fewer loads and stores to the database overall by
sharing the word list.
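
To make the space argument concrete, here is a minimal sketch of two indexes
sharing one word list. ToyLexicon and ToyIndex are hypothetical stand-ins
invented for illustration; they are not the real ZCTextIndex classes and
ignore persistence, splitting rules and stop words.

```python
class ToyLexicon:
    """Hypothetical lexicon: maps each distinct word to an integer word id."""

    def __init__(self):
        self._wids = {}

    def source_to_word_ids(self, text):
        """Return word ids for the text, adding unseen words as needed."""
        wids = []
        for word in text.lower().split():
            if word not in self._wids:
                self._wids[word] = len(self._wids) + 1
            wids.append(self._wids[word])
        return wids

    def __len__(self):
        return len(self._wids)


class ToyIndex:
    """Hypothetical index: stores word ids only, never the words themselves."""

    def __init__(self, lexicon):
        self.lexicon = lexicon
        self.postings = {}  # word id -> set of document ids

    def index_doc(self, docid, text):
        for wid in self.lexicon.source_to_word_ids(text):
            self.postings.setdefault(wid, set()).add(docid)


# Shared: both indexes store their words in one lexicon.
shared = ToyLexicon()
title_index = ToyIndex(shared)
body_index = ToyIndex(shared)
title_index.index_doc(1, "zope text index")
body_index.index_doc(1, "zope lexicon index")

# Separate: each index owns a private lexicon.
own_title = ToyIndex(ToyLexicon())
own_body = ToyIndex(ToyLexicon())
own_title.index_doc(1, "zope text index")
own_body.index_doc(1, "zope lexicon index")

shared_words = len(shared)
separate_words = len(own_title.lexicon) + len(own_body.lexicon)
```

In this toy case the shared lexicon holds 4 words while the two private ones
hold 6 between them, because "zope" and "index" are stored twice.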

On the other hand, I think query speeds may be compromised, since one large
lexicon would take longer to search for a given word (or words) than several
smaller ones. This would be especially true for small indexes sharing a
lexicon with a much larger one.

The other downside (as illustrated by issue #505) is the complication of
linking index to lexicon and making the link manageable so that you can tweak
the indexing system easily. My fix is not entirely complete because a hard
ref to the lexicon is still stored in the low-level index (to which the
ZCTextIndex class delegates). In order to fix this effectively without
introducing Zope dependencies at the low level (which we have tried to
avoid), I would need to create some sort of Lexicon proxy that can efficiently
look up the correct lexicon by path on demand. This proxy would be
referenced by the low-level index in place of the actual lexicon.
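
A rough sketch of what I mean by such a proxy follows. Everything in it is
hypothetical: LexiconProxy, FakeLexicon and the dict used as an object tree
are made up for illustration. The low-level code would see only a path and a
resolver callable passed in from above, so it would need no Zope imports; in
Zope the resolver would presumably be some form of path traversal.

```python
class LexiconProxy:
    """Hypothetical proxy: finds the real lexicon by path on every access."""

    def __init__(self, path, resolve):
        self._path = path
        self._resolve = resolve  # callable: path -> lexicon object

    def __getattr__(self, name):
        # Only called for attributes not found on the proxy itself.
        # Refuse private names so a half-built proxy cannot recurse forever.
        if name.startswith("_"):
            raise AttributeError(name)
        return getattr(self._resolve(self._path), name)


class FakeLexicon:
    """Stand-in for a lexicon object, just enough to tell instances apart."""

    def __init__(self, name):
        self.name = name


# A plain dict plays the role of the object tree in this sketch.
tree = {"/catalog/lexicon": FakeLexicon("old")}
proxy = LexiconProxy("/catalog/lexicon", tree.__getitem__)

name_before = proxy.name
tree["/catalog/lexicon"] = FakeLexicon("new")  # lexicon replaced at same path
name_after = proxy.name
```

Because the proxy holds no reference to the lexicon itself, replacing the
object at that path is picked up on the next access. The cost is a lookup per
access; caching the resolved object would be faster but would bring back the
stale hard ref unless the cache were invalidated somehow.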

Of course the other solution, which is much simpler, is to dispense with this
notion of sharing lexicons entirely and, as Guido suggests, just make the
lexicon part of the index.

Without hard use cases to the contrary, I lean toward that simpler design.
However, I would like to perform some additional testing on large corpuses
with many indexes to assess the memory/performance differences between these
two approaches. We have already ascertained that with the new ZODB cache code
in 2.6, the cache setting can have a profound effect on query performance
(like a factor of 10), so I think testing would be helpful.

Anyone care to weigh in with use cases for shared lexicons?

-Casey