[Zope] Vocabulary and stop words

R. David Murray bitz@bitdance.com
Tue, 22 Aug 2000 23:31:55 -0400 (EDT)


On Tue, 22 Aug 2000, Andy McKay wrote:
> I haven't been able to find TFM to read on Vocabulary and stop words in ZCatalog. I need to search by stuff such as XML::Parser and I think I need to patch 2.2 to do it. But a FM would help. Can anyone point me that way?

I don't think there is one.  Basically, if you want to search on terms
that include punctuation, you have to write your own Splitter.c.
Have fun <wry grin>.  You don't, by the way, have to write it
in C, although not doing so presumably has performance implications
or it wouldn't have been written in C to begin with.  But if all
you want the splitter to do is split at blanks and truncate long
words, you probably don't need C...
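
As a rough illustration of how little such a splitter needs to do, here
is a minimal pure-Python sketch.  The function name, the 64-character
cutoff, and the stop word handling are my own assumptions for the
example; this is not the interface Zope's Splitter.c exposes, so it
would still need to be adapted to whatever your Vocabulary expects.

```python
MAX_WORD_LEN = 64  # assumed cutoff for the example, not taken from Zope


def split_words(text, stop_words=(), max_len=MAX_WORD_LEN):
    """Split at whitespace only, lowercase, truncate long words,
    and drop stop words.  Punctuation inside a word is preserved."""
    words = []
    for raw in text.split():
        word = raw.lower()[:max_len]
        if word and word not in stop_words:
            words.append(word)
    return words
```

Because it only breaks at blanks, a term like XML::Parser survives as
the single word "xml::parser" instead of being cut in two.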

You are presumably talking about text indexes if you are worried
about Vocabulary and stop words.  Most of the guts of this stuff
is actually located in a module named SearchIndex.  Reading the
source code there is as close to a FM as I think you'll get right
now.  The current Vocabulary does some appropriate wrapping up
of modules in SearchIndex for use by Catalog; if you want to do
your own thing without touching Zope's default machinery you'll
need to write your own Vocabulary object.  It shouldn't be
too hard if you model it after the existing source.

The current text index *does* try to do something sensible in the
case you cite, however.  Words are indexed after being broken at
punctuation.  When a word containing embedded punctuation is used
as a search term, it is turned into a "near" search (xml near
parser, for example).  I have not tested whether or not this actually
works, but from my reading of the code I *think* what it does is
equivalent to an 'and' search on the two words except that the
nearer the two words are in the document the earlier in the result
set the document appears (assuming you don't sort the result set
yourself).
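
To make that reading concrete, here is a toy sketch of the behavior as
I understand it: break the term at punctuation, require every piece to
be present (the 'and' part), then rank by how close the pieces are.
The index layout, the function names, and the scoring formula are all
invented for the illustration; none of it is the actual SearchIndex
code.

```python
import re


def split_term(term):
    """Break a search term at punctuation and lowercase the pieces."""
    return [w for w in re.split(r"[^A-Za-z0-9]+", term.lower()) if w]


def near_score(positions_a, positions_b):
    """Higher score when the closest occurrences are nearer together."""
    best = min(abs(a - b) for a in positions_a for b in positions_b)
    return 1.0 / (1 + best)


def near_search(index, term):
    """index maps word -> {doc_id: [word positions]} (assumed layout)."""
    words = split_term(term)
    if not words:
        return []
    # The 'and' step: keep only documents that contain every word.
    docs = set(index.get(words[0], {}))
    for word in words[1:]:
        docs &= set(index.get(word, {}))
    # The 'near' step: score each neighboring pair of words by proximity.
    scored = []
    for doc in docs:
        score = 0.0
        for left, right in zip(words, words[1:]):
            score += near_score(index[left][doc], index[right][doc])
        scored.append((score, doc))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc for score, doc in scored]
```

With this sketch a document where "xml" and "parser" sit side by side
comes out ahead of one where they are several words apart, and a
document missing either word is not returned at all.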

Note that there was a longstanding bug in the search term parsing
machinery that caused some search terms with embedded punctuation
to fail to return any results.  I submitted a patch for this that
has been incorporated as of 2.2.1b1.  (The bug should not have
affected a search term like XML::Parser.)

In theory, I think that instead of rewriting the splitter module
you could rewrite SearchIndex/ResultList's notion of what 'near'
means to constrain the words to be right next to each other.  You
should even be able to enforce ordering.  If it works, it might be
easier than rewriting the splitter, since you'd only be changing
one python function.
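
A stricter 'near' of that kind might look something like the sketch
below.  It assumes you can get at a list of word positions per
document, which is my assumption about the data available and not a
description of what ResultList actually passes around; the function
name is hypothetical too.

```python
def phrase_match(doc_positions, words):
    """Return True if the words occur right next to each other, in order.

    doc_positions maps word -> [positions] for a single document
    (an assumed layout chosen for this example).
    """
    if not words:
        return False
    # Candidate start positions: every occurrence of the first word.
    starts = set(doc_positions.get(words[0], ()))
    # Each later word must appear exactly `offset` places after a start.
    for offset, word in enumerate(words[1:], start=1):
        starts &= {p - offset for p in doc_positions.get(word, ())}
    return bool(starts)
```

Under this definition "xml" at position 3 followed by "parser" at
position 4 matches, while "parser xml" or the two words separated by
other text does not.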

I've been digging around in the SearchIndex code for a while now,
so if you want to ask me more questions, go ahead.  It doesn't mean
I'll know the answers, but I'm happy to share whatever I *have*
learned.

--RDM

PS: this question is really more in 'zope-dev' territory than
'zope' territory, if you want to move it.