[Zope] Search with partial words on own ZClass

Michel Pelletier michel@digicool.com
Tue, 04 Apr 2000 13:44:20 -0700


Rik Hoekstra wrote:
> 
> >
> >
> >Partial searching is available in the CVS.
> >
> 
> Michel,
> 
> Is that just partial searching, or also wildcard and even regexp searching?

I define 'Partial' and 'wildcard' as the same thing, but I'm using my
own terminology so I could be wrong.  I define partial as 'finding part
or all of a word', which can be accomplished with wildcards: '*part*'.

The CVS supports '*' and '?' wildcard characters (the actual character
used is configurable, in case you really want to keep those question
marks).  This involved creating a new kind of Lexicon called a
GlobbingLexicon ('Globbing' is the use of * or ? to match patterns, for
those who didn't know...).  The GlobbingLexicon is quite flexible and
nice; the only disadvantage over a regular Lexicon (which does no
partial searching) is that it consumes about three times as much memory,
since each word is split into bi-grams and indexed in a mini 'lexical
index'.  It's really quite simple.  Take the words 'flexible' and
'fleece'.  When these words are added to a GlobbingLexicon, they are
turned into::

  flexible ->  ['$f', 'fl', 'le', 'ex', 'xi', 'ib', 'bl', 'le', 'e$']

  fleece  ->  ['$f', 'fl', 'le', 'ee', 'ec', 'ce', 'e$']

('$' indicates the beginning or end of a word)
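
In Python, splitting a word into bi-grams might look something like
this (a rough sketch, not the actual GlobbingLexicon code)::

  def bigrams(word):
      # '$' marks the beginning and end of the word
      padded = '$' + word + '$'
      return [padded[i:i+2] for i in range(len(padded) - 1)]

  bigrams('fleece')  # ['$f', 'fl', 'le', 'ee', 'ec', 'ce', 'e$']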

Each 'bi-gram' is indexed against the words that that bi-gram occurs
in::

'$f' -> ['flexible', 'fleece']
'fl' -> ['flexible', 'fleece']
'le' -> ['flexible', 'fleece']
'ex' -> ['flexible']
...
...
'e$' -> ['flexible', 'fleece']
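
Building that mapping is straightforward; something along these lines
(again just a sketch, using the bigrams() function above)::

  def build_lexical_index(words):
      index = {}
      for word in words:
          for bg in bigrams(word):
              # use a set so a repeated bi-gram ('le' occurs twice
              # in 'flexible') doesn't index the word twice
              index.setdefault(bg, set()).add(word)
      return index

  index = build_lexical_index(['flexible', 'fleece'])
  index['ex']  # {'flexible'}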

When you search for 'fle*', the Lexicon's query engine turns your query
into::

'fle*' -> ['$f', 'fl', 'le']

and then looks in the lexical index for words that contain those three
bi-grams.  It is possible for the word 'falafle' (no doubt wrongly
spelled) to contain those three bi-grams, and possible false matches
like that are weeded out at the end.  This is efficient, because at this
point we have discarded all but a few words in the lexicon.
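
Putting the sketches above together, the whole query step might look
roughly like this (query_bigrams and glob_search are made-up names,
and the real GlobbingLexicon lets you configure the wildcard
characters; this hard-codes * and ?)::

  import re
  import fnmatch

  def query_bigrams(pattern):
      # pad with '$' only where the pattern is anchored (no wildcard)
      if not pattern.startswith(('*', '?')):
          pattern = '$' + pattern
      if not pattern.endswith(('*', '?')):
          pattern = pattern + '$'
      grams = []
      # bi-grams come only from the literal runs between wildcards
      for chunk in re.split(r'[*?]', pattern):
          grams.extend(chunk[i:i+2] for i in range(len(chunk) - 1))
      return grams

  def glob_search(pattern, index):
      grams = query_bigrams(pattern)    # 'fle*' -> ['$f', 'fl', 'le']
      sets = [index.get(g, set()) for g in grams]
      candidates = set.intersection(*sets) if sets else set()
      # weed out false matches like 'falafle' with a final glob test
      # against the handful of surviving candidates
      return [w for w in candidates if fnmatch.fnmatchcase(w, pattern)]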

Regular expressions are not feasible in any searching system.  Although
it may be possible, with the existing lexical analysis that globbing
lexicons do, to implement a larger subset of regexps than just * and ?,
it is not feasible to implement the entire regexp language.

> And since you keep locations of the words, is there proximity searching also
> possible?

The location in the document is not kept, just the score.  There are
TextIndex methods, however, for finding the positions of words in a
document; this is used to support the 'Near' operator, which is '...'.
This operator exists in TextIndexes now (it always has, since I took
over the indexing realm).  I tested it a few months ago but couldn't
get the concept to work.  I suspect it's buggy; the code is a holdover
from ZTables.
> 
> Another question: how do I retrieve a list of unique words from a full-text
> catalog? 

In 2.1, you need to hack the lexicon from Python.  In 2.2, you call a
Vocabulary object's 'words' method, or you can call the Vocabulary with
the pattern '*' to match all words, or with a more restrictive pattern
if you only want the unique words that match it, like '*ing' for all
the words that end in 'ing'.
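
From Python that would look something like this (going from memory on
the 2.2 interface, so take the exact spelling with a grain of salt)::

  unique_words = vocab.words()  # every unique word in the Vocabulary
  ing_words = vocab('*ing')     # only the unique words ending in 'ing'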

> Now, I know there is no standard way, but is it possible at all.

In 2.2 it is standard (and documented in the Interfaces Wiki).

> Can I use the items, keys etc interfaces of the text index (perhaps with
> some python hacking)?

TextIndexes do not store the word; they store an integer that the
lexicon maps to a word.  This is so text indexes can be language
independent.
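
Conceptually it's something like this (an illustration, not the actual
Zope data structures)::

  lexicon = {'flexible': 1, 'fleece': 2}    # word -> integer id
  index = {1: {'doc7': 3}, 2: {'doc9': 1}}  # word id -> {doc: score}

so keys() and items() on the index give you integer ids, and you need
the lexicon to translate those back into words.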

-Michel