[Zope] zcatalog -- returning context of hits on fulltext

R. David Murray bitz@bitdance.com
Mon, 14 Aug 2000 23:10:27 -0400 (EDT)


On Mon, 14 Aug 2000, Jimmie Houchin wrote:
> I may be clueless and out of my league here and I haven't read the
> sources so I don't know... Well enough of a disclaimer. :)

I *have* read the ZCatalog/SearchIndex sources, but I don't understand
this part of it yet (or really that much of it at all!).  I think
we're getting into zope-dev territory here...

> Is there anything in there which can provide the seek or byte position
> of the hit within text object? If so, it shouldn't be too difficult to
> read X bytes before and after the position and thereby provide what your
> looking for.

The standard TextIndex implementation records a notion of "position" for
each occurrence of each word indexed.  I *think* this position is a word
count position, but I'm not sure.  Part of the code references a
'row', but it isn't at all clear that that has any relationship to
a source record.  If it is a word count, the other thing you'd need to
check would be whether it is a word count before or after splitter
activity.  I think it's the latter, which makes things more complicated.
Or just means you have to use more fuzz in your context <grin>.
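
To make the before/after-splitter distinction concrete, here is a minimal
sketch in plain Python.  It is not the TextIndex code: the stop-word list
and both function names are made up for illustration, and I'm assuming the
splitter drops stop words, which is only one way the counts could diverge.

```python
# Hypothetical stop-word list; the real splitter's behaviour may differ.
STOP_WORDS = {"the", "a", "of", "is", "on"}


def raw_positions(text):
    """Map each word to its positions, counting every whitespace-separated word."""
    positions = {}
    for i, word in enumerate(text.lower().split()):
        positions.setdefault(word, []).append(i)
    return positions


def split_positions(text):
    """Same mapping, but counting only the words that survive stop-word removal."""
    kept = [w for w in text.lower().split() if w not in STOP_WORDS]
    positions = {}
    for i, word in enumerate(kept):
        positions.setdefault(word, []).append(i)
    return positions


text = "The position of the word is recorded on the index"
print(raw_positions(text)["index"])    # counted against the original text
print(split_positions(text)["index"])  # counted after the (assumed) splitter
```

The two numbers differ for the same word, and the gap grows with the
number of dropped words ahead of the hit, hence the need for fuzz.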

> This would be nice to have out of the box.

The TextIndex 'position' information is intended to be used for
the 'near' operator (...) (so you can search on multiple words
"close" to each other for some definition of close).  You could
also use it to enforce word order (maybe the "" operator does
that?).  Currently I think the result of applying the near operator
is used to adjust the "weight" of the index match, which affects
the order of the results returned.  (I haven't tested to see if
any of this works!)
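
As a rough illustration of how position lists could drive a 'near'
weight or a word-order check, here is a sketch.  The function names, the
max_distance cutoff and the 1/distance scoring are all my own assumptions,
not what TextIndex actually does.

```python
def near_weight(positions_a, positions_b, max_distance=5):
    """Score in [0, 1] that is higher the closer two words occur.

    positions_a / positions_b are lists of word positions for each term.
    Returns 0.0 if the words never occur within max_distance of each other.
    """
    best = None
    for a in positions_a:
        for b in positions_b:
            distance = abs(a - b)
            if best is None or distance < best:
                best = distance
    if best is None or best > max_distance:
        return 0.0
    # Adjacent words (distance 1) score 1.0; the score falls off with distance.
    return 1.0 / max(best, 1)


def in_order(positions_a, positions_b):
    """True if some occurrence of word A is immediately followed by word B."""
    following = set(positions_b)
    return any(a + 1 in following for a in positions_a)


print(near_weight([2, 10], [3]))  # adjacent occurrence exists
print(in_order([2, 10], [3]))     # A at 2 is directly followed by B at 3
```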

So, the basic information you are looking for is there in some sense
to establish the position, but you'd still have to retrieve the
original sentences from the object itself, or from a full-text
metadata field.  Both of these are going to be memory-intensive
operations.  If you index based on, say, individual lines, you'd
lose some of the benefits of the near operator, though.  So
I'd say indexing based on paragraphs would probably be your best
approach.  This would also help mask position errors introduced if
the word count is indeed post-splitter.  Of course, you'll have to
descend to Python to get access to the methods that will return the
actual position information.  But at least the code to record it
is already there.
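
Once you have a word position and the paragraph text in hand, pulling
out the context is the easy part.  A minimal sketch, assuming the position
is a zero-based word count into the paragraph; the function name and the
'fuzz' parameter are hypothetical, and fetching the position from the
index is not shown.

```python
def context(paragraph, position, fuzz=3):
    """Return the words within `fuzz` words of `position` in `paragraph`.

    A larger fuzz helps hide the offset error if the recorded position
    was counted after the splitter rather than against the raw text.
    """
    words = paragraph.split()
    start = max(0, position - fuzz)
    end = min(len(words), position + fuzz + 1)
    return " ".join(words[start:end])


paragraph = "one two three four five six seven eight nine"
print(context(paragraph, 4, fuzz=2))  # two words either side of "five"
```

The window is clamped at both ends, so a hit near the start or end of
the paragraph just returns a shorter snippet.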

Take a look at lib/python/SearchIndex/TextIndex.py for source
enlightenment.

--RDM