[Zope] Weighing catalog searches per index ?

Fri Jan 9 18:00:11 EST 2004

Casey,

Ahhh, so it just multiplies the score.  Which also means the scoring is
applied to each field, instead of merging the fields and THEN scoring.  But
doesn't that mean that even if not restricting the search to specific
fields, the score coming out of one of our indexes could be different than a
pure ZCTextIndex which scores on just one big "blob" of textual content,
instead of several small ones ... At least with Okapi that would presumably
make a difference since part of the scoring is based on the total number of
words in the document ?
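
To make the concern concrete, here is a rough sketch of the term-frequency
part of an Okapi BM25 score, computed once per field and once on the merged
"blob". It is not the okascore code: the IDF factor is left out, K1 and B are
just commonly quoted defaults, and the word counts are made up for
illustration.

```python
# Simplified BM25 term-frequency component (IDF omitted).
# K1 and B are assumed values, not read from ZCTextIndex.
K1 = 1.2
B = 0.75

def bm25_tf(tf, doc_len, avg_doc_len):
    """Score contribution of one term, normalized by document length."""
    norm = (1 - B) + B * doc_len / avg_doc_len
    return tf * (K1 + 1) / (tf + K1 * norm)

# Hypothetical document: the term occurs once, in a 5-word title;
# the 195-word description does not contain it.
title_score = bm25_tf(tf=1, doc_len=5, avg_doc_len=5)
description_score = bm25_tf(tf=0, doc_len=195, avg_doc_len=95)
per_field_score = title_score + description_score

# Same document indexed as one 200-word blob (average blob: 100 words).
blob_score = bm25_tf(tf=1, doc_len=200, avg_doc_len=100)

print(per_field_score, blob_score)
```

With these made-up numbers the per-field total comes out higher than the blob
score, because the long description no longer dilutes the hit in the short
title.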

But then that's not too big a deal, so long as whatever differences are
explained/documented :) I can help with that if you'd like.

As for the syntax of the querying, I'm really indifferent, so long as it
works :) I guess your suggestion does have advantages over mine indeed
though !

Thanks for getting this done ! Let me know as soon as you've got it and I'll
gladly try it out.

Since this can be made into a transparent extension of ZCTextIndex, I'd
really suggest that if/when this is deemed mature enough, it replace the
current ZCTextIndex.  This searching functionality is kind of invaluable and
extremely powerful, and I'm sure would be of great use to many once they
find out about it !

Thanks for the great help!
J.F.

-----Original Message-----
From: Casey Duncan [mailto:casey at zope.com]
Sent: Thursday, January 08, 2004 4:54 PM
To: Jean-Francois.Doyon at CCRS.NRCan.gc.ca
Cc: zope at zope.org
Subject: Re: [Zope] Weighing catalog searches per index ?

On Thu, 8 Jan 2004 16:24:58 -0500 
Jean-Francois.Doyon at CCRS.NRCan.gc.ca wrote:

> Casey,
> 
> Thanks for pointing out this product, I'll have to give it a try, as I
> can foresee many useful applications for it !

Cool. It's new and I'm eager to get feedback from the field on it (no pun
intended).

[...]
> 
> Your product seems to have a good base to start with.  The problem
> now, and one that stopped me in my tracks, is how to
> define/calculate/configure this "weighing" concept.  You suggest
> there's some underlying functionality for weighing already, maybe it'd
> just be a matter of taking advantage of it, and documenting how to use
> it ? The big question would be what does a weight of "1" MEAN versus a
> weight of "2" or "5" ?

ZCTextIndex calculates document and word scores. When queries are
performed these scores are combined as intermediate results are combined
(using unions and intersections). The weighted versions of these
commands allow you to weight one set differently than another. The
weight multiplies the score by some factor as the set operation is
performed.
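
As a rough illustration of the idea (not the actual implementation, which is
in C and works on BTree structures), a weighted union and intersection over
plain {docid: score} dicts might look like this. The function names and the
sample scores are made up for the example.

```python
def weighted_union(set1, set2, weight1=1, weight2=1):
    """Union of two {docid: score} mappings, each side scaled by its weight."""
    result = {}
    for docid, score in set1.items():
        result[docid] = score * weight1
    for docid, score in set2.items():
        # Documents found on both sides get the sum of their weighted scores.
        result[docid] = result.get(docid, 0) + score * weight2
    return result

def weighted_intersection(set1, set2, weight1=1, weight2=1):
    """Intersection of two {docid: score} mappings with weighted scores."""
    return {
        docid: set1[docid] * weight1 + set2[docid] * weight2
        for docid in set1
        if docid in set2
    }

# Hypothetical intermediate results from two indexes.
title_hits = {1: 10, 2: 4}
description_hits = {2: 6, 3: 8}

print(weighted_union(title_hits, description_hits, 5, 1))
# {1: 50, 2: 26, 3: 8}
print(weighted_intersection(title_hits, description_hits, 5, 1))
# {2: 26}
```

With both weights left at 1 this reduces to the ordinary unweighted
combination.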

> The other is how it gets purely implemented.  Does the weight need to
> be known at indexing time, or can it be provided at search time ? My
> hunch is the weighing should be applied at search time, so your
> product could be modified to take as input the weights to apply to
> each index that is being searched through ?

Could be done either way. Weighing at index time might be more
efficient, but would not allow different weights to be applied for
different queries. I doubt that query-time weighting would slow things
down at all since it is already being done (only the weight factors are
always 1). All of the set operations are implemented in C.

> Something like:
> 
> result = catalog(dc_fields={"query":"Some search string",
> "fields":["Title","Description"]})
> 
> could become:
> 
> result = catalog(dc_fields={"query":"Some search string",
> "fields":["Title","Description"], "weights":[5,1]})

Sure or maybe:

result = catalog(dc_fields={"query":"Some search string",
                            "weighted_fields":{'title':5,
                                               'description':1}})

This might be slightly less error prone (otherwise you need to match up
the lists), if slightly less readable. :record marshalling for
weighted_fields could also be supported for queries from web forms.

Either spelling would work though and I'm open to input.
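
For what it's worth, both spellings could be reduced to the same
{field: weight} mapping up front. A sketch, where normalize_weights is a
hypothetical helper and not something in the product:

```python
def normalize_weights(query):
    """Return {field: weight} from either proposed query spelling."""
    if "weighted_fields" in query:
        return dict(query["weighted_fields"])
    fields = query["fields"]
    # With no weights given, every field counts equally.
    weights = query.get("weights", [1] * len(fields))
    if len(fields) != len(weights):
        # The failure mode of the two-list spelling: lists out of step.
        raise ValueError("fields and weights must have the same length")
    return dict(zip(fields, weights))

list_style = {"query": "Some search string",
              "fields": ["Title", "Description"], "weights": [5, 1]}
dict_style = {"query": "Some search string",
              "weighted_fields": {"title": 5, "description": 1}}

print(normalize_weights(list_style))
print(normalize_weights(dict_style))
```

The length check is only needed for the two-list spelling; the mapping
spelling cannot get out of step.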

> Meaning apply a weight of 5 to Title, and 1 to Description.  Which I
> would in turn interpret as meaning Title is 5 times more important
> than Description (Not knowing any better right now).

Yes, scores for words found in the title would get multiplied by 5.
Scores for description would get multiplied by 1.

> Personally I'm using the Okapi algorithm.  When I started
> investigating this, I came to the (admittedly uneducated) conclusion
> that to do proper, fast weighing, then the Okapi implementation would
> have to be modified to support this feature (Maybe it does already
> ??), which is over my head, especially with the okascore module being
> Python/C.  Doing it in python would mean doing a second pass over the
> results that have already been scored once, which is inefficient it
> seems, and computationally intensive (Especially as I envision the fact
> that really really nice weighing algorithms would need to have all
> content in memory in order to do relational work between records).

I don't think the scoring algorithm would be affected by what you propose.
I'd need to dig in a little deeper to be sure though.

> Anyways, that's what I've been thinking about ... But the benefits of
> having such a beast seem really tantalizing, so I thought I'd ask
> anyways ... Besides maybe I'm way out to left field on this and it's
> easier than I make it out to be ?! :)

I think this is a very compelling addition to the product. I'm going to
look at implementing it this weekend.

Thanks for the idea!

-Casey