[Zope] Re: Ultraseek Content Classification Engine

Michel Pelletier michel@digicool.com
Fri, 01 Oct 1999 14:02:58 -0400


(I cc:ed the Zope list on this because there is some good information
for the community)

Jon Udell wrote:
> 
> This is a nifty concept: you set up rules that map search results into a
> Yahoo-like category tree. You can create these rules interactively or,
> since it's all expressed in XML, you could in principle derive a ruleset
> by some other means and then have Ultraseek CCE use it.

This can be done rather nicely with a general purpose object index. 
Given objects (like documents) with various properties, you can
designate a property 'keywords' that defines the set of nodes in a
category hierarchy the object fits into.  I envision you could do this
two ways: as a set of singleton nodes (so that multiple 'paths' from the
hierarchy root can be expressed in one keyword, with the caveat that you
can't 'suppress' paths), or as a set of fully delimited paths or
'ordered keywords'.
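
A minimal sketch of the two encodings, in plain Python rather than
ZCatalog code.  The category tree, the '/' delimiter, and both function
names are assumptions made up for illustration; the point is only that
a singleton keyword lands an object under every path ending in that
node, while an ordered keyword names exactly one path.

```python
# Toy illustration only; the tree and the helper names are hypothetical.
TREE_PATHS = [
    "Computers/Languages/Python",
    "Animals/Reptiles/Python",
    "Computers/Languages/Perl",
]


def paths_for_singletons(keywords, tree_paths=TREE_PATHS, sep="/"):
    """Singleton encoding: a keyword is a bare node name.

    The object appears under every tree path whose last node is one of
    its keywords, so an unwanted path cannot be suppressed.
    """
    return sorted(p for p in tree_paths if p.split(sep)[-1] in keywords)


def paths_for_ordered(keywords, tree_paths=TREE_PATHS):
    """Ordered encoding: a keyword is a full delimited path from the root."""
    return sorted(p for p in tree_paths if p in keywords)


singleton_hits = paths_for_singletons({"Python"})
ordered_hits = paths_for_ordered({"Computers/Languages/Python"})
print(singleton_hits)  # both 'Python' nodes
print(ordered_hits)    # only the path that was spelled out
```

The singleton form is shorter to write and follows the tree as it
grows; the ordered form is more verbose but says exactly where the
object belongs.
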
> 
> Here's the example that Ultraseek's site refers to:
> <http://search.state.mn.us/>.
> 
> Has anyone used Ultraseek with its CCE? Or a similar system (e.g. Verity
> Topic) that does rule-based mapping of results onto a category tree? I'd
> be curious to hear from someone who's wrestled, using tools like this,
> with a fairly large corpus of documents, and can speak to the issues
> involved in creating/maintaining the mappings from results space into
> category space.

I cannot give you examples with the software you mention, but I can give
you an example that we are working on with ZCatalog.  We are designing a
'Topic' based system that works as a categorical hierarchy, a la Yahoo. 
ZCatalog is an object index, much like the one I describe above.

ZCatalog indexes objects into an arbitrary set of various kinds of
indexes.  Each index is responsible for indexing one particular property
of an object.  If an object being indexed does not have a property
that an index is looking for, the object is simply not indexed in that
index (but it may have a property that another index is looking for, and
therefore will be indexed in *that* index).
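
As a rough sketch of that dispatch (plain Python, not the ZCatalog
source; ToyIndex, ToyCatalog, and Doc are hypothetical names), each
index looks for one property and quietly skips objects that lack it:

```python
# Toy model of "one index per property"; not ZCatalog's actual classes.
_MISSING = object()


class ToyIndex:
    def __init__(self, attr):
        self.attr = attr      # the one property this index cares about
        self.entries = {}     # object id -> indexed value

    def index_object(self, obj_id, obj):
        value = getattr(obj, self.attr, _MISSING)
        if value is _MISSING:
            return False      # property absent: object is not indexed here
        self.entries[obj_id] = value
        return True


class ToyCatalog:
    def __init__(self, attrs):
        self.indexes = {attr: ToyIndex(attr) for attr in attrs}

    def catalog_object(self, obj_id, obj):
        """Offer the object to every index; return the ones that took it."""
        return [name for name, index in self.indexes.items()
                if index.index_object(obj_id, obj)]


class Doc:
    def __init__(self, **props):
        self.__dict__.update(props)


catalog = ToyCatalog(["title", "keywords"])
indexed_a = catalog.catalog_object("a", Doc(title="Intro", keywords=["zope"]))
indexed_b = catalog.catalog_object("b", Doc(title="Untagged"))
print(indexed_a)  # ['title', 'keywords']
print(indexed_b)  # ['title']
```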

The CVS version of ZCatalog uses three types of index (in 2.0, there are
only the first 2):

  Field Index: property values are treated atomically.  Indexes can be
queried for all objects that match that value.  Range searches can also
be done on indexed object values that support comparison (like numbers,
dates, special purpose 'length' objects, etc).  Indexes can also be
queried for the set of unique values in the index; for example, you can
ask for the set of unique 'meta_types' of all objects indexed.  A good
example of this is the search by 'type' on the Zope site
(http://www.zope.org/SiteIndex/searchForm).

  Text Index: property values are applied against a lexicon object that
stems, stops, and parses the value into a full text index.  The index
may be queried with a simple boolean query language that allows 'and',
'or', phrasing, parenthesized boolean expressions, and proximity
matching.  Relevance ranking is supported and returns the sum of the
occurrences of all query terms in the 'hit'.  A score normalized from 0
to 100 over the whole result set is also provided.

  Keyword Index: Subclasses all of the field index behavior, except that
property values are treated as a sequence of keywords.
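
To make the field/keyword relationship concrete, here is a small
sketch in plain Python.  The class and method names are invented for
illustration and are not ZCatalog's API, and the linear scan in the
range search is a simplification (a real index would keep its values
sorted).  The keyword index differs from the field index only in how a
property value is turned into index keys:

```python
# Toy field and keyword indexes; names are hypothetical, not ZCatalog's.
class ToyFieldIndex:
    def __init__(self):
        self._by_value = {}   # index key -> set of object ids

    def _keys_for(self, value):
        return [value]        # the whole value is one atomic key

    def index(self, obj_id, value):
        for key in self._keys_for(value):
            self._by_value.setdefault(key, set()).add(obj_id)

    def match(self, value):
        """All objects whose indexed key equals value."""
        return set(self._by_value.get(value, ()))

    def range(self, low, high):
        """All objects whose key satisfies low <= key <= high."""
        hits = set()
        for key, ids in self._by_value.items():
            if low <= key <= high:
                hits |= ids
        return hits

    def unique_values(self):
        return sorted(self._by_value)


class ToyKeywordIndex(ToyFieldIndex):
    def _keys_for(self, value):
        return list(value)    # each element of the sequence is its own key


sizes = ToyFieldIndex()
sizes.index("a", 10)
sizes.index("b", 25)
sizes.index("c", 25)

topics = ToyKeywordIndex()
topics.index("a", ["zope", "python"])
topics.index("b", ["python"])

print(sizes.match(25), sizes.range(5, 20), sizes.unique_values())
print(topics.match("python"), topics.unique_values())
```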

The ZCatalog can work in a UNIX 'find' like fashion, where it spiders
over the object hierarchy indexing objects, or Zope classes may subclass
behavior that makes them Catalog Aware, allowing them to index/unindex
themselves when their state changes.
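
The second mode can be pictured roughly like this.  It is a toy model
in plain Python with invented names (ToyCatalog, ToyCatalogAware,
NewsItem), not the Zope base class itself: the object holds a reference
to its catalog and re-indexes itself whenever it changes.

```python
# Toy "catalog aware" object; all names here are hypothetical.
class ToyCatalog:
    def __init__(self):
        self.titles = {}      # uid -> indexed title

    def catalog_object(self, uid, obj):
        self.titles[uid] = obj.title

    def uncatalog_object(self, uid):
        self.titles.pop(uid, None)


class ToyCatalogAware:
    """Mixin-style base: subclasses call reindex() after changing state."""

    def __init__(self, uid, catalog):
        self.uid = uid
        self.catalog = catalog

    def reindex(self):
        self.catalog.catalog_object(self.uid, self)

    def unindex(self):
        self.catalog.uncatalog_object(self.uid)


class NewsItem(ToyCatalogAware):
    def __init__(self, uid, catalog, title):
        super().__init__(uid, catalog)
        self.title = title
        self.reindex()        # index on creation

    def retitle(self, title):
        self.title = title
        self.reindex()        # keep the catalog in step with the object


catalog = ToyCatalog()
item = NewsItem("news-1", catalog, "Draft")
item.retitle("Zope 2.0 released")
print(catalog.titles)
```

Nothing has to crawl the hierarchy in this mode; the catalog is current
as soon as the object's own code says so.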

The new Zope site is driven by ZCatalog.  All of the product listings,
member contributions, news items, links, tips, documentation, how-tos,
etc. are catalog aware objects that index themselves.  As new content
is added anywhere in the site, all of the various dynamically generated
information is updated.  Some users have the proper access credentials
to review or immediately submit new content; other users must
submit content for review before it is cataloged.  On the bottom of
every screen, there is a 'DTML Source' link that shows you the DTML that
generated that page.  There, you can see the various clever ways that
the Catalog is used throughout the site.

-Michel



> 
> --
> Jon Udell | <http://udell.roninhouse.com/> | 603-355-8980