[Zope-Dev] Some thoughts on splitter (Sin Hang Kin)

Christian Wittern chris@ccbs.ntu.edu.tw
Fri, 21 Apr 2000 14:10:23 +0800


After giving it some thought over the past few days, I came up with some
more things re Splitter and Catalog searching in general. I will first post
them here and see what feedback people might have and then put them into the
WIKI.

As was pointed out repeatedly, words, word boundaries and the like do not
exist in some Asian languages (or writing systems) in the same way as they
do in Western languages. One way to overcome the problems associated with
word splitting is to do no word splitting at all and instead split on every
character.

As soon as ZCatalog starts using Unicode, this could even be incorporated in
the default Splitter, which could be told to do word splitting on some
character ranges and character splitting on others.
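To make the idea concrete, here is a minimal sketch in Python of such a
mixed splitter. It is not the existing Zope Splitter; the function name
split_mixed and the choice of character ranges are assumptions made only
for illustration.

```python
import re

# Assumed ranges, for illustration only: CJK Unified Ideographs,
# Hiragana and Katakana, Hangul syllables.
CJK = "\u4e00-\u9fff\u3040-\u30ff\uac00-\ud7a3"

# A token is either a single CJK character, or a run of word
# characters that are not in the CJK ranges.
TOKEN = re.compile("[%s]|[^\\W%s]+" % (CJK, CJK))

def split_mixed(text):
    """Split per character in the CJK ranges, per word elsewhere."""
    return [tok.lower() for tok in TOKEN.findall(text)]
```

With this, split_mixed("Zope 佛教 search") gives the word "zope", the two
single characters "佛" and "教", and the word "search".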

It seems to me that this is the approach generally used on the Web by Asian
language search engines.

To accommodate this, there have to be some changes to the way searches are
done as well: On most search engines, giving a few search terms separated by
whitespace means ANDing them for the search, which is fine. If this is not
desired, however, most search engines allow the user to use quotes to
indicate that the terms should be treated as a phrase. Unfortunately, Zope
does not support this yet. I think it is highly desirable!!!
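As a rough illustration of the query side, the following Python sketch
separates quoted phrases from plain terms. The function name parse_query
and the returned pair are hypothetical; nothing like this is claimed to
exist in ZCatalog.

```python
import re

# Either a double-quoted phrase or a run of non-whitespace characters.
QUERY_PART = re.compile(r'"([^"]*)"|(\S+)')

def parse_query(query):
    """Return (terms, phrases): plain terms are meant to be ANDed,
    quoted parts are meant to be matched as phrases."""
    terms, phrases = [], []
    for phrase, word in QUERY_PART.findall(query):
        if phrase:
            phrases.append(phrase)
        elif word:
            terms.append(word)
    return terms, phrases
```

For the query zope "full text" search, this yields the terms zope and
search plus the phrase "full text".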

If ZCatalog supported this type of search, it could be used for Asian
languages: a search for two or more characters would return the documents
in which those characters occur in sequence.
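A phrase search over single-character tokens could look roughly like the
Python sketch below. It assumes an index that records the position of every
token in every document; build_index and phrase_search are illustrative
names, not ZCatalog code.

```python
from collections import defaultdict

def build_index(docs):
    """Map each token to {doc_id: [positions]}.
    docs maps a document id to its list of tokens."""
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for pos, tok in enumerate(tokens):
            index[tok].setdefault(doc_id, []).append(pos)
    return index

def phrase_search(index, phrase_tokens):
    """Return ids of documents where the tokens occur consecutively."""
    if not phrase_tokens:
        return set()
    hits = set()
    first = index.get(phrase_tokens[0], {})
    for doc_id, starts in first.items():
        for start in starts:
            # Token i of the phrase must sit at position start + i.
            if all(start + i in index.get(tok, {}).get(doc_id, ())
                   for i, tok in enumerate(phrase_tokens)):
                hits.add(doc_id)
                break
    return hits
```

For example, with document 1 split into the characters of "佛教研究" and
document 2 into those of "教佛", a search for the sequence "佛", "教"
matches only document 1, even though both documents contain both
characters.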

Does this make any sense?

All the best,

Christian