[Zope] Re: DirtyWordsFilter

Erik Enge erik@esol.no
18 Feb 2001 18:24:47 +0100


[jasonc@bigfoot.com]

| how do I use ZCatalog to store all the words I want filtered and how
| would I run this script on a string from a form?

That depends on how you want it to interact with your application, I
guess.  But, as an easy way out, you could run the dirtyword-filter
before you add the text to your objects.

If you want to store all the words in a ZCatalog, you could just index a
bunch of objects (one per bad word, with meta_type set to 'bad_word')
and search for them (untested!):

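import string

aString = 'lots and lots of angry-customer-bad-words'

# Collect the bad words once, not once per input word.
# Assumption: each bad-word object's id is the word itself.
# In a Python Script the enclosing object is bound to 'context', not 'self'.
bad_words = []
for catalog_brain in context.Catalog.searchResults(meta_type='bad_word'):
        bad_word = context.Catalog.getobject(catalog_brain['data_record_id_'])
        bad_words.append(bad_word.getId())

for word in string.split(aString, ' '):
        if word in bad_words:
                aString = string.replace(aString, word, '*bleep!*')
return aString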


Add this as a Python Script in your Zope instance, make sure it can
reach the ZCatalog (here named Catalog) by acquisition or directly,
and call this script for every string that will be added to your
objects.

There are tons of caveats with this model.  For example, it doesn't
catch words spelled out like 'b a d  w o r d'.  You could try to do
something intelligent about that.  Hm.  Someone has probably done this
before you, in Perl or something; look at their algorithm.
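
One possible (and fairly naive) way to handle the spelled-out case is to
glue runs of single letters back together before filtering.  This is
only a sketch in plain Python, outside Zope; the function name
collapse_spaced_letters and the "three or more letters" threshold are my
own assumptions, not anything Zope provides.

```python
import re

# Three or more single letters, each separated by exactly one space.
_SPACED_RUN = re.compile(r'\b(?:[A-Za-z] ){2,}[A-Za-z]\b')

def collapse_spaced_letters(text):
    """Join runs of single spaced-out letters, e.g. 'b a d' -> 'bad'."""
    return _SPACED_RUN.sub(lambda m: m.group(0).replace(' ', ''), text)

print(collapse_spaced_letters('b a d  w o r d'))
```

Run the result through the word filter afterwards.  Note that this will
also glue together innocent text that happens to be a row of
one-letter words, so it trades one kind of mistake for another.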

Another thing is that it is highly unoptimized, and could be slow
(relatively speaking).  If you want to use the stop-word list instead:
lib/python/SearchIndex/Lexicon.py, line 128 (in Zope 2.2.1) has a
method called set_stop_syn(), which might be what you want to use,
although I wouldn't know how to.  On closer inspection I see that a
stop_words tuple is defined at the bottom of that file, around line
191; a really, really ugly way to solve this would be to add your
words there.
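
On the speed point: if the word list can be loaded once and kept around,
each word in the string only needs a single lookup.  A minimal sketch in
plain Python, outside Zope and not tested against a real catalog; the
function bleep and its parameters are hypothetical names, and the
lower-casing is my own addition (the script above compares words
exactly).

```python
def bleep(text, bad_words, replacement='*bleep!*'):
    """Replace every whole word of text that appears in bad_words."""
    # A set gives constant-time membership tests.
    bad = set(word.lower() for word in bad_words)
    cleaned = []
    for word in text.split(' '):
        if word.lower() in bad:
            cleaned.append(replacement)
        else:
            cleaned.append(word)
    return ' '.join(cleaned)

print(bleep('you darn fool', ['darn']))
```

Because it rebuilds the string word by word, this only touches whole
words, so a bad word sitting inside a longer word is left alone.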

Also, to be really sophisticated (albeit with a larger chance of also
bleeping some words that shouldn't be bleeped, i.e. words that weren't
bad words), you could use the soundex Python module.
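
To give an idea of what matching by sound looks like, here is a small
pure-Python sketch of the usual American Soundex rules.  It is not the
soundex module itself and may differ from it in details; the names
CODES and soundex below are just for illustration.

```python
# Consonant groups that share a Soundex digit; vowels, h, w and y have none.
CODES = {}
for letters, digit in (('bfpv', '1'), ('cgjkqsxz', '2'), ('dt', '3'),
                       ('l', '4'), ('mn', '5'), ('r', '6')):
    for letter in letters:
        CODES[letter] = digit

def soundex(word):
    """Return a four-character code: first letter plus three digits."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ''
    result = letters[0].upper()
    previous = CODES.get(letters[0], '')
    for c in letters[1:]:
        code = CODES.get(c, '')
        # Skip letters without a digit and repeats of the previous digit.
        if code and code != previous:
            result += code
        # h and w do not separate two letters with the same digit.
        if c not in 'hw':
            previous = code
    return (result + '000')[:4]

for word in ('darn', 'dern', 'drain'):
    print(word, soundex(word))
```

All three words get the same code, which shows both sides: 'dern' is
caught as a respelling of 'darn', but so is the harmless 'drain'.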

If you're doing this to remove stuff like the SCRIPT tag, REQUEST
calls and that kind of thing, I wouldn't recommend using soundex,
obviously.

Hope this helps!