[Zope-CMF] ZCSearchPatch

Eric Dunn endunn@rocketmail.com
Wed, 14 May 2003 07:36:29 -0700 (PDT)


ZCatalog issue:
I have code that strips out HTML tags so that the ZCatalog
does not pick up the HTML markup when cataloging.
It works great... almost too well.
Our users are only copy-n-paste managers.
I found that stripping the "&nbsp;" entity (the HTML non-breaking
space) makes the catalog concatenate text, i.e.
1234&nbsp;1234&nbsp;1234&nbsp;1234 becomes '1234123412341234' in the
catalog.

Question: How can I tell the SearchPatch.py file to
leave the &nbsp; entity alone or treat it as a space?


import re
from SearchIndex.UnTextIndex import UnTextIndex
from string import find

# HTML regex to substitute tags and entities
html_re = re.compile(r'<[^\s0-9].*?>|&[a-zA-Z]*?;', re.DOTALL)

class FauxDocument:
    """Proxy document to store munged source text"""
    def __init__(self, name, value):
        setattr(self, name, value)

# Get a reference to the original index_object method 
# so we can head patch it
original_index_object = UnTextIndex.index_object

def index_object(self, documentId, obj, threshold=None):
    # Sniff the object for our 'id'; the 'document source' of the
    # index is this attribute.  If it smells callable, call it.
    try:
        source = getattr(obj, self.id)
        if callable(source):
            source = str(source())
        else:
            source = str(source)
    except (AttributeError, TypeError):
        return 0
        
    if find(source, '<') != -1:
        # Strip HTML tags and comments from source
        source = html_re.sub('', source)
        # Create faux document with stripped source content
        obj = FauxDocument(self.id, source)
        
    # Call original index method
    return original_index_object(self, documentId, obj, threshold)

# Patch UnTextIndex class
UnTextIndex.index_object = index_object
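
One possible direction, offered only as a sketch: substitute a single
space for each matched tag or entity instead of the empty string, then
collapse the resulting runs of whitespace. The helper name strip_html
is hypothetical and not part of SearchPatch.py; the regex is the same
html_re as above. It runs standalone, outside Zope:

```python
import re

# Same pattern as in the patch: HTML tags and named entities.
html_re = re.compile(r'<[^\s0-9].*?>|&[a-zA-Z]*?;', re.DOTALL)


def strip_html(source):
    """Hypothetical helper: swap each tag or entity for a space,
    then collapse runs of whitespace into single spaces."""
    text = html_re.sub(' ', source)
    return ' '.join(text.split())


# The entity now separates the numbers instead of gluing them together.
print(strip_html('<p>1234&nbsp;1234&nbsp;1234&nbsp;1234</p>'))
```

In the patch itself this would presumably amount to changing
html_re.sub('', source) to html_re.sub(' ', source). Two caveats:
inline tags inside a word (foo<b>bar</b>) would then split the word in
two, and the find(source, '<') guard means text containing entities
but no tags is never stripped at all.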


=====
Eric N. Dunn
other email: endunn@aol.com
