[Zope-CMF] Re: ZCSearchPatch

Wed, 14 May 2003 10:47:09 -0400

Actually, this should be very easy to fix. See the inline comment below:

On Wednesday 14 May 2003 10:36 am, Eric Dunn wrote:
> ZCatalog issue:
> Have code to strip out html tags so that the ZCatalog
> does not pick up the html code when cataloging.
> Works great... almost too good.
> Our users are only copy-n-paste managers.
> I found that stripping the "&nbsp;" (html space tag)
> makes the catalog concatenate text... i.e.
> 1234 1234 1234 1234 becomes '1234123412341234' in the
> catalog.
>
> Question: How can I tell the SearchPatch.py file to
> ignore the space tag or treat it as a space?
>
>
> import re
> from SearchIndex.UnTextIndex import UnTextIndex
> from string import find
>
> # HTML regex to substitute tags and entities
> html_re = re.compile(r'<[^\s0-9].*?>|&[a-zA-Z]*?;', re.DOTALL)
>
> class FauxDocument:
>     """Proxy document to store munged source text"""
>     def __init__(self, name, value):
>         setattr(self, name, value)
>
> # Get a reference to the original index_object method
> # so we can head patch it
> original_index_object = UnTextIndex.index_object
>
> def index_object(self, documentId, obj, threshold=None):
>     # sniff the object for our 'id', the 'document source' of the
>     # index is this attribute.  If it smells callable, call it.
>     try:
>         source = getattr(obj, self.id)
>         if callable(source):
>             source = str(source())
>         else:
>             source = str(source)
>     except (AttributeError, TypeError):
>         return 0
>
>     if find(source, '<') != -1:
>         # Strip HTML tags and comments from source
>         source = html_re.sub('', source)

Change the above line to:

         source = html_re.sub(' ', source)

(Insert a space between the single quotes)

>         # Create faux document with stripped source content
>         obj = FauxDocument(self.id, source)
>
>     # Call original index method
>     return original_index_object(self, documentId, obj, threshold)
>
> # Patch UnTextIndex class
> UnTextIndex.index_object = index_object
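
If you want to see the difference outside of Zope, here is a small
standalone sketch using only the regex from the patch. The sample string
is made up for illustration, and nothing here touches the catalog itself:

```python
import re

# Same pattern as in the quoted SearchPatch.py
html_re = re.compile(r'<[^\s0-9].*?>|&[a-zA-Z]*?;', re.DOTALL)

# Hypothetical pasted content; not taken from a real document
sample = '<p>1234&nbsp;1234&nbsp;1234&nbsp;1234</p>'

# Replacing matches with the empty string glues the words together
joined = html_re.sub('', sample)

# Replacing matches with a single space keeps the word boundaries
spaced = html_re.sub(' ', sample)

print(repr(joined))    # '1234123412341234'
print(spaced.split())  # ['1234', '1234', '1234', '1234']
```

The extra leading and trailing spaces left by the second form should not
matter, assuming the text index splits words on whitespace.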

Hope that helps,

-Casey