[Zope] Strip all HTML

ken at practical.org ken at practical.org
Thu Aug 7 15:22:26 EDT 2003


Chris Withers wrote:
>Are there any other tags where the content should be removed?

AFAICT, the HTML elements which need to be removed together with their content are: style, script, noscript and noframes. At least those are the most common non-proprietary ones.

My strategy was to transform each opening tag into '<!--' and each closing tag into '-->', and then get rid of everything matching '<!--.*?-->', but there must be a cleverer way.
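
For anyone who wants to try the comment trick, here is a minimal sketch of it in Python, not a tested fix for any existing module. The function name strip_html and the DROP_WITH_CONTENT tuple are made up for illustration, and the last step (dropping every remaining tag) is my assumption about what "strip all HTML" should mean here.

```python
import re

# Elements whose content should disappear along with the tags
# (the list discussed in this thread).
DROP_WITH_CONTENT = ("style", "script", "noscript", "noframes")


def strip_html(html):
    """Hypothetical helper: remove markup and markup-related content."""
    names = "|".join(DROP_WITH_CONTENT)
    # Step 1: turn each opening tag (with any attributes) into '<!--'.
    html = re.sub(r"(?is)<(?:%s)\b[^>]*>" % names, "<!--", html)
    # Step 2: turn each closing tag into '-->'.
    html = re.sub(r"(?i)</(?:%s)\s*>" % names, "-->", html)
    # Step 3: get rid of every comment, including the ones just created.
    html = re.sub(r"(?s)<!--.*?-->", "", html)
    # Step 4 (assumption): drop whatever tags are left, keeping their text.
    return re.sub(r"(?s)<[^>]*>", "", html)


sample = ('<p>Hi</p><script type="text/javascript">var a = 1;</script>'
          '<style>p {color: red}</style> there')
print(strip_html(sample))
```

One known weak spot: a script body that is itself wrapped in '<!-- ... -->' ends up as a comment inside a comment, and the non-greedy match stops at the first '-->', so a stray '-->' is left behind in the output.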

I would love to have a fix for Dieter's CatalogSupport.py, since that module was intended for my first use case: to prevent indexing of irrelevant markup. It is already used by the DocumentLibrary product.

My other use case, the display of a text-only version of a web page, also requires removal of all markup and markup-related content.
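
For the text-only case, a parser-based approach may be sturdier than regular expressions. The sketch below uses the HTMLParser class from the standard library of current Python (html.parser); it is not what any of the existing conversion modules do, and the names TextOnly and text_only are hypothetical. It assumes that text inside style, script, noscript and noframes should be dropped and that all other text should be kept.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Hypothetical extractor: keep text, drop tags and skipped elements."""

    # Elements removed together with their content (list from this thread).
    SKIP = {"style", "script", "noscript", "noframes"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []      # text chunks that survive
        self.skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Comments never reach this method, so they are dropped as well.
        if not self.skip_depth:
            self.parts.append(data)


def text_only(html):
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    # Join chunks with a space and collapse runs of whitespace.
    return " ".join(" ".join(parser.parts).split())


page = ("<h1>Title</h1><script>var a = 1;</script>"
        "<noscript><p>Enable JS</p></noscript><p>Body &amp; more</p>")
print(text_only(page))
```

Joining the chunks with a space keeps block-level text apart, at the cost of splitting a word that is interrupted by inline markup (for example 'foo<b>bar</b>' comes out as 'foo bar').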

Is there a reason for any of the HTML conversion modules *not* to incorporate this addition? I am just surprised that no one has reported it as a problem. Thanks to those who are contributing to this thread!

Ken




