[Zope] Strip all HTML

ken at practical.org ken@practical.org
Tue Aug 5 13:26:54 EDT 2003


Hi all,

I want to display a text-only version of a web page captured with the DocumentLibrary product (no longer supported).

This product uses the 'Catalog Support' HTML converter available here:

http://www.dieter.handshake.de/pyprojects/zope/CatalogSupport.html

However, this converter, like the others I have tried (Strip-o-Gram and an external method based on striphtml.py), seems unable to remove the content of <style></style> or <script></script> tags. As a result, I get plenty of unwanted hits when searching for 'children', 'window' or 'background'...

Has anyone else confronted this problem?

I have also made feeble attempts such as the following Script (Python), without success:

import re

# Intended order: style blocks, script blocks, comments, then any remaining tags.
text = re.sub('<STYLE.*?>.*?</STYLE>', '', data)
text = re.sub('<style.*?>.*?</style>', '', text)
text = re.sub('<script.*?>.*?</script>', '', text)
text = re.sub('<!--.*?-->', '', text)
text = re.sub('<.*?>', ' ', text)
return text
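
One possible reason the patterns above miss their targets, assuming the captured pages contain multi-line or mixed-case <style> and <script> blocks: by default '.' does not match a newline, and the patterns are case-sensitive. The sketch below turns on both behaviours with the inline flags (?is). The function name strip_html and the sample page are hypothetical, and whether 're' can be imported from a Script (Python) depends on the Zope security configuration, so this may need to live in an External Method instead.

```python
import re

# Hypothetical helper, not part of any Zope product.
# (?i) = ignore case, (?s) = let '.' match newlines as well.
_BLOCKS = [
    re.compile(r'(?is)<%s\b[^>]*>.*?</%s\s*>' % (tag, tag))
    for tag in ('script', 'style')
]
_COMMENTS = re.compile(r'(?s)<!--.*?-->')
_TAGS = re.compile(r'(?s)<[^>]*>')
_SPACES = re.compile(r'\s+')


def strip_html(data):
    """Return data with script/style blocks, comments and tags removed."""
    text = data
    # Drop whole blocks first, so their contents never reach the output.
    for pattern in _BLOCKS:
        text = pattern.sub(' ', text)
    text = _COMMENTS.sub(' ', text)
    # Replace remaining tags with a space so adjacent words stay separated.
    text = _TAGS.sub(' ', text)
    # Collapse the runs of whitespace left behind by the substitutions.
    return _SPACES.sub(' ', text).strip()


sample = (
    '<html><head><STYLE type="text/css">\n'
    'body { background: red }\n'
    '</STYLE>\n<script>\nwindow.open()\n</script></head>'
    '<body><!-- note\n --><p>Hello <b>children</b></p></body></html>'
)
print(strip_html(sample))
```

Regular expressions remain fragile on malformed markup (an unclosed <script> tag, for example, would leave its content in place), so this is only a starting point.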

I sure would appreciate some help on this...

Thanks,

Ken