[Zope] Malicious HTML

Duncan Booth duncan@rcp.co.uk
Wed, 8 Mar 2000 17:31:59 +0000


> Graham Chiu wrote:
> > 
> > I wish to allow users to enter comments into my database which are then
> > viewable thru the browser.
> > 
> > Is there a Zope function that I can pass their text thru to remove all
> > HTML?
> 
> Nope.  There may be some standard python library module that I don't
> know about, however.  Otherwise, you will have to write your own.
> 
> -Michel
> 
Try htmllib.

The following bit of python will strip all formatting from some HTML. It 
replaces all anchors with footnote style references and images with 
their alt text. If you want something a bit fancier you could add 
methods to the MyParser class to pass through particular tags (see 
the commented out methods as an example). It shouldn't be too hard 
to wrap something like this up in an external method (as presented it 
is a complete runnable program that retrieves a URL and displays 
the text).

--------- File strip.py --------------
# Strip all HTML formatting.
import sys,formatter,StringIO,htmllib,string
from urllib import urlretrieve,urlcleanup

class MyParser(htmllib.HTMLParser):
    def __init__(self):
        self.bodytext = StringIO.StringIO()
        writer = formatter.DumbWriter(self.bodytext)
        htmllib.HTMLParser.__init__(self,
            formatter.AbstractFormatter(writer))

    def gettext(self):
        return self.bodytext.getvalue()

# Uncomment these to pass through bold tags.
#    def start_b(self, attrs):
#       self.formatter.add_flowing_data('<b>')
#
#    def end_b(self):
#        self.formatter.add_flowing_data('</b>')

def GetPage(url):
    try:
        fn, h = urlretrieve(url)
        text = open(fn, "r").read()
    finally:
        urlcleanup()
    return text

if __name__=='__main__':
    data = GetPage(sys.argv[1])
    p = MyParser()
    p.feed(data)
    p.close()
    text = string.replace(p.gettext(), '\xa0', ' ')
    print text
    anchors = p.anchorlist
    for i in range(len(anchors)):
        print "[%d]: %s" % (i+1, anchors[i])
--------- end of strip.py --------------

-- 
Duncan Booth                                             duncan@dales.rmplc.co.uk
int month(char *p){return(124864/((p[0]+p[1]-p[2]&0x1f)+1)%12)["\5\x8\3"
"\6\7\xb\1\x9\xa\2\0\4"];} // Who said my code was obscure?
http://dales.rmplc.co.uk/Duncan