[ZPT] OT (and probably a bit long ;-) HTML Filtering

Chris Withers chrisw@nipltd.com
Wed, 16 May 2001 12:25:37 +0100


Hi :-)

I'm now onto my fourth or fifth mailing list here but I think this is finally
the right list, even if this may seem a bit off topic :-)

I have a Python module called Strip-O-Gram
(http://www.zope.org/Members/chrisw/StripOGram) which is supposed to take dodgy
HTML and filter it, closing any open tags, removing JavaScript, etc.
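
To make the idea concrete, here is a minimal sketch of that kind of whitelist filtering. It is not Strip-O-Gram's actual code: it assumes Python 3's html.parser rather than sgmllib, and the names WhitelistFilter, filter_html, VOID_TAGS and DROP_CONTENT are made up for illustration. It keeps only whitelisted tags, discards all attributes, drops the contents of script and style elements, and closes anything left open. How badly malformed markup gets tokenised is left to the underlying parser, so this says nothing about inputs like the one reported below.

```python
from html import escape
from html.parser import HTMLParser

# Elements that never take a closing tag (a small illustrative subset).
VOID_TAGS = {"br", "hr", "img"}
# Elements whose text content is dropped along with the tag.
DROP_CONTENT = {"script", "style"}


class WhitelistFilter(HTMLParser):
    """Hypothetical filter: keep only whitelisted tags, drop all attributes."""

    def __init__(self, valid_tags):
        super().__init__(convert_charrefs=True)
        self.valid_tags = set(valid_tags)
        self.out = []          # output fragments
        self.open_tags = []    # whitelisted tags still awaiting a close
        self.dropping = False  # True while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.dropping = True
        elif tag in self.valid_tags:
            self.out.append("<%s>" % tag)  # attributes are discarded
            if tag not in VOID_TAGS:
                self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            self.dropping = False
        elif tag in self.open_tags:
            # Close inner tags first so the output stays properly nested.
            while True:
                top = self.open_tags.pop()
                self.out.append("</%s>" % top)
                if top == tag:
                    break

    def handle_data(self, data):
        if not self.dropping:
            self.out.append(escape(data, quote=False))

    def result(self):
        self.close()  # flush any buffered text
        while self.open_tags:  # close whatever the input left open
            self.out.append("</%s>" % self.open_tags.pop())
        return "".join(self.out)


def filter_html(text, valid_tags):
    parser = WhitelistFilter(valid_tags)
    parser.feed(text)
    return parser.result()
```

With this sketch, filter_html('<b>bold', ['b']) returns '<b>bold</b>', and a tag that is not in valid_tags is removed while its text is kept.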

This was originally written for Squishdot (http://www.squishdot.org) but other
people seem to be finding it useful now so I'm trying to make it work like it
should :-)

Anyway, J M Cerqueira Esteves <jmce@artenumerica.com> found some problems with
it and reported them on the Zope list:

> >   html2safehtml ('Roses <b>are</B> red,<br/<blink>QUACK<//blink> violets '
> >                  '<i>are</i> blue',
> >                  valid_tags=['b','i','br'])
> >
> > successfully smuggling a <blink>...</blink> inside the result:
> >
> >        'Roses <b>are</b> red,<br><blink>QUACK</blink> violets <i>are</i> blue'
> >
> > (Notice that the closing '</i>' is now OK again, and that I had to use
> > '<//blink>' in order to get '</blink>'.)

The problem here seems to be with the parser in sgmllib.py:

> When parsing the following HTML: 
> 
> 'Roses <b>are</B> red,<br/>violets <i>are</i> blue' 
> 
> ...with the following class: 
> 
> import sgmllib
>
> class HTML2SafeHTML(sgmllib.SGMLParser):
>     # Minimal probe: print every event the parser reports.
>
>     def handle_data(self, data):
>         print "***data***"
>         print data
>
>     def unknown_starttag(self, tag, attrs):
>         print "***start**"
>         print tag
>         print (attrs)
>
>     def unknown_endtag(self, tag):
>         print "***end**"
>         print tag
> 
> I get the following output, which isn't right :-S 
> 
> ***data*** 
> Roses 
> ***start** 
> b 
> [] 
> ***data*** 
> are 
> ***end** 
> b 
> ***data*** 
> red, 
> ***start** 
> br 
> [] 
> ***data*** 
> >violets <i>are< 
> ***end** 
> br 
> ***data*** 
> i> blue 

(sorry for that being so long...)

Anyway, Ethan pointed out that you guys have probably got quite good at this
sort of thing while developing ZPT...

So, how should I be approaching this problem?

Many thanks for any help,

Chris