[Zope] A new python object which analyse HTML files and...

Andrew Wilcox circle@gwi.net
Mon, 29 May 2000 12:40:30 -0400


At 06:17 PM 5/29/00 +0100, Frederic QUIN wrote:
>Hello everybody,
>
>I would like to create a Python object which:
>* analyses traditional HTML files
>* indexes "IMG" tags and "A" tags
>* replaces "IMG" tags with appropriate DTML tags
>* replaces "A" tags with appropriate DTML tags
>* creates all the resulting objects
>
>Has anyone ever done that? Anyway, if someone with more experience than me
>has some advice, I'll take it...
>
>Thanks
>Frederic

You might try sgmllib, an SGML parser from the Python library.  (There's
also an HTML parser called htmllib derived from sgmllib, but it tries to
understand all the common HTML tags such as H1, etc.  If you just want to
pass through all tags except a couple of specific ones, I found it easier to
use sgmllib.)

The sgmllib library parses the input but doesn't produce any output, so the
hooks you'd add would be ones to recreate your HTML files from the results
of the parsing.  Something like the following:

from sgmllib import SGMLParser

class MyParser(SGMLParser):

    def __init__(self):
        SGMLParser.__init__(self)
        self._result = ''
        
    def _write(self, data):
        self._result = self._result + data
        
    def getResult(self):
        return self._result

    def unknown_starttag(self, tag, attributes):
        r = '<' + tag

        for attribute in attributes:
            (name, value) = attribute
            r = r + ' ' + name + '="' + value + '"'

        r = r + '>'
        self._write(r)

    def unknown_endtag(self, tag):
        self._write('</' + tag + '>')

    def handle_data(self, data):
        self._write(data)
        
    def handle_charref(self, ref):
        self._write('&#' + ref + ';')

    def handle_entityref(self, ref):
        self._write('&' + ref + ';')

    def handle_comment(self, comment):
        self._write('<!--' + comment + '-->')


Then you can add specific handlers to do special things with particular tags:

    def do_img(self, attributes):
        # ...write out special DTML code here...
        pass
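
For anyone trying this on a current Python: sgmllib was removed in Python 3,
so the sketch below swaps in html.parser.HTMLParser from the standard library
and applies the same pass-through idea. It is only a rough illustration, not
a tested Zope product. The names PassThroughParser, get_result, images and
links are made up for the example, and the `<dtml-var name="...">` output
for IMG tags is a placeholder assumption; the DTML you actually want will
depend on how the images are stored in your Zope folder.

```python
from html import escape
from html.parser import HTMLParser


class PassThroughParser(HTMLParser):
    """Re-emit parsed HTML unchanged, except for IMG tags (hypothetical sketch)."""

    def __init__(self):
        # convert_charrefs=False routes &amp; and &#233; to the
        # handle_entityref/handle_charref hooks so they can be copied through.
        super().__init__(convert_charrefs=False)
        self._parts = []
        self.images = []   # src values of the IMG tags seen
        self.links = []    # href values of the A tags seen

    def _write(self, data):
        self._parts.append(data)

    def get_result(self):
        return ''.join(self._parts)

    def _format_attrs(self, attrs):
        out = []
        for name, value in attrs:
            if value is None:
                out.append(' ' + name)  # bare attribute with no value
            else:
                # HTMLParser unescapes attribute values, so escape them again.
                out.append(' %s="%s"' % (name, escape(value, quote=True)))
        return ''.join(out)

    def _write_img(self, attrs):
        src = dict(attrs).get('src') or ''
        self.images.append(src)
        # Assumption: the image is a Zope object whose id equals the src value.
        self._write('<dtml-var name="%s">' % escape(src, quote=True))

    def handle_starttag(self, tag, attrs):
        if tag == 'img':
            self._write_img(attrs)
            return
        if tag == 'a':
            href = dict(attrs).get('href')
            if href is not None:
                self.links.append(href)
        self._write('<' + tag + self._format_attrs(attrs) + '>')

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <br/> or <img ... />.
        if tag == 'img':
            self._write_img(attrs)
        else:
            self._write('<' + tag + self._format_attrs(attrs) + ' />')

    def handle_endtag(self, tag):
        if tag != 'img':   # the IMG tag was replaced, so drop any stray </img>
            self._write('</' + tag + '>')

    def handle_data(self, data):
        self._write(data)

    def handle_charref(self, name):
        self._write('&#' + name + ';')

    def handle_entityref(self, name):
        self._write('&' + name + ';')

    def handle_comment(self, data):
        self._write('<!--' + data + '-->')

    def handle_decl(self, decl):
        self._write('<!' + decl + '>')


if __name__ == '__main__':
    parser = PassThroughParser()
    parser.feed('<p>See <a href="about.html">about</a> <img src="logo.gif"></p>')
    parser.close()
    print(parser.get_result())
    print(parser.images, parser.links)
```

The images and links lists cover the "index" step from the original question.
Creating the resulting Zope objects is left out because it depends on your
site layout.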