[Zope] DTML, Zope and Regex

Duncan Booth duncan@rcp.co.uk
Fri, 12 Jul 2002 14:12:09 +0100


[My first attempt to send this failed to post it to the list]

Kirk Lowery <klowery@wts.edu> wrote in news:3D2B0727.1030401@wts.edu:

> Ben, would you mind expanding on this? What dangers are there? Regexes
> are so handy, and if I turn them on I'd like to know what the risks
> are... 

Regular expressions can easily lead to unbounded CPU usage. Provided you 
don't let any untrusted users edit anything that can call regular 
expressions, and provided that you fully understand the issues involved, 
then you may be able to use regular expressions.

If the regular expression comes from an untrusted source, or if you don't 
think it through carefully enough, then you may find that you can 
effectively block one thread of your Zope server. For the sort of use you 
proposed, this is unlikely to happen. The problem is that it is almost 
impossible to work out in advance which regular expressions will cause 
problems.

Try this at a Python interactive prompt:
>>> import re
>>> s = 'a'+'b'*1024+'c'
>>> re.match('a.*.*b.*c',s)
>>> re.match('a.*.*b.*d',s)

The first match will complete instantly, but the second one will probably 
take a few seconds to fail.

(My memory may be a bit rusty on this next bit)

There are traditionally two ways to match regular expressions. 
Deterministic Finite-state Automata (DFA), and Non-deterministic Finite-
state Automata (NFA). Most software uses NFA matching, which requires a 
small amount of memory to compile the regex, but requires a potentially 
unbounded time to do the match. DFA matching can match a regex in linear 
time (i.e. linear on the search string), but may require a very large 
amount of memory (and CPU) to compile the regular expression. If someone 
produced a DFA engine for Python which had the option of limiting the size 
of the compiled expression (and throwing an exception for any that were too 
complex), then you could probably expose this safely in Zope. There may be 
other problems with DFAs though, e.g. I'm not sure if you can match groups 
easily with a DFA.

DFA matches are best used when you have a fixed pattern so that the 
overhead for compilation is only hit once (e.g. parsers). With a DFA engine 
you might implement a persistent regex object in the ZODB. Not all regexes 
would be compilable into the object, but once you managed to compile one 
you would know it would work.

-- 
Duncan Booth                                             duncan@rcp.co.uk
int month(char *p){return(124864/((p[0]+p[1]-p[2]&0x1f)+1)%12)["\5\x8\3"
"\6\7\xb\1\x9\xa\2\0\4"];} // Who said my code was obscure?

--

 

Duncan Booth                                             duncan@rcp.co.uk
int month(char *p){return(124864/((p[0]+p[1]-p[2]&0x1f)+1)%12)["\5\x8\3"
"\6\7\xb\1\x9\xa\2\0\4"];} // Who said my code was obscure?