[Zope-CVS] CVS: Products/ZCTextIndex - QueryParser.py:1.1.2.18

Tim Peters tim.one@comcast.net
Mon, 13 May 2002 17:39:59 -0400


Update of /cvs-repository/Products/ZCTextIndex
In directory cvs.zope.org:/tmp/cvs-serv29119

Modified Files:
      Tag: TextIndexDS9-branch
	QueryParser.py 
Log Message:
Redid the module docstring to make the grammar clearer (I hope).
Redid the tokenization regexp to make the role of a hyphen clearer.

One way we still differ from Google:  Google treats an unclosed quoted
string as if it were closed; e.g., try these searches there:

    cool back pain
    cool "back pain"
    cool "back pain

The last two act the same, and are very different from the first.

We effectively ignore the lone double quote in the last case, treating
the query as if it were the first one (findall can't find a match
starting at the unclosed quote, so it just skips over the quote).
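
The skipping behavior can be checked directly.  What follows is a minimal
sketch, assuming the tokenizer regexp from revision 1.1.2.18 is applied with
findall; the tokenize() helper is a hypothetical name used only for
illustration and is not part of QueryParser.py.

```python
import re

# Tokenizer regexp as of revision 1.1.2.18 (copied from the diff below).
_tokenizer_regex = re.compile(r"""
    # a paren
    [()]
    # or an optional hyphen
|   -?
    # followed by
    (?:
        # a string
        " [^"]* "
        # or a non-empty stretch w/o whitespace, parens or double quotes
    |    [^()\s"]+
    )
""", re.VERBOSE)

def tokenize(query):
    """Hypothetical helper: return the raw tokens findall extracts."""
    return _tokenizer_regex.findall(query)

# Print the token list for each of the three searches discussed above.
for query in ('cool back pain', 'cool "back pain"', 'cool "back pain'):
    print(repr(query), '->', tokenize(query))
```

The first and third queries tokenize identically, because the scan resumes
one character past the unmatched quote.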


=== Products/ZCTextIndex/QueryParser.py 1.1.2.17 => 1.1.2.18 ===
 Term = '(' OrExpr ')' | ATOM+
 
-An ATOM is a string not containing whitespace or parentheses or double
-quotes, and not equal to one of the key words 'AND', 'OR', 'NOT'.  An
-ATOM can contain whitespace, parentheses and key words enclosed in
-double quotes.  The key words are recognized in any mixture of case.
+The key words (AND, OR, NOT) are recognized in any mixture of case.
+
+An ATOM is either:
+
++ A sequence of characters not containing whitespace or parentheses or
+  double quotes, and not equal to one of the key words 'AND', 'OR', 'NOT'; or
+
++ A non-empty string enclosed in double quotes.  The interior of the string
+  can contain whitespace, parentheses and key words.
+
+In addition, an ATOM may optionally be preceded by a hyphen, meaning that it
+must not be present.
+
 When multiple consecutive ATOMs are found at the leaf level, they are
 connected by an implied AND operator, and an unquoted leading hyphen
-is interpreted as a NOT operator.  When an ATOM contains multiple
-words (where a word is a string of letters, digits and underscore), it
-specifies a phrase search.
+is interpreted as a NOT operator.
 
 Summarizing the default operator rules:
 
@@ -39,7 +46,6 @@
 - words connected by punctuation implies phrase search, e.g. ``foo-bar''
 - a leading hyphen implies NOT, e.g. ``foo -bar''
 - these can be combined, e.g. ``foo -"foo bar"'' or ``foo -foo-bar''
-
 """
 
 import re
@@ -68,10 +74,15 @@
 _tokenizer_regex = re.compile(r"""
     # a paren
     [()]
-    # or a string in double quotes possibly preceded by a hyphen
-|   -? " [^"]* "
-    # or a non-empty string without whitespace, parens or double quotes
-|   [^()\s"]+
+    # or an optional hyphen
+|   -?
+    # followed by
+    (?:
+        # a string
+        " [^"]* "
+        # or a non-empty stretch w/o whitespace, parens or double quotes
+    |    [^()\s"]+
+    )
 """, re.VERBOSE)
 
 class QueryParser: