[Zope-Coders] Analysis: BTrees and Unicode and Python

Guido van Rossum guido@python.org
Fri, 19 Oct 2001 11:52:01 -0400


Good job, Andreas!

> After lots of debugging, here is an explanation for the behaviour we
> have seen in the unittest:
> 
> - The BTrees code calls PyObject_Compare() several times before the
>   comparison that failed (unicode vs. unicode)
> 
> - one of these earlier comparisons checks a Python string (containing
>   an accented character) against a unicode string and raises a
>   unicode exception (ASCII decoding error: ordinal not in range(128)).
>   I assume this is because the default encoding is ascii.

Note that this was a conscious design decision.  Not all the world
uses Latin-1, and many real-world programs and data use different
interpretations of 8-bit characters with the high bit set.  Assuming
Latin-1 when comparing to Unicode would be wrong.
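
To make the point concrete, here is a small sketch in present-day
Python 3 syntax (not the 2.1 code under discussion, where the decode
happens implicitly inside the comparison).  The byte 0xE9 and the two
8-bit codecs are just examples picked for illustration.

```python
# One byte with the high bit set; what it "means" depends on the encoding.
raw = b"\xe9"

# The ASCII codec refuses it: this is the "ordinal not in range(128)" error.
try:
    raw.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

# Two legitimate 8-bit interpretations of the same byte give different text.
as_latin1 = raw.decode("latin-1")
as_koi8r = raw.decode("koi8-r")

print(ascii_failed, as_latin1 != as_koi8r)
```

Guessing either interpretation silently would be wrong for some users,
which is why the comparison raises instead.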

> - the BTree code does not check for an exception after
>   PyObject_Compare(), so this error never got cleared

This should be fixed before proceeding.
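
The reason this matters: PyObject_Compare() returns -1 on error, which
is also a valid "less than" result, so the caller has to ask separately
whether an exception is pending.  The following is only a toy model of
that convention in Python; `state`, `compare`, `careless_find` and
`careful_find` are hypothetical names, not BTrees or interpreter code.

```python
# Toy model of a C-style error indicator (all names are hypothetical).
state = {"error": None}  # stands in for the interpreter's pending exception


def compare(a, b):
    """Return -1, 0 or 1; on failure record an error and return -1."""
    if type(a) is not type(b):
        state["error"] = "cannot compare %s with %s" % (
            type(a).__name__, type(b).__name__)
        return -1  # indistinguishable from a genuine "less than"
    return (a > b) - (a < b)


def careless_find(keys, target):
    """Never looks at the error slot, so a failure is left pending."""
    return [k for k in keys if compare(k, target) == 0]


def careful_find(keys, target):
    """Checks the error slot after every comparison and reports it."""
    found = []
    for k in keys:
        result = compare(k, target)
        if state["error"] is not None:
            message, state["error"] = state["error"], None
            raise TypeError(message)
        if result == 0:
            found.append(k)
    return found


keys = [b"\xe9", "abc"]  # a byte string mixed in with text keys

matches = careless_find(keys, "abc")
stale = state["error"]   # still set: nobody noticed the failed comparison
state["error"] = None

try:
    careful_find(keys, "abc")
    raised = False
except TypeError:
    raised = True
```

In the careless version the search appears to succeed while the error
stays pending, ready to surface in some later, unrelated operation.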

> - when trying to compare two identical unicode strings, Python
>   calls default_3way_compare() and runs into the following code:
> 
> 
> static int
> default_3way_compare(PyObject *v, PyObject *w)
> {
>     int c;
>     char *vname, *wname;
> 
>     if (v->ob_type == w->ob_type) {
>         /* When comparing these pointers, they must be cast to
>          * integer types (i.e. Py_uintptr_t, our spelling of C9X's
>          * uintptr_t).  ANSI specifies that pointer compares other
>          * than == and != to non-related structures are undefined.
>          */
>         Py_uintptr_t vv = (Py_uintptr_t)v;
>         Py_uintptr_t ww = (Py_uintptr_t)w;
>         puts("\t\t\tdefcmp 1");
>         return (vv < ww) ? -1 : (vv > ww) ? 1 : 0;
>     }
> 
>   This code returns -1 for the two identical unicode strings.
> 
> I am not sure if this code is able to compare two unicode strings.
> On the other hand, it is still strange that the unittest works when
> the same unicode string in the list of test data in the unittest is
> replaced with self.s, as described earlier.
> 
> Any ideas about that?

It is definitely a bug if comparison of two unicode strings ends up
calling default_3way_compare()!

This normally doesn't happen though -- the Unicode object's comparison
code is generally called.
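
For reference, a rough Python rendering of what the same-type branch
quoted above does.  This is a sketch only: `address_compare` is a
made-up name, and it relies on CPython's id() being the object's
address.

```python
def address_compare(v, w):
    """Mimic the quoted same-type branch: order objects by address."""
    vv, ww = id(v), id(w)  # in CPython, id() is the object's memory address
    return -1 if vv < ww else (1 if vv > ww else 0)


# Build two equal strings at run time so they are normally distinct objects.
a = "".join(["uni", "code"])
b = "".join(["uni", "code"])

by_value = (a > b) - (a < b)        # compares the contents
by_address = address_compare(a, b)  # compares where the objects live
```

Two equal strings held in different objects compare as unequal this
way, which matches the -1 reported above and is why this fallback must
never be reached for Unicode objects.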

I'd like to see what's on the stack when default_3way_compare is
called with two Unicode objects.

Which Python version is this?  2.1 or 2.1.1?

--Guido van Rossum (home page: http://www.python.org/~guido/)