[Zope-Coders] Analysis: BTrees and Unicode and Python

Guido van Rossum guido@python.org
Fri, 19 Oct 2001 12:43:56 -0400


> ----- Original Message -----
> From: "Guido van Rossum" <guido@python.org>
> To: "Andreas Jung" <andreas@zope.com>
> Cc: "Jim Fulton" <Jim@zope.com>; <zope-coders@zope.org>
> Sent: Friday, October 19, 2001 11:52
> Subject: Re: [Zope-Coders] Analysis: BTrees and Unicode and Python

(Can you please edit out these headers from your replies?  They are
only confusing, and not needed for the context.)

> > Note that this was a conscious design decision.  Not all the world
> > uses Latin-1, and many real-world programs and data use different
> > interpretations of 8-bit characters with the high bit set.  Assuming
> > Latin-1 when comparing to Unicode would be wrong.
> 
> I assume the exception is raised before calling the PyUnicode_Compare
> function. Otherwise silently ignoring this error condition is also not
> a solution so I agree that Python behaviour is reasonable :)

I'm not sure I understand your question.  PyUnicode_Compare() is
called when at least one of the arguments to a 3-way comparison is a
Unicode object.  When the other is not, PyUnicode_FromObject() will
attempt to convert it to Unicode, and if it's an 8-bit string
containing non-ASCII characters, that will raise an exception, and
PyUnicode_Compare() will return -1.  Then default_3way_compare()
calls PyErr_Occurred() which will return true; the exception is a
ValueError so it doesn't match TypeError, so default_3way_compare()
will return -2 to indicate an error, and the error will be propagated
all the way up to the caller of PyObject_Compare().
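
As a rough illustration of the convention described above (a return
value of -1 that may mean either "less than" or "error", a separate
error indicator, and -2 for a propagated error), here is a minimal
standalone sketch.  It does not use the real Python C API: the names
pending_error, coerce_is_ascii(), sketch_unicode_compare() and
sketch_3way_compare() are hypothetical stand-ins, and both strings are
assumed to be NUL-terminated C strings.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the interpreter's error indicator;
 * none of the names in this sketch are CPython APIs. */
enum err_kind { ERR_NONE, ERR_TYPE, ERR_VALUE };
static enum err_kind pending_error = ERR_NONE;

/* Stand-in for the coercion step: fails on any byte >= 0x80. */
static int coerce_is_ascii(const char *s)
{
    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if ((unsigned char)*s >= 0x80) {
            pending_error = ERR_VALUE;  /* models the ValueError */
            return 0;
        }
    }
    return 1;
}

/* Stand-in for the Unicode compare: -1 means "less" or "error",
 * so the caller has to check pending_error to tell them apart. */
static int sketch_unicode_compare(const char *u, const char *s)
{
    int c;
    if (!coerce_is_ascii(s))
        return -1;
    c = strcmp(u, s);
    return (c < 0) ? -1 : (c > 0) ? 1 : 0;
}

/* Stand-in for the caller: -2 signals a propagated error. */
int sketch_3way_compare(const char *u, const char *s)
{
    int c;
    pending_error = ERR_NONE;
    c = sketch_unicode_compare(u, s);
    if (c == -1 && pending_error != ERR_NONE) {
        if (pending_error != ERR_TYPE)
            return -2;              /* not a TypeError: propagate */
        pending_error = ERR_NONE;   /* TypeError fallback not modeled */
    }
    return c;
}
```

With this sketch, sketch_3way_compare("abc", "abd") yields an ordinary
-1, while a second argument containing a byte such as 0xE4 yields -2.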

> > I'd like to see what's on the stack when default_3way_compare is
> > called with two Unicode objects.
> 
> How can I determine that ?

I propose to change the code in default_3way_compare() as follows:

	if (v->ob_type == w->ob_type) {
		/* When comparing these pointers, they must be cast to
		 * integer types (i.e. Py_uintptr_t, our spelling of C9X's
		 * uintptr_t).  ANSI specifies that pointer compares other
		 * than == and != to non-related structures are undefined.
		 */
		Py_uintptr_t vv = (Py_uintptr_t)v;
		Py_uintptr_t ww = (Py_uintptr_t)w;
----->		if (PyUnicode_Check(v))
----->			abort();
		return (vv < ww) ? -1 : (vv > ww) ? 1 : 0;
	}

and then inspecting the stack trace with gdb.

If this abort() never happens, you need to look for a new theory. :-)
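
For reference, the same-type branch shown above orders objects purely
by address, never by content.  The following standalone sketch shows
that idea in isolation; it uses C99's uintptr_t instead of
Py_uintptr_t, and sketch_address_compare() is a hypothetical name, not
a function from the Python sources.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Orders two objects by address only; the contents are never
 * inspected.  The pointers are cast to an integer type first
 * because relational comparison of unrelated pointers is undefined. */
int sketch_address_compare(const void *v, const void *w)
{
    uintptr_t vv = (uintptr_t)v;
    uintptr_t ww = (uintptr_t)w;
    assert(v != NULL && w != NULL);
    return (vv < ww) ? -1 : (vv > ww) ? 1 : 0;
}
```

If two Unicode objects ever reached such a branch, two equal strings
stored at different addresses would compare as unequal, which is
presumably why the abort() check is worth trying.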

--Guido van Rossum (home page: http://www.python.org/~guido/)