[Zope3-Users] Re: Unicode for Stupid Americans (like me)?

Jeff Shell eucci.group at gmail.com
Thu Mar 1 15:20:06 EST 2007


On 3/1/07, Paul Winkler <pw_lists at slinkp.com> wrote:
> On Wed, Feb 28, 2007 at 09:08:03PM -0500, Gary Poster wrote:
> > It's been years since I dug into this, but I'm better than 90% sure
> > that the browser is expected to make its requests in the encoding of
> > the response (i.e., the one set by Content-Type).  It's been too long
> > for me to tell you if that's in a spec or if it is simply the de
> > facto rule, though I suspect the former.
>
> That almost makes sense, except that the first request precedes the
> first response :) I'll have to dig into this some more when I have
> time...

By first request do you mean first form-submission? You have to do a
request to get the form. When the server sends the form, the HTTP
response containing the form should have a content type.

If the form to be submitted has an accept-charset attribute explicitly
declared, that should become the value of the Accept-Charset header.
If that attribute is absent, it's supposed to be understood as a special
value, 'UNKNOWN', which means that the browser or other user-agent may
submit (I don't remember if the spec says MAY or SHOULD, but I know it
doesn't say MUST) the form data in the same character set as the form's
page.

I did a fair amount of spec reading and zope.publisher.http/browser
entrail reading yesterday, can you tell? :)

Anyways, without adding accept-charset to the form, this is what
Firefox sent in the Accept-Charset header of a form submission request::

    ISO-8859-1,utf-8;q=0.7,*;q=0.7

Zope turned that into::

    ['utf-8', 'iso-8859-1', '*']

Zope gives UTF-8 priority over everything. The Accept-Charset header,
if present on the request, is used to establish the response character
set unless explicitly stated otherwise (or the response isn't text).
So I guess if my Firefox is sending that same accept-charset header to
Zope on each request, it will get a UTF-8 response every time (again,
unless explicitly made otherwise). If it is supposed to submit POSTs
in the same character set that it received, then it should be sending
UTF-8 each time. Hunh.
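
To make the reordering above concrete, here is a minimal sketch of how an
Accept-Charset header could be turned into a preference list with UTF-8
pushed to the front. This is not zope.publisher's actual code: the
function name preferred_charsets and the tie-breaking by header position
are assumptions made for illustration, and it is written for Python 3.

```python
def preferred_charsets(header):
    """Return charset names from an Accept-Charset header, best first.

    Illustrative only: utf-8 is always sorted first, then entries are
    ordered by descending q value, then by their position in the header.
    """
    entries = []
    for position, item in enumerate(header.split(',')):
        name, _, params = item.strip().partition(';')
        name = name.strip().lower()
        if not name:
            continue
        quality = 1.0  # a missing q parameter means q=1
        for param in params.split(';'):
            key, _, value = param.strip().partition('=')
            if key.strip().lower() == 'q':
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0
        if quality <= 0:
            continue  # q=0 means "not acceptable"
        # False sorts before True, so utf-8 lands at the front.
        entries.append((name != 'utf-8', -quality, position, name))
    return [entry[-1] for entry in sorted(entries)]


print(preferred_charsets('ISO-8859-1,utf-8;q=0.7,*;q=0.7'))
# -> ['utf-8', 'iso-8859-1', '*']
```

With the Firefox header quoted above, this reproduces the list Zope
reported, which is at least consistent with the "UTF-8 first" reading.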

So if you had <form ... accept-charset="cp437">, then the browser
should send only cp437 in the Accept-Charset header and Zope should
only try to decode from that character set; and the succeeding
response should be encoded in cp437 as well. I think. That seems to be
the best I can figure out between the HTML 4.01 and HTTP 1.1 specs and
zope.publisher's http/browser request and response handlers. It seems
unlikely that you would ever need to use accept-charset like this,
though; at least not in Zope, which does a good job of doing all of
this encoding/decoding work.
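
As a rough sketch of the decoding side, the idea is to try each charset
from the preference list against the submitted bytes until one fits.
This is only my reading of the behaviour described above, not a copy of
zope.publisher: decode_form_value is a hypothetical name, the fallback
to UTF-8 with replacement characters is an assumption, and the code is
Python 3.

```python
def decode_form_value(raw, charsets):
    """Decode submitted bytes using the first charset that fits."""
    for charset in charsets:
        if charset == '*':
            continue  # the wildcard names no concrete codec
        try:
            return raw.decode(charset)
        except (UnicodeDecodeError, LookupError):
            # Wrong charset for these bytes, or a codec Python lacks.
            continue
    # Assumed fallback: never fail, mark undecodable bytes instead.
    return raw.decode('utf-8', 'replace')


# A cp437-only form would be decoded from cp437 alone.
submitted = 'cd\u00e9'.encode('cp437')
value = decode_form_value(submitted, ['cp437'])
```

Note that a permissive charset such as iso-8859-1 accepts any byte
sequence, so the order of the list matters a great deal.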

Well, all of this is good to finally know. This has been a mysterious
black box to me for such a long time, and it turns out that I don't
need to worry about it.

The lessons I've learned for text, as they apply to my own code, are thus:

- Work in unicode, not strings; then you won't have to worry about collisions
  between unicode and non-ASCII strings ('caf\xc3\xa9' + u'cdé') raising
  decode errors.

- When working with text, decode strings to unicode instead of encoding
  unicode to strings. I was forcibly **encoding** my unicode objects when I'd
  be building up long strings, which came from my confusion over
  encode/decode. This is how I'd lose my extended characters and end up with
  garbage output.

- Be alert to what other text processing tools such as the Python
  implementations of Textile and Markdown want as input and return as output.
  In my ignorance, I wasn't paying attention to the fact that I needed to
  decode the results back to unicode, and I believe this was another systemic
  central point of pain, torture, and failure for my apps. And in my ignorance
  I tried to fix the errors that I saw with forcible *encoding* instead of
  *decoding*, which is why I would see garbage characters show up in
  certain situations. I now realize this is the right way to work with those
  tools::

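        # textile() here takes and returns utf-8 byte strings, so encode
        # on the way in and decode the result back to unicode.
        rendered = textile(content.encode('utf-8'), encoding='utf-8',
                           output='utf-8')
        return rendered.decode('utf-8')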
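
The same pattern can be sketched in a tool-independent way. The snippet
is Python 3, where str plays the role of unicode and bytes the role of
the old string type; call_bytes_tool and fake_renderer are hypothetical
names, and fake_renderer only stands in for a real Textile or Markdown
call.

```python
def call_bytes_tool(tool, text, encoding='utf-8'):
    """Encode text for a bytes-only tool and decode its result to text."""
    raw = tool(text.encode(encoding))
    return raw.decode(encoding)


def fake_renderer(data):
    # Hypothetical stand-in for a renderer that takes and returns bytes.
    return b'<p>' + data + b'</p>'


# Text goes in, text comes out; bytes exist only at the tool boundary.
rendered = call_bytes_tool(fake_renderer, 'cd\u00e9')

# Decoding with the wrong charset is one way garbage characters appear:
# the two UTF-8 bytes of the accented e become two Latin-1 characters.
garbled = 'cd\u00e9'.encode('utf-8').decode('latin-1')

# Python 2 implicitly decoded byte strings as ASCII when mixing them
# with unicode; doing that step explicitly shows where the error arose.
try:
    b'caf\xc3\xa9'.decode('ascii')
    mixing_failed = False
except UnicodeDecodeError:
    mixing_failed = True
```

Keeping the encode and decode calls together in one helper makes it
harder to forget the decode step on the way back.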

Does that all sound right?

-- 
Jeff Shell

