[Grok-dev] Problem with character encoding

Luciano Ramalho luciano at ramalho.org
Tue Jul 8 18:21:43 EDT 2008


On Tue, Jul 8, 2008 at 6:49 PM, Sebastian Ware <sebastian at urbantalk.se> wrote:
> Many thanks for your patience Luciano! I wish I was just tired, but
> unfortunately it is the character encoding that confuses me :(

You are very welcome, Sebastian!

> I was expecting
>
>   u'å'.encode('iso-8859-1')
>
> to encode the unicode string to a 'iso-8859-1' encoded string, but as you
> are pointing out, it returns a two byte encoding.

No, it returns a one-byte encoding, which the Python console displays
as a hex escape sequence:

>>> c = u'å'.encode('iso-8859-1')
>>> c
'\xe5'
>>> len(c)
1
>>>

> However, it is eventually
> encoded properly by urllib.urlencode and allows me to (in this case) send an
> sms with non-ascii characters.
>
> The spec I need to meet is:
>
>  -perform a http-post with a 'iso-8859-1' encoded string
>
> I can do it in the python interpreter, but once I use a string stored in the
> Zodb, non-ascii characters go bonkers...

I really don't see what the ZODB has to do with it.
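
For the HTTP POST itself, here is a minimal sketch of building an
ISO-8859-1 form body. The field name 'message' is hypothetical (it is
not taken from your gateway's spec), and the import fallback is only
there so the same lines run on Python 2 and Python 3:

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2, as used in this thread

# Text as the application would hold it: a unicode object.
text = u'\xe5'  # the character å

# Encode to ISO-8859-1 bytes first, then let urlencode percent-escape them.
# 'message' is a hypothetical field name used only for illustration.
body = urlencode({'message': text.encode('iso-8859-1')})
print(body)  # message=%E5
```

The single byte 0xE5 shows up as %E5 in the body, which is what a
receiver expecting ISO-8859-1 should see for å.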

I think you are getting confused by the fact that Python actually has
two string types today: str and unicode. You use the str.decode method
to convert from a string in a particular encoding (such as iso-8859-1
or utf-8) to unicode, and unicode.encode to do the opposite: convert a
unicode object to a str object, using a certain encoding to do it.

Take a look: c is a str containing the ISO-8859-1 char for å (one byte):

>>> c = u'å'.encode('iso-8859-1')
>>> c
'\xe5'
>>> len(c)
1

Now I convert it to a unicode object, containing the same char (here,
len does not tell me the number of bytes, but the number of characters
in the unicode object, which is really what matters to us most of the
time):

>>> u = c.decode('iso-8859-1')
>>> u
u'\xe5'
>>> len(u)
1

If we convert the same unicode object back to str, but using the UTF-8
encoding, the result is a two-byte str:

>>> t = u.encode('utf-8')
>>> t
'\xc3\xa5'
>>> len(t)
2
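
The three steps above can be folded into one small helper. This is
only a sketch: reencode is a hypothetical name, not something from the
standard library, and the last part shows what happens if the bytes
are decoded with a codec other than the one that produced them:

```python
def reencode(data, source, target):
    """Decode bytes from the source encoding, then encode them in the target one."""
    return data.decode(source).encode(target)

latin1 = u'\xe5'.encode('iso-8859-1')   # one byte: 0xE5
utf8 = reencode(latin1, 'iso-8859-1', 'utf-8')  # two bytes: 0xC3 0xA5

# Decoding ISO-8859-1 bytes as if they were UTF-8 fails, because a lone
# 0xE5 byte is not a valid UTF-8 sequence.
try:
    latin1.decode('utf-8')
    wrong_codec_failed = False
except UnicodeDecodeError:
    wrong_codec_failed = True

print(len(latin1), len(utf8), wrong_codec_failed)
```

So the rule of thumb is: decode with the encoding the bytes really
are in, work with unicode objects, and encode only at the point where
you hand the data to the outside world.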

Hth!

Cheers,

Luciano

