[Zope3-Users] Re: Problem with unicode decode error

Sun May 28 09:07:07 EDT 2006

Hi,

> I'm reading in a file to create content objects out of this file, I'm using 
> the csv.reader object for that purpose.

From what I read below it seems like you're storing the raw string data
and aren't decoding non-ASCII characters. Content objects usually store
unicode data because it's much easier to deal with unicode than with
encoded strings within Python.

> The name is chosen by a customized name chooser adapter:
> 
> class XGMNameChooser(NameChooser):
>     implements(INameChooser)
>     
>     def chooseName(self, name, object):
>         if IAbbreviation.providedBy(object):
>             n = unicode(object.abbreviation)  <---
>             if object.abbreviation in self.context:
>                 i = 0
>                 while n in self.context:
>                     i += 1
>                     n = unicode(object.abbreviation) + u'-' + unicode(i)
>             self.checkName(n, object)
>             return n
>         else:
>             return super(XGMNameChooser, self).chooseName(name, object)
> 
> 
> However, if object.abbreviation contains an umlaut (äöü) it breaks:
> 
>   File "/home/florian/Desktop/zope/lib/python/xgm/xgm.py", line 27, in chooseName
>     n = unicode(object.abbreviation)
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 2: ordinal not in range(128)

Yes, of course it breaks: called without an encoding, unicode() falls
back to the default ASCII codec, which cannot decode any byte above 127.
You have to tell it which encoding the string data is in. I suspect
unicode(object.abbreviation, 'latin1') or unicode(object.abbreviation,
'utf-8') would work.
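
To make the difference between the codecs concrete, here is a minimal
sketch. The sample value is hypothetical (the real abbreviation is not
shown in this thread), and it assumes the data is UTF-8, which the 0xc3
byte in the traceback is at least consistent with. raw.decode(encoding)
does the same job as unicode(raw, encoding).

```python
# Hypothetical sample value; assumed to be UTF-8 bytes for u'Gr\xf6sse'.
raw = b'Gr\xc3\xb6sse'

# Decoding with the codec the data was written in yields the intended text.
as_utf8 = raw.decode('utf-8')

# latin1 maps every byte to some character, so it never raises, but
# applied to UTF-8 data it produces two wrong characters per umlaut.
as_latin1 = raw.decode('latin1')

# Decoding as ASCII fails on the first byte above 127, like the traceback.
try:
    raw.decode('ascii')
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = exc
```

Which codec is the right one depends on how the file was actually
written, so a call that does not raise is not proof of a correct decode.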

However, I still advise you to make object.abbreviation a unicode
object in the first place. How would the XGMNameChooser know which
encoding to apply? All views and all components that work with your
abbreviation objects would have to know the right encoding in order to
work with the data properly. Instead, you should make the
csv.reader-based input mechanism decode to unicode once and then deal
only with unicode.
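
A rough sketch of what "decode once" could look like. decode_row is a
hypothetical helper, not part of csv or Zope, and the sample row and
its UTF-8 encoding are assumptions standing in for what csv.reader
hands back as byte strings.

```python
def decode_row(row, encoding):
    """Return a copy of row with every byte-string cell decoded to text."""
    # Cells that are already text are passed through unchanged, so the
    # helper is safe to call on a row that was decoded earlier.
    return [cell.decode(encoding) if isinstance(cell, bytes) else cell
            for cell in row]

# Hypothetical row of byte strings, assumed to be UTF-8 encoded.
raw_row = [b'Gr\xc3\xb6sse', b'size']

# Decode at the input boundary, before any content object is created.
row = decode_row(raw_row, 'utf-8')
abbreviation = row[0]

# Downstream code such as a name chooser can now combine the value with
# other text without knowing anything about the file's encoding.
name = abbreviation + u'-' + u'1'
```

With this in place the import code is the only spot that has to know
the file's encoding.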

Philipp