[Zope-Coders] Re: [Zope-dev] Unicode treatment in 2.6b1

Sun, 6 Oct 2002 20:26:05 +0000 (UTC)

I've been toying with ideas quite a bit.

To recap the problem: many users have page templates containing 8-bit
strings, and code generating 8-bit strings. Just because they now also
want to output Unicode strings, it's a bit harsh to ask them to remove
all their 8-bit strings and go back to pure ASCII (which practically
speaking isn't feasible for the page templates).

In the current situation, rendering a page template means joining a lot
of small string pieces, usually 8-bit strings (because that's how the
page template is stored), sometimes Unicode strings (the result of a
substitution by some computed Unicode string). But ''.join([u'caf', 'é'])
won't work in the majority of cases because the default encoding is set
to 'ascii' in site.py.
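
To make the failure concrete, here is a minimal sketch. Under Python 2 the
join implicitly decodes every 8-bit piece with the default encoding; the
sketch performs that decoding explicitly so it also runs on later Python
versions. The names join_pieces and default_encoding are hypothetical, and
the 'é' is assumed to be stored as the Latin-1 byte 0xE9.

```python
def join_pieces(pieces, default_encoding='ascii'):
    """Mimic the implicit 8-bit -> Unicode coercion that ''.join performs."""
    out = []
    for piece in pieces:
        if isinstance(piece, bytes):
            # This is the step that fails for non-ASCII bytes under 'ascii'.
            piece = piece.decode(default_encoding)
        out.append(piece)
    return u''.join(out)


# With the 'ascii' default, a single non-ASCII byte breaks the whole join.
try:
    join_pieces([u'caf', b'\xe9'])
    failed = False
except UnicodeError:
    failed = True

# With a default encoding that matches the stored bytes, the join succeeds.
ok = join_pieces([u'caf', b'\xe9'], default_encoding='latin-1')
```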

One solution would be for our derivative of StringIO in the page
template code to catch the case where ''.join fails (UnicodeError),
and then recode by hand every non-ASCII 8-bit string into Unicode using
some assumed encoding. But this is very slow: quite a big speed penalty
for a lonely 'é' in a page template.
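
A rough sketch of that fallback, under stated assumptions: the function
name join_with_fallback and the 'latin-1' assumed encoding are both
hypothetical, and the fast path decodes explicitly as ASCII to stand in
for the plain ''.join that Python 2 would do natively.

```python
def join_with_fallback(pieces, assumed_encoding='latin-1'):
    """Join mixed 8-bit/Unicode pieces, recoding only when the join fails."""
    try:
        # Fast path: succeeds when every 8-bit piece is pure ASCII.
        return u''.join(
            p.decode('ascii') if isinstance(p, bytes) else p
            for p in pieces)
    except UnicodeError:
        # Slow path: walk all pieces again and recode the 8-bit ones by hand.
        return u''.join(
            p.decode(assumed_encoding) if isinstance(p, bytes) else p
            for p in pieces)
```

The cost is visible in the structure: one stray non-ASCII byte forces a
second pass over every piece of the template.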

If we make the quite reasonable assumption that there will be only one
legacy 8-bit encoding throughout the site, then passing this encoding at
startup to sys.setdefaultencoding would make ''.join work (and still be
fast), and all would be well.

So I propose that in z2.py
 - we check some command line switch (-E foo for instance) and if it is
   set then we call sys.setdefaultencoding(foo),
 - we import site.py to get original behavior.

So we'd have a small modification to z2.py, and only adding -S (so that
python doesn't import site.py, which removes sys.setdefaultencoding)
and -E foo to the startup command line would enable it for those who
want it.
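
A minimal sketch of what that startup code might look like. The -E switch
is only the proposal above, not an existing option, and the helper names
encoding_from_argv and apply_default_encoding are hypothetical; the many
other switches z2.py accepts are left out.

```python
import codecs
import getopt
import sys


def encoding_from_argv(argv):
    """Return the value of the proposed -E switch, or None if absent."""
    opts, _args = getopt.getopt(argv, 'E:')
    for name, value in opts:
        if name == '-E':
            codecs.lookup(value)  # raises LookupError for an unknown codec
            return value
    return None


def apply_default_encoding(encoding):
    """Set the default encoding if the interpreter still allows it."""
    # sys.setdefaultencoding only survives when Python 2 is started with -S,
    # because site.py deletes it. Anywhere else this function does nothing.
    setter = getattr(sys, 'setdefaultencoding', None)
    if encoding is not None and setter is not None:
        setter(encoding)
        return True
    return False
```

After apply_default_encoding(encoding_from_argv(sys.argv[1:])), z2.py
would do "import site" itself to get back everything that -S skipped.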

Alternatively, we could simply have an argument-less -E, which would
mean "use the encoding from the locale" -- I'd prefer that solution.
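
A sketch of how the locale's encoding could be picked up, falling back to
Latin-1 as in the earlier suggestion quoted below. The function name
locale_encoding and the validation step are assumptions, not existing
Zope code.

```python
import codecs
import locale


def locale_encoding(fallback='latin-1'):
    """Return the locale's encoding, or the fallback if none is usable."""
    try:
        name = locale.getlocale()[1]
    except ValueError:
        # Some platforms report locale strings that Python cannot parse.
        name = None
    name = name or fallback
    try:
        codecs.lookup(name)  # make sure Python knows this codec
    except LookupError:
        name = fallback
    return name
```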

Comments?

Florent

Toby Dickenson  <tdickenson@geminidataloggers.com> wrote:
> On Monday 30 Sep 2002 1:00 pm, Florent Guillaume wrote:
> 
> > Ok, here's something that occurred to me:
> > Why not explicitly use "locale.getlocale()[1] or 'latin-1'" as the
> > default encoding for all str->unicode conversions?
> 
> Is this just 'for legacy support', or a new feature that we plan to support?
> 
> A new ugly environment variable is something we can take away eventually, in 
> principle. Even if we can't ever do that in practice, we can encourage more 
> people to move to Unicode strings by threatening to ;-)
> 
> Anything based on locale feels more permanent.
-- 
Florent Guillaume, Nuxeo (Paris, France)
+33 1 40 33 79 87  http://nuxeo.com  mailto:fg@nuxeo.com