[Zope-Coders] Re: [Zope-dev] Unicode treatment in 2.6b1

Florent Guillaume fg@nuxeo.com
27 Sep 2002 18:02:17 +0200


On Fri, 2002-09-27 at 08:57, Toby Dickenson wrote:
> On Thursday 26 Sep 2002 10:58 pm, Florent Guillaume wrote:
> > For PageTemplates, the various blocks produced by the template and
> > Python are sent to a StringIO-like object, which is responsible for
> > converting them into a coherent thing when its getvalue() method is
> > called. At the moment it doesn't deal very well with mixed Unicode and
> > non-Unicode strings so the reported failures don't surprise me. WE NEED
> > TO FIX THIS BEFORE THE NEXT BETA,
> 
> I agree. Is someone committed to working on this?

Yes, I am.

> > probably also by providing an explicit
> > native encoding.
> 
> That's not what DTML currently does, and I don't see an obvious reason why page
> templates should be different. The DTML semantics have been worked out
> carefully over the last few years.

But how much feedback from international users did you have until
recently?

> The problem with this proposed approach is that it confuses the encoding of 
> the *document* with the encoding of the *attributes* of the objects which are 
> used to create the document. Page templates often deal with diverse objects 
> from different sources; how is it to know that all objects use the same
> character encoding for 8-bit strings?

Because if, say, some Greek guy puts 8-bit strings in the source of his
pages (and believe me he does it all the time :-), and in the attributes
of objects, they're all likely to be in *his* native default encoding,
which happens to be ISO-8859-7. Until Unicode was in, he just had to
slap on a "Content-Type: text/html; charset=iso-8859-7" header and all
was well. Same thing for Russian, Japanese, etc. My point is that the
encodings were very likely uniform (otherwise the application would
already render weirdly in browsers).

Enter Unicode. For some reason, part of the strings he generates are
now in Unicode (and may even contain characters outside ISO-8859-7, but
that's not my point). Page templates should merge the Unicode strings
and the 8-bit strings harmoniously, and in my example that means using
ISO-8859-7 as the encoding for 8-bit strings.
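
A rough sketch of that merging, for illustration only. It uses Python 3
bytes/str as stand-ins for the 8-bit and Unicode string types, and
merge_chunks is a hypothetical helper, not anything in TAL:

```python
# Illustrative sketch: bytes plays the role of an 8-bit string, str the
# role of a Unicode string. merge_chunks is a hypothetical helper.
def merge_chunks(chunks, charset):
    """Join mixed chunks, decoding the 8-bit ones with the given charset."""
    parts = []
    for chunk in chunks:
        if isinstance(chunk, bytes):
            chunk = chunk.decode(charset)
        parts.append(chunk)
    return "".join(parts)


# "alpha beta gamma" stored as ISO-8859-7 bytes (legacy 8-bit content).
greek_8bit = "\u03b1\u03b2\u03b3".encode("iso-8859-7")
# A Unicode string holding a character that ISO-8859-7 cannot represent.
unicode_part = " \u65e5"

merged = merge_chunks([greek_8bit, unicode_part], "iso-8859-7")
# Decoding the same bytes as latin-1 yields accented Latin letters instead.
garbled = merge_chunks([greek_8bit, unicode_part], "latin-1")
```

With the right charset the Greek letters survive next to the Unicode
chunk; with latin-1 the same bytes silently become different characters.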

So I think that the merging of strings in TAL (getvalue()) should
either use the detected encoding in the same way as ZPublisher does
(sniffing the charset from the Content-Type header that has been set),
or TALInterpreter should be passed an explicit charset, which would have
to be supplied by PageTemplate. I don't know which is the best choice.
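
The two options could be combined along these lines. This is only a
sketch: sniff_charset and resolve_charset are hypothetical names, the
regular expression is a simplification of real header parsing, and the
latin-1 default is an assumption that mirrors the fixed DTML encoding:

```python
import re

# Simplified pattern for the charset parameter of a Content-Type value.
_CHARSET_RE = re.compile(r'charset\s*=\s*"?([A-Za-z0-9._:-]+)"?', re.IGNORECASE)


def sniff_charset(content_type, default="latin-1"):
    """Return the charset named in a Content-Type value, else the default."""
    match = _CHARSET_RE.search(content_type or "")
    return match.group(1).lower() if match else default


def resolve_charset(explicit, content_type, default="latin-1"):
    """Prefer an explicitly passed charset; otherwise sniff the header."""
    if explicit:
        return explicit
    return sniff_charset(content_type, default)
```

For example, resolve_charset(None, "text/html; charset=iso-8859-7")
falls back to the header, while an explicit argument always wins.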

(Note that Localizer does it differently: it makes StringIO sniff the
Content-Type header charset, but converts everything to 8-bit and thus
gives a plain string to ZPublisher. This has problems if you want to mix
8-bit character sets...)

> New objects should be exposing these attributes as unicode objects, and legacy
> objects would have had to expose them as latin-1 if they wanted them rendered
> correctly in the ZMI.

(testing manage_propertiesForm)

In Zope 2.5.1, the ZMI doesn't set any charset, so the scenario above
would send 8-bit character strings, and the user would have his browser
autodetect (or not) that the encoding should be ISO-8859-7. This is how
it works today.

When migrating to Zope 2.6, all the preexisting string properties
containing ISO-8859-7 text will be sent as 8-bit strings in the ZMI,
which DTML would decode into Unicode as latin-1 (fixed encoding), so
they would render incorrectly. Providing an explicit charset for
conversions (maybe simply as an environment variable, that's for legacy
after all) would correct that.
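
A sketch of what such an environment-variable fallback might look like.
The variable name LEGACY_8BIT_CHARSET and the to_unicode helper are
made up for illustration; Python 3 bytes stands in for a legacy 8-bit
string property:

```python
import os


def legacy_charset(environ=os.environ):
    """Charset for legacy 8-bit strings; latin-1 when nothing is set.

    LEGACY_8BIT_CHARSET is a hypothetical variable name.
    """
    return environ.get("LEGACY_8BIT_CHARSET", "latin-1")


def to_unicode(value, environ=os.environ):
    """Decode 8-bit values with the legacy charset; pass text through."""
    if isinstance(value, bytes):
        return value.decode(legacy_charset(environ))
    return value


stored = b"\xe1\xe2\xe3"  # a property saved as ISO-8859-7 bytes

as_latin1 = to_unicode(stored, {})
as_greek = to_unicode(stored, {"LEGACY_8BIT_CHARSET": "iso-8859-7"})
```

Without the variable the bytes come out as accented Latin letters; with
it set to iso-8859-7 they come out as the intended Greek letters.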


Florent

-- 
Florent Guillaume, Nuxeo (Paris, France)
+33 1 40 33 79 87  http://nuxeo.com  mailto:fg@nuxeo.com