[Zope-CMF] Re: [dev] encoding issues: showstoppers for the CMF 1.5 release?

Thu Oct 7 10:17:50 EDT 2004

yuppie wrote:
> Hi!
> 
> 
> There are two encoding related issues I'd like to see resolved before 
> CMF 1.5 is released. Both issues lead to mixed encodings and as a result 
> to UnicodeDecodeErrors in Page Templates.
> 
> 
> A.) RSS Syndication fail on non-ascii
> -------------------------------------
> http://collector.zope.org/CMF/261
> 
> affected: CMFDefault syndication of content with other encodings than 
> ascii or unicode
> 
> This is a regression caused by the DTML to ZPT migration. If we agree on 
> adding a 'default_charset' property to CMF sites, the issue is easy to 
> fix by converting all strings passed to the template.
> 
> Are there better solutions? What should be the default 'default_charset'?

Two issues:

   - WRT the content already in the system:  we have no way of
     determining the encoding of "legacy" data stored as metadata for
     content objects, which makes it impossible to have *any* sane
     default value.  The site manager is going to have to
     make this call.  Until the site manager makes an explicit choice,
     this failure will (and must) continue to happen.

     In the future, we should consider adopting the strategy used by
     Silva and Zope3, which is to decode *all* text to Unicode on its
     way into the system (using the 'Content-type:' of the request),
     and re-encoding it on the way out.  Keeping encoded strings in
     the database is asking for trouble.  We will also need to write
     a legacy converter which allows the site manager to find all the
     encoded data in the system and supply an explicit encoding for it;
     that would allow us to decode it after the fact.

   - WRT the encoding of the outbound data:  it should be UTF-8, in
     conformance with the XML specs, unless the site manager customizes
     the template and changes its 'content_type'.
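
To make the "decode on the way in, re-encode on the way out" idea concrete, here is a rough sketch in plain Python.  None of these function names exist in the CMF, Silva, or Zope3;  they are made up for illustration, and refusing to guess when the request declares no charset is my assumption, not settled policy.

```python
from email.message import Message


def declared_charset(content_type):
    """Return the charset named in a Content-Type header value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def decode_inbound(raw, content_type):
    """Decode request bytes to text using the charset the client declared."""
    charset = declared_charset(content_type)
    if charset is None:
        # Assumption: refuse to guess rather than pick an arbitrary default.
        raise ValueError("request does not declare a charset")
    return raw.decode(charset)


def encode_outbound(text, charset="utf-8"):
    """Encode text for the response; UTF-8 unless the caller overrides it."""
    return text.encode(charset)


# Hypothetical round trip: Latin-1 form data in, UTF-8 feed data out.
raw = "Grüße".encode("iso-8859-1")
text = decode_inbound(raw, "text/html; charset=ISO-8859-1")
feed_bytes = encode_outbound(text)
```

The point of the sketch is that only the two boundary functions ever see encoded bytes;  everything stored in between is text.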

We could conceivably fix the first one by modifying the syndication 
tool, making its 'getSyndicatableContent' method return a list of 
dictionaries whose values were unicode strings.  Or, we might add a
method, 'toUnicode', to the tool, and call that from within RSS.py.
In either case, the tool would look for the context's 'default_encoding' 
property;  if found, it would be used to decode the value.
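
A minimal sketch of what such a 'toUnicode' helper might look like, written as a standalone function rather than as a real tool method.  The 'getattr' lookup stands in for however the tool would actually read the property, 'FakeSite' is a made-up stand-in for the context, and raising when the property is missing is my assumption (it matches the "failure must continue" point above).

```python
def to_unicode(value, context):
    """Return 'value' as text, decoding bytes with the context's encoding."""
    if isinstance(value, str):
        # Already text: nothing to decode.
        return value
    # Assumption: the property is readable as a plain attribute.
    encoding = getattr(context, "default_encoding", None)
    if not encoding:
        # No explicit choice by the site manager: fail instead of guessing.
        raise ValueError("context has no 'default_encoding' property")
    return value.decode(encoding)


class FakeSite:
    """Hypothetical context whose manager has chosen an encoding."""
    default_encoding = "iso-8859-1"


title = to_unicode(b"caf\xe9", FakeSite())
```

Values which are already unicode pass through untouched, so calling the helper on mixed data is safe.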

> B.) CMFSetup: import of type infos broken
> -----------------------------------------
> http://collector.zope.org/CMF/287
> 
> affected: CMFSetup imports. While you might not see any errors if your 
> content is ascii or unicode, all sites are inconsistent after imports. I 
> stumbled over this with type infos, but it looks like a general problem.
> 
> There are some other open CMFSetup issues, but I think this one is a 
> real showstopper. And so far I have no good idea how to fix this.
> 
> These are my thoughts so far, any help with this is welcome:
> 
> 1.) While it might be a future use case for CMFSetup to migrate 
> properties from string to unicode, for now we have to make sure that 
> imported properties are not unicode.

Note that both import and export contexts used by the CMFSetup tool have 
a 'getEncoding' method, and that the importer plugins are expected to 
use that value when manipulating XML data files.  The way to achieve 
what you want here is to supply an explicit encoding when setting the 
profile on the tool (pass the 'encoding' argument to 
'setProfileDirectory').  We could make that default to 'ascii', but that 
would be the only reasonable default.

  We *do* need to extend the tool's 'manage_updateToolProperties' to 
take an encoding argument, as well as the ZMI template which calls it.

> 2.) If we don't specify an encoding, CMFSetup imports all properties as 
> unicode. To convert unicode to string at a later point, we have to 
> specify an encoding as well.
> 
> Is there a way to export / import "raw" property values, avoiding the 
> need to specify an encoding?

No.  Strings stored in an external XML representation *must* be encoded 
to match the encoding of the document, which is UTF-8 by default.  When 
importing them, we have three options:

   - Leave them as unicode, which then makes life potentially complicated
     later, when they are mixed with encoded strings.  This is what the
     tool currently does.

   - Encode them using an implicit default.  I would argue that 'ascii'
     is the only sane default, because it is a no-op for the core
     software, and because any other value is arbitrarily wrong for some
     set of users.

   - Encode them using an explicitly passed encoding.
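
The three options can be sketched with one small function.  This is illustrative only:  'import_text' is a made-up name, not part of CMFSetup, and it assumes the XML parser has already handed us unicode text.

```python
def import_text(value, encoding=None):
    """Return an imported property value.

    'value' is unicode text as parsed from the XML file.  With no
    encoding it is left as unicode;  otherwise it is encoded to bytes.
    """
    if encoding is None:
        return value
    return value.encode(encoding)


# Option 1: leave the value as unicode.
as_unicode = import_text("Document")

# Option 2: implicit 'ascii' default, a no-op for ascii-only data.
as_ascii = import_text("Document", "ascii")

# Option 3: an encoding chosen explicitly by the site manager.
as_utf8 = import_text("Café", "utf-8")

# With 'ascii', non-ascii data fails loudly instead of being mangled.
try:
    import_text("Café", "ascii")
except UnicodeEncodeError:
    ascii_rejected = True
```

Note that the 'ascii' case never silently produces wrong bytes:  it either round-trips or raises.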

> What would be a good default encoding? I guess it might sometimes not be 
> the same as the 'default_charset' for content. Should we restrict 
> configuration data to ascii?

That is the only reliable default.  The 'default_charset' hack is 
intended to be a transitional thing;  I would rather not depend on it.

Tres.
-- 
===============================================================
Tres Seaver                                tseaver at zope.com
Zope Corporation      "Zope Dealers"       http://www.zope.com