[Zope-dev] problems with accented characters - need advice

Florent Guillaume fg@nuxeo.com
Thu, 20 Jun 2002 16:27:59 +0000 (UTC)


Florent Guillaume  <fg@nuxeo.com> wrote:
> Toby Dickenson  <tdickenson@geminidataloggers.com> wrote:
> > On Tuesday 18 Jun 2002 8:44 pm, Dieter Maurer wrote:
> > 
> > > The reason why it is still there is that a change should work for
> > > all languages and not only western ones. This poses the question
> > > how the byte string representing the id should be URL quoted.
> > 
> > There is an RFC, I forget which one, which specifies utf8.
> 
> I'm not sure about that; the last time I dug into it, I found that
> the RFCs explicitly said that no encoding was specified and that it
> was up to the application to decide what to use.

To be precise:

RFC 2068 (HTTP 1.1) says:

   For definitive information on URL syntax and semantics, see RFC 1738
   [4] and RFC 1808 [11]. The BNF above includes national characters not
   allowed in valid URLs as specified by RFC 1738, since HTTP servers
   are not restricted in the set of unreserved characters allowed to
   represent the rel_path part of addresses, and HTTP proxies may
   receive requests for URIs not defined by RFC 1738.

RFC 1808 (URL) doesn't talk about charset or bytes.

RFC 2396 (URI, updates RFC 1808 & RFC 1738) says the following, which
I'll quote extensively. Note the last two paragraphs.

  2.1 URI and non-ASCII characters

     The relationship between URI and characters has been a source of
     confusion for characters that are not part of US-ASCII. To describe
     the relationship, it is useful to distinguish between a "character"
     (as a distinguishable semantic entity) and an "octet" (an 8-bit
     byte). There are two mappings, one from URI characters to octets, and
     a second from octets to original characters:

     URI character sequence->octet sequence->original character sequence

     A URI is represented as a sequence of characters, not as a sequence
     of octets. That is because URI might be "transported" by means that
     are not through a computer network, e.g., printed on paper, read over
     the radio, etc.

     A URI scheme may define a mapping from URI characters to octets;
     whether this is done depends on the scheme. Commonly, within a
     delimited component of a URI, a sequence of characters may be used to
     represent a sequence of octets. For example, the character "a"
     represents the octet 97 (decimal), while the character sequence "%",
     "0", "a" represents the octet 10 (decimal).

     There is a second translation for some resources: the sequence of
     octets defined by a component of the URI is subsequently used to
     represent a sequence of characters. A 'charset' defines this mapping.
     There are many charsets in use in Internet protocols. For example,
     UTF-8 [UTF-8] defines a mapping from sequences of octets to sequences
     of characters in the repertoire of ISO 10646.

     In the simplest case, the original character sequence contains only
     characters that are defined in US-ASCII, and the two levels of
     mapping are simple and easily invertible: each 'original character'
     is represented as the octet for the US-ASCII code for it, which is,
     in turn, represented as either the US-ASCII character, or else the
     "%" escape sequence for that octet.

     For original character sequences that contain non-ASCII characters,
     however, the situation is more difficult. Internet protocols that
     transmit octet sequences intended to represent character sequences
     are expected to provide some way of identifying the charset used, if
     there might be more than one [RFC2277].  However, there is currently
     no provision within the generic URI syntax to accomplish this
     identification. An individual URI scheme may require a single
     charset, define a default charset, or provide a way to indicate the
     charset used.

     It is expected that a systematic treatment of character encoding
     within URI will be developed as a future modification of this
     specification.
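
To make the two mappings concrete, here is a rough sketch in Python
using the standard urllib.parse module. It is only an illustration: the
helper names (uri_to_octets, octets_to_text) and the sample id "café"
are my own assumptions, not anything taken from the RFCs or from Zope.
It shows that the first mapping (URI characters to octets) is
unambiguous, while the second (octets to characters) depends entirely
on a charset that the URI itself does not carry.

```python
from urllib.parse import quote, unquote_to_bytes


def uri_to_octets(component: str) -> bytes:
    """First mapping: URI character sequence -> octet sequence."""
    return unquote_to_bytes(component)


def octets_to_text(octets: bytes, charset: str) -> str:
    """Second mapping: octet sequence -> original characters.

    The charset is an assumption made by the application; the generic
    URI syntax gives no way to state it.
    """
    return octets.decode(charset)


# The RFC's own examples: "a" is octet 97, "%0a" is octet 10.
assert uri_to_octets("a") == bytes([97])
assert uri_to_octets("%0a") == bytes([10])

# The same (hypothetical) id quoted under two different charsets.
name = "café"
utf8_uri = quote(name.encode("utf-8"))      # 'caf%C3%A9'
latin1_uri = quote(name.encode("latin-1"))  # 'caf%E9'

# Decoding with the charset used for quoting recovers the id...
assert octets_to_text(uri_to_octets(utf8_uri), "utf-8") == name
assert octets_to_text(uri_to_octets(latin1_uri), "latin-1") == name

# ...but guessing the wrong charset silently gives different characters,
utf8_read_as_latin1 = octets_to_text(uri_to_octets(utf8_uri), "latin-1")
assert utf8_read_as_latin1 != name

# or fails outright, since a lone 0xE9 octet is not valid UTF-8.
try:
    octets_to_text(uri_to_octets(latin1_uri), "utf-8")
    latin1_read_as_utf8_failed = False
except UnicodeDecodeError:
    latin1_read_as_utf8_failed = True
```

So both "caf%C3%A9" and "caf%E9" are legitimate quotings of the same
id, and a server that receives one of them cannot tell from the URI
alone which charset was meant.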


Florent

-- 
Florent Guillaume, Nuxeo (Paris, France)
+33 1 40 33 79 87  http://nuxeo.com  mailto:fg@nuxeo.com