[Zope-dev] redirect burps on unicode URLs

Adam GROSZER agroszer at gmail.com
Mon Mar 1 11:04:29 EST 2010


Hello Christian,

Isn't it the case that every character below chr(128) encodes to UTF-8
as the same single byte? That would mean that slashes and ampersands
stay as they are.
OTOH, encoding only affects non-ASCII characters, assuming that the
encoding is UTF-8, which is what's hardwired into absoluteURL.
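
A minimal sketch of what I mean, assuming Python 3's urllib.parse rather
than the actual absoluteURL code. The helper name encode_non_ascii and the
example.com URL are made up for illustration:

```python
from urllib.parse import quote

# Every code point below 128 encodes to the identical single byte in UTF-8.
assert all(chr(i).encode("utf-8") == bytes([i]) for i in range(128))


def encode_non_ascii(url):
    """Percent-escape only the non-ASCII characters of a unicode URL.

    Hypothetical helper: ASCII characters, including reserved ones such
    as '/' and '&', are passed through untouched.
    """
    return "".join(
        ch if ord(ch) < 128 else quote(ch, safe="")  # quote() uses UTF-8 by default
        for ch in url
    )


url = "http://example.com/caf\u00e9/men\u00fc?a=1&b=2"
encoded = encode_non_ascii(url)
print(encoded)
```

The slashes, the question mark and the ampersand come out unchanged; only
the two accented characters are turned into UTF-8 percent-escapes.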

Monday, March 1, 2010, 4:40:30 PM, you wrote:

CT> On 03/01/2010 03:34 PM, Wichert Akkerman wrote:
>> On 3/1/10 15:09 , Christian Theune wrote:
>>> Hi,
>>>
>>> On 03/01/2010 02:28 PM, Martin Aspeli wrote:
>>>>
>>>> I'm with Wichert here.
>>>>
>>>> In most places, we tend to carry around unicode strings internally, and
>>>> only encode on the boundaries, e.g. when the URL is "rendered". I don't
>>>> see why redirect() can't have a sensible and predictable policy for
>>>> unicode strings, making life easier for everyone.
>>>>
>>>> If we think that non-ASCII URLs are illegal, then maybe we should
>>>> validate for that and throw an error. However, I don't think that's the
>>>> case (anymore?). In that case, passing a unicode object to the function
>>>> seems entirely consistent with other places, e.g. when we pass unicode
>>>> to the page template engine or return unicode from a view, which the
>>>> publisher then encodes before it's pushed down to the client.
>>>
>>> I opened a question in another part of the thread, but haven't gotten an
>>> answer yet. In my understanding, a Unicode string is not able to
>>> represent the structural properties of a URL in http scheme properly,
>>> thus encoding back to ASCII is not possible.
>>>
>>> Can someone confirm or disprove this?
>> 
>> I am not sure what you mean. On the wire you get a path component in a 
>> HTTP get request which is UTF-8 encoded and escaped. For example 
>> http://ja.wikipedia.org/wiki/%E3%83%A1%E3%82%A4%E3%83%B3%E3%83%9A%E3%83%BC%E3%82%B8 
>> , which is a Japanese string if you decode it back to unicode. That 
>> encoding works fine in two directions, and all other properties used in 
>> the http scheme such as query strings and fragments work normally. Can 
>> you provide an example of something that might not work?

CT> The problem is that a URI has internal structure which looks to me like
CT> it can't be reconstructed properly if it was decoded into a "regular"
CT> unicode string.

CT> E.g. reserved characters are probably decoded into their regular symbols
CT> (e.g. a slash embedded in a path component or ampersands used in query
CT> arguments), so escaping needs to be done (manually) before encoding.
CT> Also, some parts of a URI can use other ways to encode symbols.
CT> Hostnames would like to be encoded to punycode whereas URIs don't even
CT> say what character set unicode characters should be encoded to. That
CT> would be up to the application (e.g. our publisher, so that's manageable).

CT> I have the feeling that roundtrip behaviour of URI -> unicode string ->
CT> URI won't be possible fully correctly and thus may be susceptible to
CT> interference from the outside.

CT> I still hope we can do better than doing nothing about it. I just think
CT> it's more complex than calling encode('something'). ;)

CT> Christian
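
To make the roundtrip concern concrete, here is a rough sketch, again
assuming Python 3's urllib.parse and not anything in the publisher. The
path and the hostname are invented for the example. It shows an escaped
slash being lost when the whole path is decoded at once, and surviving
when each segment is decoded and re-escaped separately:

```python
from urllib.parse import quote, unquote

# A path whose second segment contains an escaped slash ("a/b" is ONE segment).
original = "/files/a%2Fb/report"

# Decoding the whole path at once turns %2F into a structural-looking '/'.
decoded = unquote(original)
naive = quote(decoded)  # '/' is in quote()'s default safe set, so it stays
assert naive != original  # the segment boundary information is gone

# Splitting on the structural '/' first keeps the embedded slash as data.
segments = [unquote(s) for s in original.split("/")]
rebuilt = "/".join(quote(s, safe="") for s in segments)
assert rebuilt == original

# Hostnames use a different scheme entirely (IDNA/punycode), not %-escapes.
host = "b\u00fccher.example".encode("idna")
print(naive, rebuilt, host)
```

So the structure has to be kept (split into components) before decoding;
a single flat unicode string is indeed not enough for the general case.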



-- 
Best regards,
 Adam GROSZER                            mailto:agroszer at gmail.com
--
Quote of the day:
Reflect upon your present blessings - of which every man has many- not on your past misfortunes, of which all men have some. 
- Charles Dickens 


