[Zope3-dev] Sprintathon diary, second and last part

Guido van Rossum guido@python.org
Sat, 07 Dec 2002 19:54:04 -0500


Continued Sprintathon diary for Friday.

Project for Friday morning (once power is restored; power in all of
Rotterdam is out from 9am till 1am): add a "Register existing objects"
command to SteveA's Registration utility.  First approximation of the
algorithm: starting at the root, traverse down the content namespace
and register all objects found.  Second approximation: start at the
content object to whose Service Manager the Registration Utility is
attached.

Subproblem one: how to find this content object, given the
Registration Utility object itself.  Several suggestions from SteveA:
get the location (a tuple of strings) and look for '++etc++Services',
or at least an item starting with '++etc++'; the path up to (but not
including) that point identifies the content object we are looking
for.  This assumes the RU is in the service area; that might not
always be true, but for this example it certainly is (the only menu
where you can create one is in the "Add Service"menu).  Another
approach would be to start at the other end: beginning with the RU
object itself, keep backing up to the parent in the (full) namespace
until a content object is found.  There are two ways to do this: by
stripping items from the end of the path and doing lookups (path
traversals from the root), or (perhaps more direct) by starting at the
context of 'wrapped_self' and backing up to the parent context until a
content object (or the root) is reached.

These latter approaches require us to have a test for what is a
content object.  There currently isn't an interface that we can test
for using isImplementedBy or queryAdapter to use for this, unless we
assume that Service Managers are always attached to Folders: then it
is easy, since folders implement IFolder and the various objects in
the Service Manager (the SM itself, its Packages directory, and
packages) don't.  SteveA suggested that there should be such an
interface, so he can attach service managers to non-folder objects,
but adding such an Interface would require a lot of programming.  Also
it's not clear whether isImplementedBy is sufficient or whether we
should use queryAdapter.  The advantage of the latter is that we don't
have to modify all content objects to add an IContent marker interface
to their __implements__ attribute.  In the end we decide to search the
location tuple for an item that .startswith('++etc++').

A bit about locations: there are several different ways to represent a
location, and in general functions/methods that take a location accept
them all.  One form is a unicode string containing slashes for
separators; a leading slash means an absolute path.  Another form is a
sequence of unicode strings that don't contain slashes; a leading
empty string means an absolute path.  The root is represented by ['']
or ('',); an empty list or tuple is not a path (I guess it means an
empty relative path :-).  I think there are rules for the characters
allowed in path components for content objects, but I don't know what
they are; from observation, a name starting with '++etc++' has
something do do with services, while a name starting with '@@' is (a
shortcut for?) a view.  A name consisting of a single '+' is an 'Add
view', which has magic behavior that I haven't quite understood yet; I
think it may be a shortcut for a view on a container that can be used
to add new items.

The most canonical representation of locations (AKA paths) is as a
tuple of Unicode strings; tuples are handy because they can be used as
keys (e.g.  the ObjectHub uses this in its location-to-hubid mapping).
There are convenience routines to translate locations between both
forms, as well as to go between locations and objects,and the
ObjectHub has methods to translate between hubids and objects or
locations.  Also handy: getPhysicalPath(context) gives the location
(as a tuple) of the context (typically an object), and
getPhysicalRoot(context) gives the root object.  traverseName(object,
name) yields a named subobject (where object must implement
IReadContainer I believe, and be wrapped in a context) and
traverse(object, location) yields an object gotten by traversing the
whole location.  I don't know what traversing an absolute path does (I
suppose it could start from the root or ignore the '' name).
Traversal is complicated by the '++' and '@@' conventions, and by
adapters, and by context wrappers, and by security proxies.  There's a
'traverser' object, which we ended up not using, and an even
lower-level 'ITraversable' interface which we also ended up not using
(it is used by traverse() and traverseName(), and its API is seriously
weird).  There's also a ContainmentIterator helper class, which is an
iterator that gives you the path from a context to its root.
Actually, there are two ContainmentIterator classes: the original, in
a module by itself in the Zope.ContextWrapper package, and a
proxy-aware one (which you should aways use) in the
Zope.Proxy.ContextWrapper module.  Enough already.

In the end, we ended up with very simple straightforward code.  To
find the folder in whose service manager we are (too bad we can't use
Python 2.3's enumerate() builtin):

def findContentObject(context):
    location = getPhysicalPath(context)
    for i in range(len(location)):
        if location[i].startswith("++etc++"):
            location = location[:i] break
    else:
        raise ValueError, "can't find ++etc++ in path"
    root = getPhysicalRoot(context)
    return traverse(root, location)

To register all objects starting with a given object, we use a simple
recursive method:

    def _registerTree(wrapped_self, object):
        wrapped_self._registerObject(object)
        if not IFolder.isImplementedBy(object):
            return
        names = object.keys()
        for name in names:
            sub_object = traverseName(object, name)
            wrapped_self._registerTree(sub_object)
    _registerTree = ContextMethod(_registerTree)

(See part 1 of this diary for ContextMethod.)  Note that the IFolder
test is a policy decision.  Another policy decision is in
_registerObject():

    def _registerObject(wrapped_self, object, hub):
        try:
            getService(wrapped_self, "ObjectHub").register(object)
        except ObjectHubError:
            pass
    _registerObject = ContextMethod(_registerObject)

This attempts to register every object, which is another policy
decision: for the textindex it would be sufficient to only register if
ISearchableText.isImplementedBy(object), but maybe a path index or a
metadata index would be interested in other objects.  The Registration
Utility exists to implement such policies.  I suppose a generalized RU
could be written that takes policy specifications as arguments,
although SteveA was surprisingly lukewarm about this idea (whereas
otherwise he's always the first to propose the most general mechanism
imaginable, based on marker interfaces and adapters, and refactoring
some code to make it more general, if at all feasible :-).  (Sorry
Steve.  :-) Note that an important part of the policy may have to do
with workflow: e.g.  an object may only be indexed when it becomes
published.  For the editors' use, another textindex for unpublished
objects could be conceivable.  I guess this would have to use a
different object hub instance, since the textindex effectively listen
to object hub (un)registration events.

The try/except is needed because we may be registering objects that
were already registered.  The object hub currently raises an exception
in this case.  At the sprint we thought it would be a good idea to
change this to to returning True/False about a duplicate object, to
distinguish it from more severe errors; but actually, ObjectHubError
is only raised for duplicate or unknown locations; other errors use
different exceptions.  So this proposed refactoring is not needed, and
catching the exception is exactly right.

Hooking this all up to the UI is trivial, as long as you don't mind
that pilot errors cause exceptions, e.g.  this will raise an exception
if there's no ObjectHub.

While writing this, I also realized that every call to
_registerObject() makes a call to getService() to get the hub; it's
more efficient to do this one and pass the hub around.  We can't store
the hub as an instance variable though: at different times, we may end
up using different hubs; but within one call to registerExisting() we
should be able to count on having the same hub the whole time, so we
can refactor as follows:

    def registerExisting(wrapped_self):
        object = findContentObject(wrapped_self)
	hub = getService(wrapped_self, "ObjectHub")
        wrapped_self._registerTree(object, hub)
    registerExisting = ContextMethod(registerExisting)

    def _registerObject(wrapped_self, object, hub):
        try:
            hub.register(object)
        except ObjectHubError:
            pass
    _registerObject = ContextMethod(_registerObject)

Now notify changes to:

    def notify(wrapped_self, event):
        hub = getService(wrapped_self, "ObjectHub")
        self._registerObject(event.object, hub)
    notify = ContextMethod(notify)

A word about 'wrapped_self'.  By convention (not sure if this is done
consistently) a 'context method' (see diary part 1) uses wrapped_self
instead of self.  In part 1 of this diary I complained about this,
because I kept making mistakes (writing 'self').  I brought it up with
SteveA, who was adamant that the convention is important for context
methods.  He pointed out that (a) many methods end up doing something
like 'self = unwrap(wrapped_self)' and (b) I was writing code that
runs in the service area (even though it's not a service) and hence is
pretty atypical; he claims that content code rarely needs to declare
context methods.  I'm willing to accept this and will try to get used
to it.  I note that PyChecker may need to be educated about this
convention.

***

While waiting for a dinner group to assemble, Steve & I brought up an
idea we had yesterday to make the subscription API easier to use.  I
noted that when an object is interested in multiple events and needs
to do different things based on the event type, you end up doing
(relatively) expensive isImplementedBy() tests inside notify() for
each event.  For example:

    def notify(self, event):
        if (IObjectRegisteredHubEvent.isImplementedBy(event) or
            IObjectModifiedHubEvent.isImplementedBy(event)): ...call
            index_doc(), which also acts as a reindex_doc()...
        elif IObjectUnregisteredHubEvent.isImplementedBy(event):
            ...call unindex_doc()...

While the index_doc() call is easily the most expensive thing here, in
other cases it may not be, and the isImplementedBy() call can never be
made as efficient as a simple comparison between two ints or strings.
(An int or string would also make it possible to use a dict to
implement a switch.)

We thought it would be cool and more Pythonic if instead of
subscribing an object to events, you could subscribe a callable.  Then
you could have separate methods:

    def notify_index(self, event):
         ...call index_doc()...

    def notify_unindex(self, event):
         ...call unindex_doc()...

But Jim immediately saw the problem with this: subscription are
persistent, but you can't pickle an arbitrary callable.  Certainly you
can't pickle a bound method; I think I could make it possible to
pickle a bound method of a persistent object (as a tuple
<persistent_id>, <method_name>), but that doesn't work today.  He
proposed a compromise: allow an extra piece of data (a 'memento',
Steve called it; I think this is a term from the Gang-of-Four patterns
book) to be given at subscription time that gets passed as a second
argument to notify().  Steve may undertake this refactoring:

    class ISubscribable(Interface):
        def subscribe(interface, memento=None): ...

    class ISubscriber(Interface):
         def notify(event, memento=None): ...

For backwards compatibility (i.e.  to avoid having to add a memento
argument to every notify() method in existence) I'd propose that the
subscribable calls notify() as follows:

    if memento is None:
        subscriber.notify(event)
    else:
        subscriber.notify(event, memento)

Then the above code could be refactored as follows:

        hub.subscribe(self, IObjectRegisteredHubEvent, 'index')
        hub.subscribe(self, IObjectModifiedHubEvent, 'index')
        hub.subscribe(self, IObjectUnregisteredHubEvent, 'unindex')

    def notify(self, event, memento):
        if memento == 'index':
            ...call index_doc...
        else:
            ...call unindex_doc...

***

The Zope grand renaming.  Martijn presented a convincing new naming
scheme, which won't repeat here.  I have one suggestion though.  Let's
call it Zope
4.  This synchronizes the version numbers of ZODB and Zope (and ZEO
   could
be switched to 4 as well).  It also avoids confusion in the developer
world about which code base is referred to.  Alexander Limi countered
that to him, as a graphical designer, Z3 looks better than Z4.  Maybe,
though I note that if we had to invent a logo for Z4, the similar
slants of the Z and the 4 could be used for some cuteness.

***

About security assertions and str()/repr().

When testing query() from a ZPT template for the first time, I noticed
that when I rendered the result of a failing query(), it showed
something like <security proxy wrapper ...  for tuple object at ...>,
rather than ([], 0) which was the actual value (after unwrapping).
Friday afternoon, Steve mentioned that he discovered that this was
because the security assertions were missing for str() and repr() of
the tuple, list and dict types.  He fixed the problem by adding such
assertions.  We discussed this with Jim, who pulled a sad face, and
brought up the following use case which had made him decide not to
allow str() and repr() on the standard container types: suppose you
have an object for which accessing it requires a certain permission,
which you don't have.  Now suppose that this object is in a container
for which you *do* have access permission.  If you extract the object
from the container using __getitem__, it will be wrapped in a security
proxy, preventing you from accessing it.  But if you can call str() on
the container, this will call str() on the *unwrapped* object,
revealing the str() of the object which you shouldn't be able to see.

My counter-position is that developers *really* expect str() and
repr() to work for all objects, and returning <security proxy
wrapper...> for a tuple is unacceptable (or at least really
inconvenient :-).  I think that objects containing sensitive
information should not reveal this information through repr() or
str().  Jim brought up the counter-example of a text document whose
str() was the text of the document, but agreed that the text should
probably be accessed through some other method than __str__() or
__repr__().  So perhaps we can agree that str() and repr() are always
allowed, and do a bit of developer education warning that objects
shouldn't reveal sensitive contents through str() or repr().  This
does mean that you can't use a built-in container type (tuple, list or
dict) to hold sensitive information in the form of e.g.  strings (e.g.
plaintext passwords) but that's probably a bad thing to do anyway.  I
think Jim was open to this.

***

One more thing I did (actually the previous night in my hotel room): I
added real object links to the primitive query facility in the text
index control page.  (The other team, which is implementing the real
query facility, which will be a very general design by SteveA, which
should be at least as general as the current ZCatalog in Zope 2, isn't
ready yet.  They promised it real soon now.  I guess the power outage
screwed up their schedule.)  Anyway, my query implementation adds
three helper methods to the IUITextIndex interface:

    hubid2location(hubid) -> location string
    hubid2object(hubid) -> object
    hubid2title(hubid) -> dublin core title string

Using these I can build a simple results table that include the object
title, location, and alive link to that location (using a surely
simplifying assumption about how to turn links into URLs).  Of course,
this requires more Python expressions, and I should repeat that I am
totally aware that this is the wrong approach: I *should* have defined
a separate view implemented as a Python class that returns the query
results in a more palatable form.  Maybe I'll get to that on a
subsequent iteration, or somebody else will.  I also believe I could
do batching from ZPT using Python expressions, but I'll refrain from
attempting that, especially since we don't have a bulk upload
mechanism, and I've never had more than three searchable objects in
the system :-.  Christian Z mentioned that there's a tool (called
zclient?) that sends HTTP requests to a Zope server to do these kinds
of things.  Note that adding objects directly to the ZODB would be
wrong, because that would bypass all the Zope/App machinery for
events, locations, etc.

--Guido van Rossum (home page: http://www.python.org/~guido/)