[Zope3-dev] Sprintathon diary

Thu, 05 Dec 2002 15:54:55 -0500

Here are my notes from Tuesday through Thursday (focusing mostly on
Tue-Wed).  Enjoy!

Rotterdam Sprintathon Diary
---------------------------

[Tuesday Morning]

I'm part of the searching team.  Others: Steve Alexander (leader),
Christian Theune, Christian Zagrodnick, Ben Saller.  See also:

  http://dev.zope.org/Wikis/DevSite/Projects/ComponentArchitecture/CMSprintSearching

We sketch three interfaces that apply directly to the text index:
IQuerying, IRanking, and IInjection.  IQuerying defines query() which
takes a query string (more structured things are considered YAGNI for
now) and returns an unranked result set.  IRanking defines rank()
which takes a result set and "batching metadata" and returns the
requested batch, plus a total count.  IInjection is to be used by the
object hub when objects get registered, unregistered, or modified.

It seems that everybody is happy with most of the design decisions of
ZCTextIndex.

[Tuesday afternoon]

I discover a flaw in my explanation of ZCTextIndex: it doesn't
separate the querying from the ranking!  The query API returns a
mapping from docid to score (as a scaled int), ordered by docid.  What
we have to do instead is compute the N best items and scale the
scores.  It turns out that mhindex.py contains a TextIndex class that
serves as a perfect example.  We decide to nuke IRanking and change
IQuery's query() to take the batching info (two arguments start and
count) and return the batch as a list of (docid, score) tuples and a
total count, the score being scaled.  (The total is for reporting in
the style of "here are hits 1 through 10 out of 27".)

There was some confusion about the interfaces: first it appeared that
Steve wanted some of the building blocks to return stuff that was an
input parameter, such as the batching metadata and the original
query.  Later he took that back.  Also, the batching metadata was
simplified to two arguments (start, count).

Porting ZCTextIndex was pretty straightforward.  I killed the Zope2
ZMI specific code, and had to change all uses of the BTrees package
into Persistence.BTrees.  Later [Wednesday afternoon], I added
methods to get the number of documents and words (the latter replacing
the -- as Tim Peters pointed out -- poorly named length()).  I left
out the C extensions for now -- I still believe that stop word removal
is a bad idea, so there's no point in optimizing it, and for now the
Python version of the Okapi inner loop is fine.  Adding machinery for
building C extensions doesn't seem worth it now.

The TextIndex() class from mhindex.py was promoted, with small
modifications, to a "wrapper" class in the main TextIndex package.
The modifications were to support batching more directly, by giving
query() a start and count argument, and scaling the results to a float
in the range [0.0, 1.1].  This can't be done as a subclass, because
the actual indexing class may be configurable: OkapiIndex or
CosineIndex.  This let me clean up the public API quite a bit.

[Tuesday after dinner]

One unit test caused problems: it was testing locale-awareness by
setting a specific locale that apparently doesn't work everywhere, and
there was a complaint on the Zope3-dev list after dinner.  I guess
Zope 3 has a much more varied user base than Zope 2; nobody ever
complained about this for Zope 2, even though it's the same test
there.  I had fixed this for Mac OSX with a platform check, but a
better fix is to simply skip the test (silently) when the setlocale()
call fails.

Christian Zagrodnick pointed out a bug in the NBest calculation
during the writing of the unit tests (which I initially papered over
by changing the test): when two documents have exactly the same score
(as they did in the first version of the unit test) a batch of size 2
returns them in reverse order relative to the input, while a batch of
size 1 returns the first one.  That breaks my batching algorithm.
Fortunately I found out that this can be fixed by using bisect_left
instead of bisect_right!  I had been speculating that it could be
fixed by changing one comparison from e.g. < to <= or vice versa, and
that's pretty much what this does.  Now docids with equal score are
always returned in the original sequence order, independent of batch
size.

Lesson learned: add more unit tests.  We'll learn that over and over.

To make sure it really worked, I banged on mhindex.py until I could
index my inbox and query against it.  This was pretty straightforward
-- the only thing that took time was to figure out where
get_transaction() was.  It must now be imported from Transaction.

[Wednesday morning]

Overnight, I had figured out that there was still a bug in the Unicode
support of the splitter pipeline, which could be fixed by using (?u)
instead of (?L) -- i.e. re.UNICODE instead of re.LOCALE.  Christian Z
helped me produce a nice Unicode unit test using \N{...}.

Next we made the start and count arguments to query() default to 0 and
None, meaning to return all results.  (Hmm, I should fix some unit
tests which currently pass 0 and 10.  [Done on Thursday])

Next, we started creating the Zope/App code to work with the text
index.  After checking it in as Zope/App/Indexes/TextIndex.py, we
found that the other team on our group (who are doing the yellow boxes
for SteveA -- I still don't understand that part) had planned a
TextIndex subdirectory there.  After a while Jim suggested that we try
out the new naming conventions.  We changed to
Zope/App/index/text/index.py, and later put every else in the same
directory (even browser view stuff).  The double use of "index" is a
bit weird (I now think the first one should have been indexing), but I
love having all the parts in one directory.  All interfaces are in
interfaces.py.

The first thing we did was the notification interface.  There's an
extremely generic ISubscriber interface, which defines one method that
you must implement to be a subscriber, notify(event).  We implement
this in a class TextIndex, which extends TextIndexWrapper.  The
notify() call must switch on the type of the event by using
e.g. IObjectRegisteredHubEvent.isImplementedBy(event).  Once you know
the event type you can access the attributes (hubid, object) defined
by that interface.  To get the text out of the object, you must
*adapt* it to ISearchableText, and then call the adapter's
getSearchableText() method.  The adaptation may not work.  To deal
with this, there are two APIs with the same arguments (a common Zope 3
rhythm): when there's no match, getAdapter() raises an exception
(which one?) while queryAdapter() returns None.  (I guess Python's
dict.get() is an exception, and getattr() is half an exception. :-)
But in order to use either of these, you need a traversal context,
which is done by declaring the method notify() to be a ContextMethod.
Unfortunately the convention is then also to use wrapped_self instead
of self, which causes endless bugs (I keep typing self) and doesn't
seem to serve much of a purpose AFAICT.  I may undo this (the
wrapped_self naming convention).  ContextMethods are a bit like const
in C++: if you call a ContextMethod, you must be one yourself, so they
multiply like rabbits.

Testing notify() is a bit of a letdown: you create an event object of
a specific class (there are implementations matching all the diverse
interfaces -- what a waste) and then call the notify() method,
checking indirectly that the right thing (indexing, reindexing, or
unindexing) happens.  No ObjectHub is used in the test.

Tip: create a TAGS file with etags -r or ctags -r (I believe);
[ec]tags has Python support these days.  This makes finding the
various classes, functions and interfaces much easier.  (I used
Tools/scripts/eptags.py, which works too.)

I asked Steve if he could show me how to write a similar test that
involves a real ObjectHub, subscription, etc.  He was almost offended,
because in his eyes this is a functional test.  But he showed me how
anyway. :-)

First of all, the Test class must inherit from PlacefulSetup, and
setup() must call PlacefulSetup.setUp(self).  This is so that adapter
registration etc. works, apparently (it wasn't explained very well).
PlacefulSetup is a mixin, so you must also still inherit from
TestCase.  (Why???  Sounds like a YAGNI to me.  Of course there are no
docstrings.)

The hardest part is faking out the traversal mechanism (which I did
*not* want to set up for real).  I ended up creating a fake traverser,
implementing ITraverser, which has a travers() method that calls
locationAsUnicode() on its path argument, and then returns a
pre-determined object.  Otherwise it must raise KeyError.

This all is needed, BTW, because the ObjectHub implements a two-way
mapping between hub ids (ints that it makes up) and locations (in
canonical form, unicode strings of the form u"/path/to/an/object").
You can't play with an object hub without working traversal.

To establish our fake traverser (which effectively implements a tiny
fixed namespace, consisting of the path u"/bruce" and a single object
at that location), we call provideAdapter(None, ITraverser, factory)
where factory is a one-argument lambda that returns our traverser
instance.  (Long live nested scopes.)  None represents the source
interface -- I guess this means it can adapt anything.  The factory
argument is probably the object to be adapted -- our fake traverser
doesn't need it.

This is mostly needed because the hub's register() call takes only a
location -- it uses traversal to get the object (lazily, no less).
The register() method returns the hubid.  There's also an unregister()
method, also taking a location.  To generate a "modified" event, we
must create an event -- not an ObjectModifiedHubEvent, but an
ObjectModifiedEvent -- and pass it to the hub's notify() method.  The
hub then broadcasts it to all its subscribers (for such events).

The test works beautifully.

Next, we added subscribe() and unsubscribe() methods to our TextIndex
object.  The idea is that these (un)subscribe the TextIndex to the
object hub, or to some other event channel (the object hub is an event
channel) given by a location.  SteveA has a plan for an event channel
that is in series with the object hub, to decide which objects should
be indexed.  But since that's not ready yet I decided to punt on the
location for now.  Getting "the" object hub is hard enough: it's
registered as a service, so you must get it by calling
getService(wrapped_seolf, "ObjectHub"), so that means more
ContextMethods.  For unit testing, I decided not to try and set up a
service registration, but to simply have an optional channel argument
to [un]subscribe() that is passed by the test.  Of course, there waas
a bug in the code that didn't get tested this way: the getService()
call was checked in with "ObjectHubService" as its second argument,
which was how SteveA had remembered it.  Easily fixed once we started
"live testing".

So now we wanted an actual honest-to-God TTW user inteface for
interacting with the text index!  Christian Z showed me a trick,
tal:omit-tag="" that I don't remember; it may be new or I'm getting
old.

Once Jim shows how to do it in ZCML, creating a TextIndex from the
"Add" menu in the Service management area is easy enough.  Read
Zope/App/index/text/configure.zcml.  We treat the TextIndex class as
content.  It's convenient that the constructor takes no arguments;
then the <content> directive and the <factory> directive therein do
most of the needed configuration.  Apparently the factory id must be
globally unique, in a namespace reserved for factories.  The
convention is to use your full package name with something inside it,
in our case "Zope.App.index.text.factory".  The other part is the
<browser:menuItem> directive.  Its menu="add_component" attribute
tells us that this goes in the service creation menu.  Its action
attribute can either be something (I forget what -- probably a
location) that is called to do the creation, or a factory id.  Since
we need no arguments for the creation, we use our factory id.
Unfortunately, the factory id is also used as the default name if we
don't specify a name in the Add menu form.  The title and description
attributes show up in the menu; the "for" attribute must always be
for="Zope.App.OFS.Container.IAdding." (yes, trailing dot).  I heard
Jim say later that maybe there should be a special directive,
e.g. <browser:addMenuItem> (?).

Two more bits of ZCML are <browser:defaultView> which tells us the
name of the default view for a specific interface (currently named
IStatistics, but it should really be renamed), and <browser:view>,
which defines a specific view -- whose name must match the default
view name because that's what we're defining.

There are permission attributes on several of these: <browser:view>
and <factory>, and also the <content> directive has a <require>
directive which specifies the permission needed for a specific
interface (meaning all its methods) and a set of attributes (used
later to add a simple query interface here, for debugging purposes).
I believe now that this <require> directive is the main reason why we
needed to define IStatistics.

The page templates uses this:

  <span tal:replace="context/documentCount" />

to display the number of indexed documents, and similar for the number
of words.  This calls the documentCount() method, defined on the
TextIndex object and in the IStatistics interface mentioned in the
<require> ZCML directive -- otherwise you couldn't call it.

Jim had to fix a bug introduced by one of the other sprint groups; it
was very confusing, because the error message complained about a
missing "index" attribute, which was a word used in several places by
our ZCML (and as the parent package name).  But when we unhooked our
ZCML, it still happened (when rendering the service Add menu) so then
Jim realized he could reproduce and fix it on his own laptop.  Which
he did.

[After dinner]

Initially, we had the browser specific parts (a page template and a
ZCML file) in a browser subdirectory, but Jim preferred to get rid of
that, and I agree.  It's much easier to have everything together in
one directory.  After dinner I fixed this, also renaming the page
template and view from stats.html to control.html, because of the next
task.

Now was the time to hook up the [un]subscribe operations to the web
form.  We needed to add a method to access the subscription state to
IStatistics, so it could be displayed.   CZ showed a nice trick (or
did we get it from peeking at the LocalEventService?).  The form
points back to itself (action=".").  Actions are executed by the page
template when rendering the next incarnation of the form: using
<span tal:condition="request/callSubscribe|nothing"> it tests if the
callSubscribe variable is set it the request, by the Submit button in
the form:

  <input type="submit" value="Subscribe" name="callSubscribe" />

Inside the <span>, the subscribe method is called (with no arguments)
by using it in an expression which is assigned to a dummy variable:

  <span tal:define="dummy context/subscribe" />

Note that in the template, request/... refers to the request
variables, and context/... refers to methods of the context -- in our
case, the textindex object.  These calls pass a wrapped self, of
course.  Instance variables of the context can also be referenced, if
you have permission -- I think callable() is still used to decide
whether to call or not, although TAL may also have another way
(something involving "nocall" ???).

Getting this to work was hard work, but satisfying.  There are many
mysterious things you have to do in the TTW UI, assuming you start
with an empty Data.fs:

- Go to http://localhost:8080/manage -- log in as a user with Manager
  permission (I never had to do the latter because Mozilla remembers)

- (Why are the hyperlinks in the form not underlined?  Bah!  [Fixed
  this Thursday.])

- Click "Allow Services"; it changes into "Services" [As of Thursday,
  this is now called "Make Local Service Manager"; it changes into
  "Local Service Manager"]

- Click "Services" ["Local Services Manager"]; this goes to the
  services config page

- Click "default package"; this goes to a part of the services
  namespace called the default package (the purpose of packages is
  still vague; you can also get there by clicking on Packages and then
  on 'default')

- There's at least one item here, 'configure', which will become
  important later; for now, ignore these

- Click on the "Add..." link; this takes you to an Add menu.  The Text
  Index is shown as one of the items (the last one in my case)

- Select the Text Index radio button, and type a name
  (e.g. "textindex") in the input box, and click the "Add" button

- You are now back in the default package display; a new item has
  appeared, which is the textindex you just added.

- Now we have to create and configure two more services: a local event
  service and an object hub.  To create these:

- Click on Add...

- Select the Event Service radio button, and click the Add button

- Click on Add...

- Select the ObjectHub radio button, and click the Add button

- Now click on the 'configure' item.  This shows a row of buttons

- Click on the Add button; this takes you to a misformatted Add menu

- select the Service radio button (the only one that's there) and
  click the Add button below it

- You're now on a big form.  Select Events from the Service Type
  dropdown menu; you can leave the rest blank.  Click on the Next
  button.

- You're on a similar form, with three new radio buttons.  Select
  Active, and click Submit Query.

- The Event Service is now configured.  You're back on the page with
  the horizontal row of buttons, but a box showing the Events service
  is dispayed above it

- Click Add again, and go through the whole rigmarole again, this
  time selecting ObjectHub as the service type.  At the end, you're
  back at the button row, and now both the Events service and the
  ObjectHub service are displayed above it

- You can now click on 'default' in the breadcrumbs at the top; you'll
  see that there's an Events service, an ObjectHub service, a
  'configure' thingie, and a 'textindex' item.  Click on the latter.

- This shows the TextIndex control page.  It shows the number of
  documents and words, subscription status with a "Subscribe" button
  next to it, and a text box with a "Query" button next to it.  The
  subscription status can be toggled between OFF (the initial state)
  and ON; when ON, the button changes to "Unsubscribe".

- There are probably no documents at this point.  You can still verify
  that the Query button does something: type a query, click the Query
  button, and see that there are 0 results.  A query syntax error
  (e.g. unbalanced parentheses) causes a traceback.  I suppose I could
  catch these with something like "tal:on-error" -- maybe CZ can help.
  I should probably rename the button to Test Query to avoid
  disappointments.

The Query form is implemented similarly to the Subscribe button; it
uses a python expression to call the query() method with the queryText
request parameter.  request/queryText is also used for the initial
value of the text box, so you can edit your query.  The Subscribe
button is in a different form; unfortunately this means that when you
hit it, the previous query text is lost.  Maybe they should be one
form; but I want Subscribe to use POST while Query can use GET...

Interesting detail: when you edit a .py file, you must restart z3.py
(which takes forever on Mac OSX).  But when you edit a .pt (ZPT) file,
all it takes is reloading the page to get the new template.  Makes
experimental hacking with ZPT a joy, compared to fixing bugs in the
Python code...

Jim doesn't want us to use Python expressions in TAL, and SteveA was
also skeptical (though delighted to see it work).  A more flexible
approach would create a view object in Python whose methods are
invoked from the form.  I'll have to learn how to do this tomorrow.
[Didn't get to this.]  There was also a problem: the return value of a
query() with no results (the only kind I've tried so far, since no
events are generated yet) is the Python expression ([], 0) but this is
wrapped in security proxies, and somehow applying str() or repr() to
that proxy always gives us a string like <security proxy ...> .  So I
ended up displaying query(...)[1], which is a "rock" (an immutable
object that doesn't get wrapped).

I also added an adaptor from IReadFile to ISearchableText, which
extracts the data from the file (only for text/plain files).  But
right now, something's broken and you can't create files from the Add
menu.  This seems to be because of the Events or ObjectHub services;
SteveA is debugging ths but ran into a snag in the event channel
machinery when the building had to be closed at 11 pm.

[Thursday]

Too tired to report more for now.  Steve, Jim & I spent much of the
morning debugging the problem that Steve ran into last night.  In the
end, it appears that a few weeks ago, Steve had added code that an
object is wrapped to get its location; this was wrong since e.g. the
root isn't wrapped.

We did hook up all the indexing machinery, so you can now create
searchable documents and they will be indexed.  A searching UI is
still missing; the textindex management page has a test query box you
can use to prove that it works.  One necessary step that wasn't listed
above: after creating the object hub, in the default package, you need
to create a Registration component (the very simple one that Steve
implemented has the policy that all objects are registered), and
toggle its subscription status to ON in its contgrol menu.  Like the
text index, it's not a service so doesn't need to be configured.

Had an hourlong session with Alexander Limi about improving the UI for
service management.  Conclusion is that the current approach is so bad
(from a usability perspective) that we need to improve it before he
can even start helping with the UI work.  One idea is to create a
special action on the Service Manager page to create, configure and
activate "essential services" (events, object hub and what else).

Successful demo Thursday night.

--Guido van Rossum (home page: http://www.python.org/~guido/)