[Zope-CVS] CVS: Products/Ape/doc - outline.txt:1.1 tutorial_slides.sxi:1.1

Thu, 27 Mar 2003 09:28:32 -0500

Update of /cvs-repository/Products/Ape/doc
In directory cvs.zope.org:/tmp/cvs-serv20005

Added Files:
	outline.txt tutorial_slides.sxi 
Log Message:
Added PyCon outline and slides

=== Added File Products/Ape/doc/outline.txt ===

ApeLib Tutorial Outline

I. Purpose of ApeLib

  A. Differences Between Object-Oriented and Relational Databases

    The differences between relational databases and object-oriented
    databases lie in their flexibility. To store data in an RDBMS, you
    must first define the complete structure of your data. For
    example, if you wanted to store phone numbers, you would first
    create a table.  Then in that table you would set up a few columns
    including "name" and "phone_number". You would then write a
    program that can interact with those specific columns.  If you
    later decide you also want to store people's email addresses, you
    have to add another column and change your program as well.
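
    The phone number scenario above can be sketched with Python's
    standard sqlite3 module.  The table name phone_book and the sample
    data are assumptions made for illustration only.

```python
import sqlite3

# In-memory database so the sketch leaves no files behind.
conn = sqlite3.connect(":memory:")

# Step 1: the structure must exist before any data can be stored.
conn.execute("CREATE TABLE phone_book (name TEXT, phone_number TEXT)")
conn.execute(
    "INSERT INTO phone_book (name, phone_number) VALUES (?, ?)",
    ("Alice", "555-0100"),
)

# Step 2: a new requirement (email addresses) means changing the table...
conn.execute("ALTER TABLE phone_book ADD COLUMN email TEXT")

# ...and changing the statements in the program that touch it.
conn.execute(
    "UPDATE phone_book SET email = ? WHERE name = ?",
    ("alice@example.com", "Alice"),
)
rows = conn.execute(
    "SELECT name, phone_number, email FROM phone_book"
).fetchall()
conn.close()
```

    Both the schema and the program had to change to accommodate the
    new field, which is the cost the paragraph above describes.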

    Storing data in an OODBMS does not require defining the structure
    ahead of time. You only have to write your program, then connect
    your program to the OODBMS with a few instructions, and you're
    finished. The OODBMS takes advantage of the structures you use
    naturally when creating your program, and it simply stores the
    structures. It is often faster and easier to write a program for
    an OODBMS than for an RDBMS.

    However, RDBMSs are very popular.  Major vendors like Oracle,
    Sybase, IBM, Borland, and others, all sell RDBMS software.
    Computer science courses in practically every university teach
    development and administration of RDBMS-based software. RDBMSs
    have certain advantages derived from their mathematical
    foundations, such as the ability to search for data based on
    previously unanticipated criteria.  Also, years of competition in
    the RDBMS market have led to refinements in reliability and
    scalability.

  B. ZODB

    One of the great strengths of Zope, a Python web application
    server, is its database technology called ZODB.  ZODB is a Python
    object-oriented database.  Software development using ZODB is fast
    and easy.  When you write software based on ZODB, you can
    generally pretend that your program never stops, never crashes,
    and never has to write anything to disk.  ZODB takes care of the
    remaining details.

    However, there are many good reasons to use a relational database
    instead of ZODB.  People are already familiar with relational
    databases.  ZODB is only accessible through the Python programming
    language, while relational databases are more language-neutral.
    Relational databases can more easily adapt to unexpected
    requirements.  And because they have been around longer,
    relational databases can often hold more data, read and write data
    faster, and maintain full-time operation better than ZODB
    storages.

  C. Bridging the Gap

    For a long time, people have requested better relational
    integration in Zope.  Zope has limited relational integration: you
    can open connections to an RDBMS and store and retrieve data,
    including objects.  But objects from the RDBMS never reach
    "first-class citizenship" in Zope. Zope does not allow you to
    manipulate these objects as easily as you can work with objects
    stored in ZODB.

    There are backends for ZODB that let you store pickled objects in
    relational databases. This solution satisfies those who need to
    store large amounts of data, but the data is stored in a special
    Python-only format. It prevents developers from taking full
    advantage of relational data storage and locks out other
    programming languages.

    ApeLib bridges the gap between ZODB and relational data storage.
    It lets developers store ZODB objects in arbitrary databases and
    arbitrary formats, without changing application code.  It combines
    the advantages of orthogonal persistence with relational storage.

  D. Current Limitations

    To facilitate distribution, ApeLib is currently a Zope product.  This
    makes it difficult to reuse outside Zope.  But work is underway to
    separate it from Zope, starting with the creation of a top-level
    Python package called apelib.

II. Components

  A portion of Martin Fowler's book "Patterns of Enterprise
  Application Architecture" describes patterns used in mapping objects
  to relational databases.  A lot of the names used in ApeLib come
  from the book.

  There are many kinds of components in ApeLib, but to store new kinds
  of objects or store in new formats, you generally only need to write
  components that implement one of two interfaces: ISerializer and
  IGateway.  This tutorial focuses on these two kinds of components.

  A. Mappers

    ApeLib uses a tree of mappers to map objects to databases.
    Mappers are components that implement a simple interface.  Mappers
    serialize, deserialize, store, load, classify, and identify
    objects.  Mappers and their associated components are reusable for
    many applications needing to store and load objects, but the
    framework is especially designed for mapping persistent object
    systems like ZODB.

    Most mappers are responsible for loading and storing instances of
    one class.  Mappers separate serialization from storage, making it
    possible to reuse serializers with many storage backends.  A
    mapper supplies a serializer, which extracts and installs object
    state, and a gateway, which stores and retrieves state in
    persistent storage.

  B. Basic Sequence

    To load an object, ApeLib requests that the composite gateway of a
    specific mapper load data.  Composite gateways delegate the
    request to multiple specific gateways.  The specific gateways each
    query the database and return a result.  The composite gateway
    combines the results into a dictionary that maps gateway names to
    the results from the data store.

    Then ApeLib feeds that dictionary to the composite serializer of
    the same mapper.  The composite serializer delegates the work of
    deserialization to multiple serializers.  The serializers install
    the loaded data into the object being deserialized.  Finally,
    control returns to the application.

    When storing objects, the system uses the same components, but in
    reverse order.  The composite serializer reads the object and the
    results are fed to the composite gateway, which stores the data.
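
    The load and store sequence described above might look roughly
    like the following sketch.  The class names, constructor
    arguments, and method signatures are hypothetical and are not
    ApeLib's actual API; hashes and events are left out to keep the
    flow visible.

```python
class DictGateway:
    """Hypothetical gateway: keeps one piece of state per key in a dict."""

    def __init__(self):
        self.records = {}

    def load(self, key):
        return self.records[key]

    def store(self, key, state):
        self.records[key] = state


class CompositeGateway:
    """Delegates to named gateways and combines their results."""

    def __init__(self, gateways):
        self.gateways = gateways  # maps gateway name -> gateway

    def load(self, key):
        return {name: gw.load(key) for name, gw in self.gateways.items()}

    def store(self, key, full_state):
        for name, gw in self.gateways.items():
            gw.store(key, full_state[name])


class AttributeSerializer:
    """Hypothetical serializer that handles a single attribute."""

    def __init__(self, attr):
        self.attr = attr

    def serialize(self, obj):
        return getattr(obj, self.attr)

    def deserialize(self, obj, state):
        setattr(obj, self.attr, state)


class CompositeSerializer:
    """Delegates to named serializers, mirroring CompositeGateway."""

    def __init__(self, serializers):
        self.serializers = serializers  # maps name -> serializer

    def serialize(self, obj):
        return {name: s.serialize(obj) for name, s in self.serializers.items()}

    def deserialize(self, obj, full_state):
        for name, s in self.serializers.items():
            s.deserialize(obj, full_state[name])


class Person:
    pass


serializer = CompositeSerializer({
    "name": AttributeSerializer("name"),
    "phone": AttributeSerializer("phone_number"),
})
gateway = CompositeGateway({"name": DictGateway(), "phone": DictGateway()})

# Storing: serializer first, then gateway.
original = Person()
original.name = "Alice"
original.phone_number = "555-0100"
gateway.store("person-1", serializer.serialize(original))

# Loading: gateway first, then serializer.
loaded = Person()
full_state = gateway.load("person-1")
serializer.deserialize(loaded, full_state)
```

    Because the serializer and gateway only share the dictionary
    keyed by name, either side can be swapped without touching the
    other.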

    ZODB is the key to loading and storing objects at the right time.
    The Persistent base class arranges for a separate data manager
    object to load the state of an object only when it is needed.  The
    Persistent base class also notifies the data manager when an
    attribute of a managed object changes.

  C. Schemas

    Schemas define the format of the data passed between serializers
    and gateways.  ApeLib defines three basic schema classes and
    allows you to use other kinds of schemas.

    A FieldSchema declares that the data passed is a single field,
    such as a string or integer.  FieldSchema is appropriate when
    serializing data of a simple type.  When using a FieldSchema, the
    state passed between serializers and gateways is the raw data.

    A RowSchema declares a list of fields.  RowSchema is appropriate
    when serializing multiple fields.  When using a RowSchema, the
    state passed between serializers and gateways is a tuple of
    values.

    A RowSequenceSchema declares a list of rows of fields.
    RowSequenceSchema is appropriate when serializing multiple rows of
    fields at once.  When using a RowSequenceSchema, the state passed
    between serializers and gateways is a sequence of tuples.

    The only requirement ApeLib makes of schemas is that they
    implement the Python equality operation (__eq__), allowing the
    system to verify that serializers and gateways are compatible.
    You can use many kinds of Python objects as schemas.
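
    As a minimal sketch of the equality requirement, the three schema
    classes could be modeled as below.  The class names come from the
    text above, but the constructor arguments and the
    check_compatible() helper are assumptions, not ApeLib's real
    signatures.

```python
class FieldSchema:
    """A single named field; the state is the raw value."""

    def __init__(self, name, field_type="string"):
        self.name = name
        self.field_type = field_type

    def __eq__(self, other):
        return (
            type(other) is type(self)
            and (self.name, self.field_type) == (other.name, other.field_type)
        )


class RowSchema:
    """A list of fields; the state is a tuple of values."""

    def __init__(self, fields=()):
        self.fields = list(fields)

    def __eq__(self, other):
        # Comparing exact types keeps a row distinct from a row sequence.
        return type(other) is type(self) and self.fields == other.fields


class RowSequenceSchema(RowSchema):
    """The same field list, but the state is a sequence of tuples."""


def check_compatible(serializer_schema, gateway_schema):
    """Hypothetical helper: refuse to pair mismatched components."""
    if serializer_schema != gateway_schema:
        raise ValueError("serializer and gateway schemas differ")


row_a = RowSchema([FieldSchema("name"), FieldSchema("phone_number")])
row_b = RowSchema([FieldSchema("name"), FieldSchema("phone_number")])
rows = RowSequenceSchema([FieldSchema("name"), FieldSchema("phone_number")])

check_compatible(row_a, row_b)  # passes silently
try:
    check_compatible(row_a, rows)
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```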

  D. Gateways

    Gateways load and store serialized state.  The gateways you create
    can store data anywhere and in any format, as long as you obey a
    few simple rules.

    The state returned by the gateway's load() method must conform to
    the schema declared by the gateway.  Conversely, the gateway can
    expect the state passed to the store() method to conform to that
    same schema.

    The gateway must generate a hash of the stored state, allowing the
    system to detect transaction conflicts.  The hash is returned by
    both the load() and store() methods.  Hashes don't need to be
    integers, but it must be possible to convert hashes to integers
    using Python's hash() function.
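
    A gateway that follows these rules might be sketched as follows.
    The text states only that load() and store() return a hash; the
    key argument, the (state, hash) return shape of load(), and the
    use of the state itself as the hash are assumptions made for this
    example.

```python
class InMemoryFieldGateway:
    """Hypothetical gateway that stores one string field per key."""

    # Any object that supports == can act as a schema (see Schemas).
    schema = "string-field"

    def __init__(self):
        self._data = {}

    def load(self, key):
        state = self._data[key]
        return state, self._make_hash(state)

    def store(self, key, state):
        self._data[key] = state
        return self._make_hash(state)

    @staticmethod
    def _make_hash(state):
        # A string is not an integer, but hash() can convert it to one,
        # so the state can serve as its own hash here.
        return state


gateway = InMemoryFieldGateway()
gateway.store("doc-1", "first draft")

# Transaction A loads the record and remembers the hash it saw.
state_seen_by_a, hash_seen_by_a = gateway.load("doc-1")

# Meanwhile, transaction B overwrites the record.
gateway.store("doc-1", "second draft")

# Before committing, A can detect that the record changed underneath it.
_, current_hash = gateway.load("doc-1")
conflict = current_hash != hash_seen_by_a
```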

  E. Serializers

    Serializers do the work of both pulling data out of an object and
    pushing data into it.  The serialize() method reads the internal
    state of an object without changing the object.  The deserialize()
    method installs state into an object.

    Proper serialization must answer certain questions.  To answer
    these questions, serializers receive event objects as arguments to
    the serialize() and deserialize() methods.  By interacting with
    the events, the serializer affects the serialization and
    deserialization processes to get the proper behavior.

    1. What if the serializer forgets to store an attribute?

      To avoid forgetting attributes, serializers indicate to the
      serialization event which attributes and subobjects they
      serialized by calling the notifySerialized() or
      ignoreAttribute() method.  (The difference between the two
      methods will be explained in a moment.)  At the end of
      serialization, a final serializer may look for any remaining
      attributes.  If there are any attributes left over, the final
      serializer may choose to either put the rest of the attributes
      in a pickle or raise an exception indicating which attributes
      were forgotten.
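
      The bookkeeping described above can be sketched with a toy
      event object.  The method names notifySerialized() and
      ignoreAttribute() come from the text; their arguments, the
      remaining() helper, and the serializer classes are hypothetical.

```python
class SerializationEvent:
    """Toy event that tracks which attributes have been handled."""

    def __init__(self, obj):
        self.obj = obj
        self.serialized = {}   # name -> value, for later reference fixing
        self._handled = set()

    def notifySerialized(self, name, value):
        self.serialized[name] = value
        self._handled.add(name)

    def ignoreAttribute(self, name):
        self._handled.add(name)

    def remaining(self):
        """Hypothetical helper: attribute names nobody has claimed."""
        return sorted(set(vars(self.obj)) - self._handled)


class NameSerializer:
    def serialize(self, obj, event):
        event.notifySerialized("name", obj.name)
        return obj.name


class RollCallSerializer:
    """Final serializer that refuses to lose attributes."""

    def serialize(self, obj, event):
        leftover = event.remaining()
        if leftover:
            raise ValueError(
                "attributes not serialized: %s" % ", ".join(leftover)
            )
        return None


class Person:
    pass


person = Person()
person.name = "Alice"
person.email = "alice@example.com"  # no serializer handles this one

event = SerializationEvent(person)
NameSerializer().serialize(person, event)
try:
    RollCallSerializer().serialize(person, event)
    forgotten = None
except ValueError as exc:
    forgotten = str(exc)
```

      A remainder-style final serializer would pickle the names
      returned by remaining() instead of raising.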

    2. What if two attributes refer to the same subobject under
    different attribute names?  In general, what if an object refers
    to a subobject in more than one way?

      Referring to a subobject in more ways than one is usually not a
      problem.  If one serializer serializes both references, that
      serializer can deal with the issue in its own way.  The more
      interesting problem is that a serializer may serialize only one
      of the references, leaving the other to be serialized by the
      remainder pickle.  If you're not careful, the remainder pickle
      could generate a second copy of the subobject upon
      deserialization.

      To deal with this, serializers call the notifySerialized() method
      rather than the ignoreAttribute() method.  The
      notifySerialized() method provides the information needed by the
      final serializer to restore references to the correct subobject.
      For this to work, serializers also need to call
      notifyDeserialized() in their deserialize() method, so that the
      unpickler knows exactly what subobject to refer to.
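
      One way a remainder pickle can avoid duplicating a subobject
      is the persistent ID hook of Python's standard pickle module.
      The sketch below assumes this approach for illustration; the
      text does not say how ApeLib implements it, and the helper
      function names are hypothetical.

```python
import io
import pickle


class Address:
    def __init__(self, city):
        self.city = city


def pickle_remainder(remainder, already_serialized):
    """Pickle leftover attributes, replacing known subobjects by name.

    already_serialized maps names to the objects that other
    serializers reported (the role notifySerialized() plays above).
    """
    names_by_id = {id(value): name for name, value in already_serialized.items()}

    class RemainderPickler(pickle.Pickler):
        def persistent_id(self, obj):
            # Returning None tells pickle to serialize obj normally.
            return names_by_id.get(id(obj))

    buf = io.BytesIO()
    RemainderPickler(buf).dump(remainder)
    return buf.getvalue()


def unpickle_remainder(data, already_deserialized):
    """Restore leftovers, pointing references at the restored subobjects."""

    class RemainderUnpickler(pickle.Unpickler):
        def persistent_load(self, pid):
            return already_deserialized[pid]

    return RemainderUnpickler(io.BytesIO(data)).load()


# The object refers to one Address under two attribute names.
home = Address("Springfield")
remainder = {"billing": home}          # left for the remainder pickle
data = pickle_remainder(remainder, {"home": home})

# On load, a dedicated serializer restores "home" first and reports it
# (the role notifyDeserialized() plays above).
restored_home = Address("Springfield")
restored = unpickle_remainder(data, {"home": restored_home})
```

      After unpickling, the "billing" entry is the very same object
      as the restored "home" subobject rather than a second copy.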

    3. Is it possible to avoid loading the whole database into RAM
    when deserializing?  Conversely, after making a change, is it
    possible to serialize the state of only the part of the object
    system that has changed?

      Working with only a part of the object system is one of the core
      features provided by ZODB.  ZODB assigns an object ID to each
      persistent object to match objects with database records.  When
      you load a persistent object, ZODB loads the full state of only
      the object you need, and when you change a persistent object,
      ZODB stores only the corresponding database record.

      During serialization, serializers use three methods of the
      serialization event to make references to other database
      records.  Serializers first call identifyObject() to find out if
      the subobject is already stored in the database.  If it isn't,
      the serializer should call makeKey() to generate an identity for
      the new subobject.  In either case, the serializer then calls
      notifySerializedRef() to tell the event that it is storing a
      reference to another database record.

      During deserialization, serializers can use the dereference()
      method of the deserialization event to refer to objects from
      other database records without loading the full state of the
      objects.  The returned subobject may be in a "ghosted" state,
      meaning that it temporarily has no attributes.  (When you
      attempt to access any attribute of a ghosted object, ZODB
      transparently loads the object before looking for the
      attribute.)
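
      The ghost behavior can be imitated in a few lines of plain
      Python.  This is only an illustration of the idea; it is not
      how ZODB's Persistent class is implemented, and the Ghost class
      and load_record() function are hypothetical.

```python
class Ghost:
    """Stand-in for an object whose state loads on first attribute access."""

    def __init__(self, key, loader):
        # Write to __dict__ directly so these names are found by normal
        # lookup and never trigger __getattr__.
        self.__dict__["_key"] = key
        self.__dict__["_loader"] = loader
        self.__dict__["_loaded"] = False

    def __getattr__(self, name):
        # Python calls __getattr__ only when normal lookup fails.
        if self._loaded:
            raise AttributeError(name)
        self.__dict__.update(self._loader(self._key))
        self.__dict__["_loaded"] = True
        return getattr(self, name)


loads = []


def load_record(key):
    """Pretend database read; records each call so loads can be counted."""
    loads.append(key)
    return {"title": "Record %s" % key}


ghost = Ghost(42, load_record)
loads_before_access = list(loads)   # nothing has been loaded yet

title = ghost.title                 # first access triggers the load
title_again = ghost.title           # second access uses the loaded state
```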

    4. What if the record boundaries set up by the serializer don't
    correspond directly with ZODB objects?

      ZODB makes an assumption that isn't always valid in ApeLib: ZODB
      assumes that objects that derive from the Persistent base class
      are database record boundaries.  In ApeLib, sometimes it makes
      sense to serialize several Persistent objects in a single
      database record.

      However, when you serialize more than one Persistent object in a
      single record, you create what are called "unmanaged" persistent
      objects or "UPOs".  If the serializer does not tell ApeLib about the
      UPOs, ZODB will not see changes made to them and transactions
      involving changes to those objects may be incomplete.  So during
      both serialization and deserialization, it is important for
      ZODB-aware serializers to call the event's
      addUnmanagedPersistentObjects() method.

    ApeLib provides some useful standard serializers:

      - The remainder serializer pickles and restores all the
      attributes not stored by other serializers.  This is useful for
      development and simplifies the tree of mappers.

      - The roll call serializer verifies that every attribute of an
      object was serialized.  If any are forgotten, it raises an
      exception.  This is useful when you don't want to use a
      remainder serializer, but you don't want to lose any attributes
      either.  The roll call serializer stores nothing, so it does not
      need to be paired with a gateway.

      - The optional serializer is a wrapper (decorator) around a real
      serializer.  The optional serializer asks the real serializer if
      it is able to serialize or deserialize an object (using the
      canSerialize() method).  If the test fails, the optional
      serializer ignores the failure and falls back to a default.

      - The "any class" serializer is a composite serializer which,
      unlike the standard "known class" composite serializer, can
      serialize and deserialize objects of any class.  During
      deserialization, it defers the creation of a class instance
      until the classification of the object is known.  The "any
      class" serializer incurs performance penalties, but it allows
      ApeLib to work with heterogeneous object systems like Zope.
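
      The optional serializer is a conventional wrapper, which might
      be sketched as below.  Only the canSerialize() name comes from
      the text; the other classes, the default_state argument, and
      the simplified signatures (no event objects) are assumptions.

```python
class Document:
    def __init__(self, title=""):
        self.title = title


class Folder:
    pass


class TitleSerializer:
    """Hypothetical real serializer that only understands Documents."""

    def canSerialize(self, obj):
        return isinstance(obj, Document)

    def serialize(self, obj):
        return obj.title

    def deserialize(self, obj, state):
        obj.title = state


class OptionalSerializer:
    """Wraps a real serializer and falls back to a default state."""

    def __init__(self, real, default_state=None):
        self.real = real
        self.default_state = default_state

    def serialize(self, obj):
        if self.real.canSerialize(obj):
            return self.real.serialize(obj)
        return self.default_state

    def deserialize(self, obj, state):
        # Objects the real serializer cannot handle are left untouched.
        if self.real.canSerialize(obj):
            self.real.deserialize(obj, state)


optional = OptionalSerializer(TitleSerializer(), default_state="")

document_state = optional.serialize(Document("Report"))
folder_state = optional.serialize(Folder())

folder = Folder()
optional.deserialize(folder, "ignored")   # silently does nothing
```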

    Serializers access the innards of objects, often breaking
    encapsulation because the serializers need to know exactly what
    attributes the objects use.  To avoid breaking encapsulation,
    objects might implement part of the serialization process
    themselves.

  F. Classifiers

    With all this talk of heterogeneous object systems, two important
    questions have not been answered yet.  How do you choose what kind
    of object to create when loading a database record?  And how do
    you choose what kind of database record to create when storing an
    object?  When working with relational databases, these are not
    usually difficult to answer, but in the world of MIME types,
    filename extensions, and peer-to-peer distribution, it's more
    difficult.  The logic for choosing mappers must be componentized.

    Classifiers are the components that choose which mapper to use for
    an object or database record.  Classifiers can be simple, always
    using a specific mapper for specific OIDs or storing the name of
    the mapper in the database.  Classifiers can also be complex,
    using attributes or metadata to make the choice of mapper.

    The root mapper holds the main classifier.  ApeLib consults the
    main classifier when loading and storing any object except the
    root object.  For Zope 2, the main classifier, a
    MetaTypeClassifier, is fairly complex, involving meta_types,
    filename extensions, and class names.  Fortunately, the
    MetaTypeClassifier is the only component that knows about
    meta_types and so forth, so other applications that use ApeLib do
    not need all that complexity.

    Classifiers also work with "classifications".  Classifications are
    dictionaries mapping strings to strings.  Classifications contain
    information that might be useful for choosing object and database
    record types.  Unlike the rest of the state of an object,
    classifications do not need to be precise.

    When loading an object, ApeLib calls the classifier's
    classifyState() method.  The classifier may choose to load
    information from the database to discover the type of database
    record.  (It usually does this using a gateway private to the
    classifier.)  classifyState() returns a classification and mapper
    name.

    When storing an object, ApeLib calls the classifier's
    classifyObject() method.  The classifier may choose to examine the
    object or it may know enough just by the keychain assigned to the
    object.  classifyObject() returns a classification and
    mapper name, but it should not store the generated classification
    yet.  ApeLib later calls the store() method of the classifier, at
    which point the classifier has the option of storing the
    classification.  (This separation exists so that serialization and
    data storage can theoretically occur on different machines, which
    ZEO does.)
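
    A simple classifier of the "store the mapper choice in the
    database" kind could look like the sketch below.  The method
    names classifyObject(), classifyState(), and store() come from
    the text; their arguments, the class name, and the dictionary
    standing in for the classifier's private gateway are
    hypothetical.

```python
class Document:
    pass


class StoredNameClassifier:
    """Chooses a mapper from the class name and remembers the choice."""

    def __init__(self, class_to_mapper):
        self.class_to_mapper = class_to_mapper
        self._records = {}  # stands in for a gateway private to the classifier

    def classifyObject(self, obj, keychain):
        # Decide, but do not write anything yet.
        class_name = type(obj).__name__
        classification = {"class_name": class_name}
        return classification, self.class_to_mapper[class_name]

    def store(self, keychain, classification):
        # Called later, possibly on a different machine.
        self._records[keychain] = dict(classification)

    def classifyState(self, keychain):
        classification = self._records[keychain]
        return classification, self.class_to_mapper[classification["class_name"]]


classifier = StoredNameClassifier({"Document": "document_mapper"})
keychain = ("/docs/report",)

classification, mapper_name = classifier.classifyObject(Document(), keychain)
records_before_store = dict(classifier._records)

classifier.store(keychain, classification)
loaded_classification, loaded_mapper_name = classifier.classifyState(keychain)
```

    Note that classifyObject() leaves the private records untouched;
    only store() writes, which keeps classification and storage
    separable.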

III. Example: Mapping Zope 2

  ApeLib provides two default Zope 2 mappers.  One maps to the
  filesystem and the other maps to a PostgreSQL database.  Because
  there is a lot in common between the two mappers, the createMapper()
  function in the basemapper module sets up the mappers and
  serializers, while two derivative functions set up the gateways.

  The PostgreSQL mapper uses the Psycopg module to connect to the
  database.  It uses integers as keys and puts information about
  each object in several tables.  All objects have an entry in the
  classification table.  The PostgreSQL mapper uses a simple schema,
  but ApeLib is not limited to this schema.

  The filesystem mapper stores data in a directory and its
  subdirectories.  It uses paths as keys and puts information about
  each object in up to three files.  The filesystem mapper both
  recognizes and generates filename extensions, but it can also work
  without filename extensions.

  Normally, ZODB caches objects indefinitely.  This leads to excellent
  performance, but prevents the object system from having the most
  current data all the time.  One workaround is to set the ZODB cache
  size to zero, forcing ZODB to clear its cache after every
  transaction.  But that solution eliminates the ZODB performance
  advantage, so ApeLib needs a better solution.  Nothing specific is
  planned yet.

  To extend the Zope 2 mappers with your own mappers, you can write a
  function that first calls the standard mapper factory and then
  adds to the generated mapper tree.

IV. Multiple domains

  Until now, this tutorial has assumed that given nothing more than an
  object or a database record, ApeLib can choose a mapper for that
  object.  That assumption is reasonable until you start using generic
  object types for many parts of an application, and you need to store
  the generic objects differently depending on what part of the
  application is using them.  For example, ZODB BTrees are reusable
  for many purposes, but storing both catalog indexes and user records
  in the same database tables would not be sensible.

  Also note that the acquisition wrappers and context wrappers
  normally available in Zope are not available when loading and
  storing objects.  ZODB works with bare objects, so no wrappers are
  available to discover the context of an object while loading and
  storing it.

  Therefore, ApeLib provides a different facility for preserving the
  context of objects and database records.  Instead of looking up a
  mapper by key, ApeLib uses a list of keys or "keychain" to visit a
  tree of mappers.  Specifically, to find the right mapper, ApeLib
  asks the classifier of the root mapper to choose a mapper, then it
  asks the classifier of the chosen mapper to choose a mapper, and so
  on, until it has followed each key in a keychain and arrived at the
  right mapper.
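
  The keychain walk described above can be sketched as a loop over
  a tree of mappers.  The Mapper class, its attributes, the
  find_mapper() function, and the use of a plain callable as a
  classifier are all hypothetical simplifications.

```python
class Mapper:
    """Toy mapper; domain mappers carry a classifier and sub-mappers."""

    def __init__(self, name, classifier=None, sub_mappers=None):
        self.name = name
        self.classifier = classifier          # callable: key -> sub-mapper name
        self.sub_mappers = sub_mappers or {}


def find_mapper(root, keychain):
    """Follow each key in the keychain from the root to the final mapper."""
    mapper = root
    for key in keychain:
        if mapper.classifier is None:
            raise LookupError("%s is not a domain mapper" % mapper.name)
        mapper = mapper.sub_mappers[mapper.classifier(key)]
    return mapper


# Two domains that both store BTrees, but in different ways.
catalog_domain = Mapper(
    "catalog_domain",
    classifier=lambda key: "btree",
    sub_mappers={"btree": Mapper("catalog_index_btree")},
)
user_domain = Mapper(
    "user_domain",
    classifier=lambda key: "btree",
    sub_mappers={"btree": Mapper("user_record_btree")},
)
root = Mapper(
    "root",
    classifier=lambda key: key,  # the first key names the domain directly
    sub_mappers={"catalog": catalog_domain, "users": user_domain},
)

catalog_choice = find_mapper(root, ("catalog", 17)).name
user_choice = find_mapper(root, ("users", 17)).name
```

  The same second key (17) leads to different mappers because the
  first key selected a different domain, which is the context that a
  single key cannot carry.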

  ApeLib calls mappers that link to other mappers "domain mappers".
  Not all mappers are domain mappers.  The root mapper is a domain
  mapper, but currently no other mappers in the Zope 2 example are
  domain mappers.

  Unlike simple mappers, domain mappers provide a classifier, a
  keychain generator, and sub-mappers.  Classifiers have been
  discussed before.  Keychain generators isolate the logic of
  generating keychains from serializers and gateways.  Serializers and
  gateways can generate their own keychains if they want, but
  serializers are more reusable when they remain independent of the
  contents of keys and keychains.

  Note that the tree of object mappers does not necessarily look like
  the tree of objects in an application.  Even though Zope stores a
  tree of objects like a filesystem, most of the mappers used in
  ApeLib's Zope 2 mapper are attached to the root mapper.  In Zope,
  most kinds of containers can contain most kinds of objects.  A tree
  of Zope object mappers could be confining, permitting certain
  objects to be stored in only certain kinds of containers.

  However, for some applications, containment constraints might be a
  major benefit.  Besides helping consistency, domain mappers
  encapsulate object mapping details in smaller, independent objects.
  Domain mappers minimize the possibility of collision with other
  parts of the application that want to map objects to the same
  database.  But avoid excessively long keychains, since ApeLib must
  examine each key in a keychain repeatedly.

  As an alternative to keychains with multiple keys, applications
  might instead set up a separate data manager for different parts of
  the application.  This strategy allows domain-specific caching
  strategies, but it also sacrifices some amount of database
  independence.

V. Ways to use the framework

  ZEO: ApeLib separates serialization from data storage, making it
  possible to perform serialization on a ZEO client while data storage
  happens in a ZEO server.  ZEO also adds the ability to keep a very
  large object cache.

  Zope 3: ApeLib is currently designed with Zope 2 in mind, but meant
  to be reusable for Zope 3.  A new set of mappers will be needed, but
  nearly all of the interfaces should remain unchanged.

  Non-Zope applications: ApeLib is a distinct library useful for many
  ZODB applications.  ApeLib makes it easier to map objects to any
  data store.

  Finally, the framework is useful for many purposes outside ZODB.
  Once you have built a system of mappers, you can use those mappers
  to import and export objects, synchronize with a data store, and
  apply version control to your objects.  The concepts behind ApeLib
  open exciting possibilities.

=== Added File Products/Ape/doc/tutorial_slides.sxi ===
  <Binary-ish file>