[Zope-DB] KeyError from Shared.DC.ZRDB.DA

FAIRS, Dan, GBM Dan.Fairs at rbos.com
Mon Sep 25 03:58:58 EDT 2006


Hi,

Under load, we're seeing KeyErrors from the _cached_result method of the DA
class in the code used to clear out cached results. We're using Python
2.3.5, Zope 2.8.5, mxODBCZopeDA-1.0.8 running on RHEL3 connecting to a
remote Microsoft SQL Server 2000 database.

Note this is a different issue to Collector #1745.

Here's the salient part of the traceback:

  File "/opt/intranet/cbfmnet/zope_home/Products/Membership/Member.py", line 118, in _getDBValue
    values = self._getDBValues(attributes=[attribute])
  File "/opt/intranet/cbfmnet/zope_home/Products/Membership/Member.py", line 96, in _getDBValues
    , userid=login_elements[1]
  File "/opt/intranet/cbfmnet/zope/lib/python/Shared/DC/ZRDB/DA.py", line 453, in __call__
    result=self._cached_result(DB__, (query, self.max_rows_))
  File "/opt/intranet/cbfmnet/zope/lib/python/Shared/DC/ZRDB/DA.py", line 375, in _cached_result
    if int(cache[q][0]) == key:
KeyError: ("execute dbo.EmployeeByNT 'DOMAIN', 'user'\n", 1000, '\nDBConnId: None')

The ZSQL Method EmployeeByNT looks like this (it's actually an FSZSQL method
in a skin):

<dtml-comment>
connection_id: DB_Connection
arguments:domain userid
cache_time: 60
max_cache: 50000
</dtml-comment>
execute dbo.EmployeeByNT <dtml-sqlvar domain type="string">, <dtml-sqlvar userid type="string">

We do have other ZSQL methods, but this one is by far the most heavily used.


Here's (most of) the offending method, _cached_result. (I've snipped some
code which just appends the db connection id to the query. Note also that
Bucket is defined as lambda: {} at the top of DA.py.)

        # Try to fetch from cache
        if hasattr(self,'_v_cache'): cache=self._v_cache
        else: cache=self._v_cache={}, Bucket()
        cache, tcache = cache
        max_cache=self.max_cache_
        now=time()
        t=now-self.cache_time_
        if len(cache) > max_cache / 2:
            keys=tcache.keys()
            keys.reverse()
            while keys and (len(keys) > max_cache or keys[-1] < t):
                key=keys[-1]
                q=tcache[key]
                del tcache[key]
                if int(cache[q][0]) == key:
                    del cache[q]
                del keys[-1]

        if cache.has_key(query):
            k, r = cache[query]
            if k > t: return r

        # call the pure query
        result=apply(DB__.query, pure_query)
        if self.cache_time_ > 0:
            tcache[int(now)]=query
            cache[query]= now, result

        return result

In this code, 'tcache' is a dict mapping integer timestamps (the result of
calling int(t), where t is the floating-point value returned by time.time())
to queries, and 'cache' maps queries to a 2-tuple of the floating-point
timestamp at which the method was called and the result of the query.

I've been trying to figure out a circumstance which could cause the KeyError
we're seeing in production, without success. 

Here are some observations which may give some clues:

The result of time.time() is converted to an integer and used as a key into
tcache. AFAICS, this means that at most one entry per second will be stored
in tcache, even though every distinct query will be stored in cache. It also
looks like cache could be the source of a memory leak.

For example:
- 'SELECT a FROM b' and 'SELECT c FROM b' run in the same 1-second window 
- 'SELECT c FROM b' will end up stored in tcache. 
- Both queries end up in cache. 
- 'SELECT a FROM b' will only ever be removed from cache if it is executed
again later in that same second - and then it is 'SELECT c FROM b' that will
remain behind.

We only see the KeyError after 'some time' in the application. Presumably
this is the time it takes for the caches to fill and start actively removing
entries as per the while loop. (We restarted the cluster last night, and
have yet to see the errors return).

The SQL query we are running maps an NT credential (domain and userid) to an
internal user id from our corporate database. To give an idea of the usage
pattern of this ZSQL method, it gets run at least once per request, with the
application serving approximately 150,000 - 180,000 requests per day from
12,000 - 14,000 distinct users (and hence that many different argument
combinations to the ZSQL method).

I'm going to put some logging code around the line that causes the KeyError
next time we do a release. However - has anyone seen this before? Can anyone
see what might be going wrong here?
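For what it's worth, here's a sketch of what that instrumentation might look
like - untested against Zope 2.8, and 'evict' is a hypothetical helper
standing in for the body of the while loop, not the real DA.py code; the
logger name is made up too:

```python
import logging

logger = logging.getLogger('ZRDB.DA')  # hypothetical logger name

def evict(cache, tcache, key):
    """Drop the tcache entry for `key`, and drop the cached result it
    points at only if that result was stored in the same second.
    Guards the lookup that currently raises the KeyError."""
    q = tcache.pop(key)
    entry = cache.get(q)
    if entry is None:
        # The state we are seeing in production: a tcache entry points
        # at a query that is no longer in cache.
        logger.warning('tcache key %r points at missing query %r', key, q)
    elif int(entry[0]) == key:
        del cache[q]
```

This keeps the eviction loop's behaviour for the normal case, but logs and
survives the orphaned-key case instead of raising.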

Cheers,
Dan

--
dan.fairs at rbos.com | GBMT | Royal Bank of Scotland plc

