[Zope-dev] Re: Splitter.c Hack

Jason Spisak 444@hiretechs.com
Sat, 06 Jan 2001 02:05:28 GMT


Erik,

> [Jason Spisak]
> 
> | I am running on a big machine though.  If anyone wants those changes
> | they're really easy.  Just mail me directly, since it's a long file
> | to post.
> 
> Hi.  I would be interested in the file :-).
> 

Okay, here's the diff. It truly is nothing more than commenting out the two
checks that eliminate single-letter words and words with no letters in
them (such as numbers):

*** Zope-2.2.4-src/lib/python/SearchIndex/Splitter.c 
--- Zope-2.2.4-src/lib/python/SearchIndex/Splitter_Old.c 
***************
*** 169,192 ****
      len = PyString_Size(word) - 1;
  
      len = PyString_Size(word);
-     /*if(len < 2)      Single-letter words are stop words!
-     {
-       Py_INCREF(Py_None);
-       return Py_None;
-     }     */
- 
-     /*************************************************************
-       Test whether a word has any letters.                       */
  
      for (; --len >= 0 && ! isalpha((unsigned char)cword[len]); );
-     /*if (len < 0)
-     {
-         Py_INCREF(Py_None);
-         return Py_None;
-     }
- 
-      * If no letters, treat it as a stop word.
-      *************************************************************/
  
      Py_INCREF(word);
  
--- 169,176 ----
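
For anyone who would rather see the logic than read the diff, here is a
rough standalone sketch of what the two commented-out checks do. It is
only an illustration: the function name is made up, it works on a plain
C string instead of a Python string object, and it returns 1/0 where the
real code returns Py_None for a stop word.

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical helper, not part of Splitter.c.
 * Returns 1 if the two stock checks would throw the word away as a
 * stop word, 0 if the word would pass both checks. */
static int stock_checks_would_drop(const char *word)
{
    int len = (int)strlen(word);

    /* First check: single-letter (and empty) words are stop words. */
    if (len < 2)
        return 1;

    /* Second check: scan backwards looking for any alphabetic
     * character; len ends up below zero if none is found. */
    for (; --len >= 0 && !isalpha((unsigned char)word[len]); )
        ;
    if (len < 0)
        return 1;   /* no letters at all, e.g. "310" */

    return 0;
}
```

With both checks in place a term like "310" is dropped because it has no
letters, and a lone "C" is dropped because it is a single character.
Commenting the checks out lets both through to the index.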



> Would you also be willing to share some statistics on how many objects
> you have in how many indexes, and how much time "complex" searches
> take?  I do understand if this is not possible, but it'd be appreciated
> if it was possible. :-)
> 
> Thanks.

Well, here's some of the output of the "Status" tab in the Catalog.

Subtransactions are Disabled

 Subtransactions

          ---------------------------------------------------------

Index Status

   * 48205 object are indexed in bobobase_modification_time
   * 48205 object are indexed in calendar_date
   * 48205 object are indexed in calendar_day
   * 48205 object are indexed in call_date
   * 48205 object are indexed in curators
   * 48205 object are indexed in data
   * 48205 object are indexed in id
   * 48205 object are indexed in meta_type
   * 48205 object are indexed in resume_in
   * 48205 object are indexed in status
   * 48205 object are indexed in users_calendar

The only TextIndex is the 'data' index though.  It is the one that gets
hammered.

Let's see...time stats...hmmm

I put a REQUEST.set with the ZopeTime at the top of the search page and at
the bottom after the 'in' tag for the Catalog. 

Search terms are:  los and angeles and C++ and MFC and 310

Subtracting the floats of the two times, I get 1.85400104523.  I'm not
sure what unit that is; I think it's a fraction of a day, though, because
of DateTime.
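
If the figure really is in days, converting it to seconds is just a
multiplication by 86400. The little sketch below shows that arithmetic;
the helper name is made up for illustration. Read that way, 1.854 days
comes to roughly 160,000 seconds (over 44 hours), which is far too long
for one search, so the number may well already be in seconds. That is a
guess, not something I have checked against the DateTime source.

```c
#define SECONDS_PER_DAY 86400.0

/* Hypothetical helper: convert a time difference expressed in days
 * into seconds. */
static double days_to_seconds(double days)
{
    return days * SECONDS_PER_DAY;
}
```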

The server stats:

Dual Intel 400 MHz Xeon w/ 1MB cache each
LVD RAID 5 7200 RPM disk array
1GB RAM
RedHat Linux 6.1 with some kernel updates...
And the best piece of open source software I know:  Zope 2.2.4 binary
release
 
Hope that helps.


All my best,


Jason Spisak
CIO
    __ ___       ______        __
   / // (_)_____/_  __/__ ____/ /  ___  _______  __ _
  / _  / / __/ -_) / / -_) __/ _ \(_-<_/ __/ _ \/  ' \
 /_//_/_/_/  \__/_/  \__/\__/_//_/___(_)__/\___/_/_/_/

6151 West Century Boulevard
Suite 900
Los Angeles, CA 90045
P. 310.665.3444
F. 310.665.3544

Under US Code Title 47, Sec.227(b)(1)(C), Sec.227(a)(2)(B) This email
address may not be added to any commercial mail list without my
permission.  Violation of my privacy with advertising or SPAM will
result in a suit for a MINIMUM of $500 damages/incident, $1500 for
repeats.