[Zope] Advice on searching/indexing Word documents?

sean.upton@uniontrib.com sean.upton@uniontrib.com
Wed, 03 Jan 2001 08:25:04 -0800


I really like the idea of extending OFS:File to support different file
types, but what I would like to see is something that is
format/filter/library agnostic.  That is to say, perhaps the way we
ought to go about this is to create an API framework that, upon upload,
runs the file through a filter registered for its mime-type.  We could
create a generic base class that implements a generic filtering API, and
extend it with more specific classes for files of particular types or
groups (fine-grained to mime-type, or grouped by category, e.g.
"Illustration").
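
To make the base-class idea concrete, here is a rough Python sketch.  It
is not Zope code: FilteredFile, PlainTextFile, upload() and
extract_text() are all names I made up for illustration, and a real
version would presumably hook into whatever the File class does on
upload.

```python
# Hypothetical sketch: none of these names come from Zope itself.

class FilteredFile:
    """Generic base: stores raw bytes and runs a filter on every upload."""

    mime_types = ()  # MIME types this class claims to handle

    def __init__(self, data=b"", content_type="application/octet-stream"):
        self.data = b""
        self.content_type = content_type
        self.indexable_text = ""
        self.upload(data, content_type)

    def upload(self, data, content_type):
        """Store the new data, then re-run the filter on it."""
        self.data = data
        self.content_type = content_type
        self.indexable_text = self.extract_text(data)

    def extract_text(self, data):
        """Override in subclasses; the base class extracts nothing."""
        return ""


class PlainTextFile(FilteredFile):
    """Most specific case: one class for one MIME type."""

    mime_types = ("text/plain",)

    def extract_text(self, data):
        return data.decode("utf-8", errors="replace")
```

A subclass for Word or PostScript files would override only
extract_text(), so the upload path stays the same for every format.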

Having such a generic framework would enable Zope to be an excellent
platform for digital asset management.  Suppose you had a class for all
files for a particular purpose, and those files would always be of a
particular set of mime-types, like Illustrator, PDF, or PostScript.  For
example, if someone working at a newspaper creates a new file class
called "DisplayAd," which is used for PostScript files with embedded
fonts, containing specific text, a filter set up as part of the extended
class for DisplayAd files would detect the type of file, determine it
was PostScript, and filter out the text and the face names of the
embedded fonts.  If the file was a PDF or an AI file instead, it would
run the appropriate filter for that type.
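
The detect-then-dispatch step for DisplayAd could look roughly like the
sketch below.  The class and filter functions are hypothetical
placeholders; the only real facts relied on are that PDF files begin
with "%PDF-" and PostScript files with "%!PS".  Illustrator files are
not handled here, and the filters return dummy metadata rather than real
text or font names.

```python
# Hypothetical sketch: DisplayAd and the filter functions are made-up names.

def sniff_type(data):
    """Guess a MIME type from the leading bytes of the file."""
    if data.startswith(b"%PDF-"):
        return "application/pdf"
    if data.startswith(b"%!PS"):
        return "application/postscript"
    return "application/octet-stream"


def postscript_filter(data):
    # Placeholder: a real filter would pull out text and font face names.
    return {"format": "postscript", "size": len(data)}


def pdf_filter(data):
    # Placeholder for a PDF-specific extractor.
    return {"format": "pdf", "size": len(data)}


class DisplayAd:
    """File class for one purpose that accepts a fixed set of types."""

    filters = {
        "application/postscript": postscript_filter,
        "application/pdf": pdf_filter,
    }

    def __init__(self, data):
        self.data = data
        self.content_type = sniff_type(data)
        filter_func = self.filters.get(self.content_type)
        # Unrecognized types are stored with empty metadata.
        self.metadata = filter_func(data) if filter_func else {}
```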

It might also be nice to have an extended class (inherited from File)
that works for all types and keeps a configurable plugin registry, so
that we can create plugin classes for specific mime-types but only have
to use one class for the objects themselves.  This might be more
practical.
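
A minimal sketch of the registry variant, again with invented names
(FILTERS, register_filter, GenericFile) and assuming a filter is just a
callable that takes bytes and returns text:

```python
# Hypothetical sketch: FILTERS, register_filter and GenericFile are made up.

FILTERS = {}  # maps MIME type -> callable(bytes) -> str


def register_filter(mime_type):
    """Decorator that records a filter function for one MIME type."""
    def decorator(func):
        FILTERS[mime_type] = func
        return func
    return decorator


@register_filter("text/plain")
def plain_text_filter(data):
    return data.decode("utf-8", errors="replace")


class GenericFile:
    """One class for every file; the filter is looked up at upload time."""

    def __init__(self, data, content_type):
        self.data = data
        self.content_type = content_type
        filter_func = FILTERS.get(content_type)
        # Unknown types are stored but contribute no searchable text.
        self.indexable_text = filter_func(data) if filter_func else ""
```

Adding support for a new format then means registering one more
function, with no new object class.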

One thing that seems important: creating an API like this could allow us
to write filter "plugins" in a variety of Zope-supported configurations,
such as completely in Python, a Python class extending a C shared
library, something written in a combination of C/Lex, or the
Python-based Plex scanner that was mentioned earlier.  For that matter,
even proprietary user-space binaries called via Python code might be
fair game...
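
As an illustration of the last case, a plugin that wraps an external
program might be sketched like this.  CommandFilter is a hypothetical
name, and it assumes the program reads the file on stdin and writes text
to stdout; the demo uses a child Python process as a stand-in rather
than any real converter.

```python
import subprocess
import sys


class CommandFilter:
    """Filter plugin that pipes file data through an external program."""

    def __init__(self, argv, timeout=30):
        self.argv = list(argv)
        self.timeout = timeout

    def __call__(self, data):
        # Feed the raw bytes on stdin and collect whatever comes back.
        # check=True raises CalledProcessError on a non-zero exit status.
        result = subprocess.run(
            self.argv,
            input=data,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=self.timeout,
            check=True,
        )
        return result.stdout.decode("utf-8", errors="replace")


# Stand-in "converter": a child Python process that upper-cases stdin.
demo_filter = CommandFilter(
    [sys.executable, "-c",
     "import sys; sys.stdout.write(sys.stdin.read().upper())"]
)
```

Because the plugin is just a callable, the framework would not need to
know whether a filter is pure Python, a C extension, or a wrapped
binary.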

I really think that this idea has potential as a project, and would be
willing to contribute.

Sean

-----Original Message-----
From: Bjorn Stabell [mailto:bjorn@exoweb.net]
Sent: Tuesday, January 02, 2001 10:07 PM
To: zope@zope.org
Subject: RE: [Zope] Advice on searching/indexing Word documents?


This is something I've been longing for for a long time.  wvWare is
cool, and it should also be able to access properties of many Windows
(OLE) documents, not just Word documents.

I've been thinking about extending the File class so that it becomes
aware of the different file types and allows access to (read/write)
metadata and indexing of the files' content.  If we can set up a nice
framework for it, I'm sure a lot of people could contribute code for
specific file formats.

Bye,
-- 
Bjorn

-----Original Message-----
From: Jens Vagelpohl [mailto:jens@digicool.com]
Posted At: Wednesday, January 03, 2001 11:28
Posted To: Zope List
Conversation: [Zope] Advice on searching/indexing Word documents?
Subject: Re: [Zope] Advice on searching/indexing Word documents?


If you're on Linux, check out wvWare:

http://www.wvware.com

It's a C library that handles all Word doc formats since 6.0 or so.

jens


On Tue, 02 Jan 2001, Bowyer, Alex wrote:
> Our company has a repository of staff CVs (Resumes) as Word Documents and I
> am about to embark on creating a new feature for our Zope Intranet to allow
> project managers to search those documents for keywords such as particular
> skills or projects.
>
> I am thinking about several possibilities such as a skills/CVs database
> linked in via ODBC, or some task that converts the Word documents to text
> files which can then be searched by Zope (I think Zope can do this, and I
> assume it can't search Word format directly?).
>
> Has anyone ever approached a similar problem, does anyone have any tips on
> how to index/search a load of documents in Zope?
>
> Any tips/suggestions/comments would be most welcome.
>
> Thanks,
>
> Alex


_______________________________________________
Zope maillist  -  Zope@zope.org
http://lists.zope.org/mailman/listinfo/zope
**   No cross posts or HTML encoding!  **
(Related lists - 
 http://lists.zope.org/mailman/listinfo/zope-announce
 http://lists.zope.org/mailman/listinfo/zope-dev )

