[Zope3-Users] Indexing PDF files

Frank Burkhardt fbo2 at gmx.net
Thu May 11 02:02:19 EDT 2006


Hi,

On Wed, May 10, 2006 at 03:29:34PM -0500, Sreeram Raghav wrote:

[snip]

> Initially the only files being indexed were "ZPT pages", but after writing
> the adapter even text files were being indexed.
> However, the problem is that when I try to add PDF or Word documents, the
> files are not indexed and an error is shown saying the files cannot be decoded.

This adapter was just a demonstration of how to index a content object
containing a text field. It assumes that context.data contains just a plain
string. To index PDF files, you'll have to somehow convert the PDF data to
plain text:

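# MagicPdfToText is a placeholder: you have to write this converter yourself.
from ModuleYouHaveToWrite import MagicPdfToText

class SearchableTextAdapter(object):
[...]
   def getSearchableText(self):
      # Convert the raw PDF data to plain text before handing it to the index.
      text=MagicPdfToText(self.context.pdfdata)
      return (text,)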

I don't know whether there's a pure Python solution for extracting text from PDF files,
but you might consider calling an external program like 'pdftotext' to do the job.
Either way, it's your adapter's responsibility to act as defined by the interface, and
'ISearchableText' says the adapter must provide plain indexable text.
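
Here is a minimal sketch of the external-program approach. It assumes the
'pdftotext' command line tool (shipped with Xpdf and Poppler) is installed and
on the PATH, and that the content object keeps the raw PDF bytes in an
attribute called 'pdfdata'. The helper name 'pdf_to_text' and its 'command'
parameter are made up for this example, and the adapter is shown without its
Zope registration, so treat it as a starting point rather than a finished
implementation:

```python
import os
import subprocess
import tempfile


def pdf_to_text(pdf_bytes, command=("pdftotext", "-enc", "UTF-8")):
    """Convert raw PDF bytes to plain text by running an external converter.

    `command` is the converter plus its options. The path of a temporary
    input file and "-" (write the text to stdout) are appended to it.
    """
    # Write the PDF data to a temporary file the external program can read.
    fd, path = tempfile.mkstemp(suffix=".pdf")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(pdf_bytes)
        # check=True raises CalledProcessError if the converter fails.
        result = subprocess.run(
            list(command) + [path, "-"],
            stdout=subprocess.PIPE,
            check=True,
        )
    finally:
        os.unlink(path)
    return result.stdout.decode("utf-8", errors="replace")


class SearchableTextAdapter(object):
    """Sketch of an adapter that hands plain text to the index."""

    def __init__(self, context):
        self.context = context

    def getSearchableText(self):
        # 'pdfdata' is an assumed attribute holding the raw PDF bytes.
        return (pdf_to_text(self.context.pdfdata),)
```

Going through a temporary file avoids relying on the converter being able to
read from stdin. A Word document would need a different converter, but the
adapter could stay the same as long as it ends up returning plain text.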

Regards,

Frank

