[Zope] attribute used to index PDFs?

Andreas Jung lists at andreas-jung.com
Fri Feb 24 03:50:58 EST 2006



--On 12 December 2005 14:54:09 -0500 "Garth B." <garthb at gmail.com> wrote:

> On closer inspection, the Word docs aren't actually being indexed
> properly either.  When I browse the vocabulary for these indexed
> Word docs, I see the same textual content that shows up when I cat
> the document to stdout.  The vocab also includes other strings
> that certainly are not content.  I guess they're string
> representations of binary content.
>
> These are other things that I noticed, maybe they won't amount to
> anything:
>
> - When I watch the processes with top during indexing, I don't see
> wvWare or pdftotext appear.  Maybe they won't.
>
> - I also inserted a couple of LOG.warn's in src/textindexng/content.py
> around line 130 (  if d.has_key('mimetype'):  ), and this test always
> fails, thereby skipping conversion.
>
> - Digging further in this file, "mimetype" is only defined when
> extract_content() in content.py calls "icc.addBinary(...)".  This only
> happens when the indexed object provides a txng_get() hook (or I
> suppose if an adapter exists).  That whole block (around lines 81 -
> 93) never gets hit with my PDFs or Word docs during indexing.  When I
> index a large number of PDFs I will get a number of TypeErrors raised
> around line 110 when extract_content() notices that the data isn't a
> [unicode] string.
>
> Is the standard Zope File object supposed to expose a txng_get hook?
>
> On 12/12/05, Garth B. <garthb at gmail.com> wrote:
>> Hi Andreas,
>>
>> Neither PrincipiaSearchSource nor SearchableText does anything for
>> these File-type objects.  I guess nothing for SearchableText is
>> expected since these are not CMF or Plone-derived objects.  The only
>> way I've managed to get *anything* indexed for these File-type objects
>> is by specifying the "data" attribute.
>>
>> A couple of related postings that I've found through a bit of Googling
>> have also noted having to use "data" when indexing these kinds of
>> files, for example:
>> http://mail.zope.org/pipermail/zope/2003-August/139702.html
>>
>> So, I should be able to use PrincipiaSearchSource?  I've only used
>> that for text-oriented objects like Page Templates.  I'll keep digging
>> around, but I welcome any suggestions for what the problem could be or
>> how I can debug this further.

Maybe you should bring this to the TXNG bug tracker (as suggested!).
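
For anyone following along, the control flow described in the quoted
message can be sketched roughly as below. This is only an illustration
of the reported behaviour, not the actual code in
src/textindexng/content.py: the Collector class, the convert() helper,
the two example classes, and the hook's signature and return value are
all hypothetical stand-ins.

```python
# Illustrative sketch only. Every name and signature here is a
# hypothetical stand-in, not the real TextIndexNG3 API.


class Collector:
    """Collects per-field content, loosely modelled on the 'icc' object."""

    def __init__(self):
        self.fields = {}

    def add_text(self, field, text):
        # Plain text: no 'mimetype' key is recorded.
        self.fields[field] = {"content": text}

    def add_binary(self, field, data, mimetype):
        # Binary data: the 'mimetype' key is what later triggers conversion.
        self.fields[field] = {"content": data, "mimetype": mimetype}


def convert(data, mimetype):
    """Placeholder for an external converter such as pdftotext or wvWare."""
    return "<text extracted from %s>" % mimetype


def extract_content(obj, field):
    icc = Collector()
    hook = getattr(obj, "txng_get", None)
    if hook is not None:
        # Assumed hook shape: returns (raw data, mimetype) for the field.
        data, mimetype = hook(field)
        icc.add_binary(field, data, mimetype)
    else:
        # No hook: the raw attribute value is taken as-is.
        icc.add_text(field, getattr(obj, field))

    d = icc.fields[field]
    if "mimetype" in d:
        # Only reached for content that was added as binary.
        return convert(d["content"], d["mimetype"])
    if not isinstance(d["content"], str):
        raise TypeError("content of %r is not a text string" % field)
    return d["content"]


class PlainFile:
    """Stands in for an object that exposes only a raw 'data' attribute."""

    def __init__(self, data):
        self.data = data


class HookedFile(PlainFile):
    """Same object, but providing the (hypothetical) hook."""

    def txng_get(self, field):
        return getattr(self, field), "application/pdf"
```

With this sketch, indexing PlainFile(b"%PDF-1.4") on "data" raises a
TypeError and never reaches the converter, while HookedFile goes
through convert(). That matches the symptoms reported above, but it is
a model of the description, not a diagnosis.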

-aj

