[Zope] Re: Indexing and plaintext display gives PDF errors

Casey Duncan cduncan@kaivo.com
Mon, 11 Jun 2001 11:10:16 -0600


Leigh Ann Hildebrand wrote:
> 
> I'm using Zope 2.3.2 with Python 1.5.2 running on Redhat. I don't use
> Python, I work in DTML. I'm cataloging technical documents. I do not
> use Document Library or the CMF, in part because of compatibility restrictions.
> (The site must support NetPositive, a non-javascript, non-CSS compatible
> browser.) The documents I'm indexing are html, text, Word, PowerPoint,
> and PDF files.

There isn't that much JavaScript in the DocumentLibrary Product, and it
gracefully handles non-JS browsers (the only part that doesn't work is
the index chooser, which uses JavaScript to pass values between
windows). If necessary, it would not be difficult to remove the JS from
the DTML methods provided by default.

> 
> I have the CMF and the Document Library product installed; I also had
> installed wvWare, though I'm not sure I installed it correctly. (The
> instructions were vague.)
> 
> This is my problem. When I update my Catalog, I get a number of errors
> on the linux box that runs my Zope installation, related to PDF files:
> 
> Error (0): PDF file is damaged - attempting to construct xref table ...
> Error: Top level pages is wrong type (null)
> Error: Couldn't read page catalog
> Error: Couldn't find trailer dictionary
> Error: Couldn't read xref table

These look like errors coming from the PDF converter (pdftotext). Try
running the converter manually on one of these files at the command
line, for example:

% pdftotext some.pdf some.txt

and check whether you see the same errors. If you do, perhaps your
version of Xpdf needs updating, or it is not compatible with the files
you are providing for some reason.
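
If there are many PDFs to check, the manual test above could be
scripted. What follows is only a rough sketch, not part of
DocumentLibrary: the function name check_pdfs and its converter
parameter are made up for illustration, and it assumes a modern Python 3
(not the 1.5.2 that ships with your Zope) with pdftotext on the PATH. It
runs the converter on each file with the same two arguments as the
command above and collects whatever the converter prints to stderr.

```python
import subprocess
import tempfile
from pathlib import Path


def check_pdfs(pdf_paths, converter=("pdftotext",)):
    """Run the converter on each PDF and return {path: message} for failures.

    `converter` is the command prefix; it is called as
    `<converter> <input.pdf> <output.txt>`, like the manual test.
    """
    failures = {}
    # Converted text is written to a throwaway directory and discarded.
    with tempfile.TemporaryDirectory() as tmp:
        for pdf in map(Path, pdf_paths):
            out = Path(tmp) / (pdf.stem + ".txt")
            result = subprocess.run(
                [*converter, str(pdf), str(out)],
                capture_output=True,
                text=True,
            )
            stderr = result.stderr.strip()
            # Treat a non-zero exit status or any stderr output as a problem.
            if result.returncode != 0 or stderr:
                failures[str(pdf)] = stderr or f"exit status {result.returncode}"
    return failures
```

Calling check_pdfs(["some.pdf", "other.pdf"]) would return a dictionary
holding only the files the converter complained about, which should
narrow down whether every PDF is affected or just a few. If pdftotext is
not installed, subprocess.run raises FileNotFoundError instead.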

> 
> These repeat a few times, giving me two screens worth, before the index
> updating is complete. I can think of at least one problem that might
> be going on here: I think some PDF documents were added as type "DocumentFile",
> which is related to the DocumentLibrary stuff.

DocumentFile objects are like Files, except they support the conversion
of PDF to text (among others) for indexing.

> 
> Anyway, I'm trying to get rid of the errors, and be able to index the
> text of PDF and Word files. Suggestions? I'm forwarding this to the DocumentLibrary
> product engineer, too.
> 
> Leigh Ann
> 

Try testing out pdftotext and see what happens. Let me know what you
find out.

-- 
| Casey Duncan
| Kaivo, Inc.
| cduncan@kaivo.com
`------------------>