[Zope] my solution for manipulating PDF

Kyler B. Laird laird@ecn.purdue.edu
Sat, 16 Jun 2001 13:24:48 -0500


For the projects we do, we frequently need to
generate a PDF file that consists of PDF files
(submitted to us by proposal/paper authors) 
and dynamic data (from other Zope objects).

Before we started The Great Zope Migration, we
used LaTeX to generate the dynamic pages and
then glued them together with pjscript.
	http://www.etymon.com/pj/pjscript.html

I've been trying to find a good way to do this
under Zope.  I have looked at PDFlib.
	http://www.pdflib.com/pdflib/
I like that it has Python bindings, but for
manipulation of existing PDF, we would have to
purchase its sister product, PDI.

More recently, I was pointed at ReportLab.
	http://www.reportlab.com/
This is even better; it's written in Python.
(I see that it's being used by other Zopers,
too.)  Unfortunately, it too requires another
product, PageCatcher
	http://www.reportlab.com/pageCatcher/
to do the things I want to do.  It would be a
slick integrated solution, but I might end up
using it without PageCatcher.

We have money to spend, but I'm dead set
against using Closed solutions right now, so 
neither of these appealed to me enough.  
Instead, I decided to fall back on pjscript.

Cameron Laird helped me get started with a
simple PJ document class.  I made it work as
a Zope External Method for trivial purposes,
but then I rewrote most of it today so that I
can do most everything that pjscript offers. 
(I'll include PJ.py, the Zope extension, at
the end of this message.)  I was surprised at
how easy it was to do this.

Here's how I use it from a Python Script:
	document = container.WRRC.Ztools.PJdoc()
	
	# Sandwich the NYT fax between two simple pages.
	document.readpdf(container['simple.pdf'].data)
	document.appendpdf(container['nytfax.pdf'].data)
	document.appendpdf(container['simple.pdf'].data)
	
	# Remove the first page of the NYT fax.
	document.deletepage(2)
	
	# Make an 'X'.
	document.setpage(2)
	document.drawline((0,0), (500,500), 5)
	document.drawline((0,500), (500,0), 5)
	
	# Throw text around.
	document.setpage(1)
	document.initxy()
	document.drawtext(text='upper left', font='Courier-BoldOblique', fontsize=8)
	document.drawtext(text='PJdoc test', font='Helvetica-Bold', fontsize=16, pos=(50, 600))
	document.drawtext('Howdy!', fontsize=90)
	
	# Write lines of text.
	document.setinit((200,400))
	document.initxy()
	document.drawtext('one')
	document.nextxy()
	document.drawtext('two')
	document.nextxy()
	document.drawtext('three')
	
	# Set the resulting document's info.
	document.setinfo('Author', 'Kyler Laird')
	document.setinfo('Keywords', 'foo blah test NYTimes')
	
	context.REQUEST.RESPONSE.setHeader('Content-type', 'application/pdf')
	return document.writepdf()

Although I really balked at using system() calls
into pjscript (wanting to go straight to the PJ
API through some Python/Java wizardry) for this,
it's hardly noticeable at this level.  I like
the solution.  I'm almost confident that I can
safely encourage its wide use here.
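
To make the system() approach concrete: each method call only appends lines to a pjscript program, using the "$name value" convention for variables followed by a bare command name, and pjscript is run once when the output is written. What follows is a simplified sketch of that accumulation step only, not the module itself; the class name ScriptSketch and its method names are hypothetical, and nothing here invokes pjscript.

```python
class ScriptSketch:
    """Hypothetical stand-in for PJdoc that only builds the script text."""

    def __init__(self):
        self.lines = []

    def set_variable(self, var, val):
        # Skip options the caller left unset, as PJ.py does.
        if val is None:
            return
        # Everything is a string to pjscript.
        val = str(val)
        # Reject line breaks so a value cannot smuggle in another command.
        if '\n' in val or '\r' in val:
            raise ValueError("Invalid value: %r" % val)
        # To store "data" in "x", emit "$x data".
        self.lines.append('$%s %s' % (var, val))

    def command(self, name):
        self.lines.append(name)

    def drawline(self, start, end, linewidth=None):
        (x0, y0) = start
        (x1, y1) = end
        for var, val in (('x0', x0), ('y0', y0), ('x1', x1), ('y1', y1),
                         ('linewidth', linewidth)):
            self.set_variable(var, val)
        self.command('drawline')

    def script(self):
        return '\n'.join(self.lines) + '\n'


sketch = ScriptSketch()
sketch.set_variable('page', 2)
sketch.drawline((0, 0), (500, 500), 5)
print(sketch.script())
# $page 2
# $x0 0
# $y0 0
# $x1 500
# $y1 500
# $linewidth 5
# drawline
```

In the real extension the same text is available from show_script(), which is handy for checking what will be handed to pjscript before anything is run.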

Next I'm going to work on extensions for
Ghostscript (for PS->PDF conversion), LaTeX and
html2ps.  I'll probably revisit ReportLab, too.

I only offer this because I'm guessing someone
else might travel down this road someday and it
could help a bit.  I'd put it somewhere more
permanent, but I'm not willing to commit to its
correctness nor to its maintenance right now.  
Please smack me if I'm out of line posting such
things here.

Thank you.

--kyler

===================================================
PJ.py
===================================================
import tempfile
import os
import sys
import string

# Return a PJdoc object to caller.
def PJdoc():
	return _PJdoc()

class _PJdoc:
	# Allow Zope users to access PJdoc's methods.
	__allow_access_to_unprotected_subobjects__=1


	# CL intends to re-sort these def-s into utilities and publics.
	def __init__(self):
		# the PJ script I'm building
		self._script = ''

		# Keep track of temporary files.
		self.tmpfiles = []

	def __del__(self):
		# Clean up my temporary files.
		for file in self.tmpfiles:
			os.unlink(file)

	# Put a string in a temporary file.
	# Keep track of it so we can delete it when
	# this object is destroyed.
	def _write_string_to_tmpfile(self, text):
		filename = tempfile.mktemp("pjs")

		# Add to list of temporary files.
		self.tmpfiles.append(filename)

		# Binary mode, since the text may be PDF data.
		file = open(filename, "wb")

		# I had problems just writing everything
		# at once, so now I write in 1K chunks.
		start = 0
		textlen = len(text)
		bufsiz = 1024
		while 1:
			end = start + bufsiz
	
			if (end >= textlen):
				file.write(text[start:])
				break
			else:
				file.write(text[start:end])
	
			start = end
		file.close()

		# Tell caller the name of the file used.
		return filename

	# Display our accumulated PJ script.
	def show_script(self):
		return self._script

	# Run the PJ script.
	def _run(self):
		tmpfile = self._write_string_to_tmpfile(self.show_script())

		command_string = "/usr/bin/env pjscript %s" % tmpfile

		result = os.system(command_string)
		if result != 0:
			report = "Failure with '%s'." % command_string
			# String exceptions are deprecated; raise a real one.
			raise RuntimeError(report)

	# Add a command to the PJ script.
	def _do(self, string):
		self._script = self._script + string + '\n'

	# Add a command that reads from a file to the PJ script.
	def _read_file_command(self, command, string):
		tmpfile = self._write_string_to_tmpfile(string)

		self._do('$file %s' % tmpfile)
		self._do('%s' % command)

	# Add a command that writes to a file to the PJ script.
	def _write_file_command(self, command):
		# I will handle destroying this.
		tmpfile = tempfile.mktemp("pdf")

		# Set "file" to point at the temporary file.
		self._do('$file %s\n' % tmpfile)
		# Do the command.
		self._do('%s\n' % command)

		# Run pjscript with the current script.
		# There are better ways to handle this?
		self._run()

		# Read the result (binary, since it is PDF data).
		file = open(tmpfile, "rb")
		string = file.read()
		file.close()

		# Destroy the output file, as promised above.
		os.unlink(tmpfile)

		# Return the text of the output file to
		# the caller.
		return string

	# Set a PJ script variable.
	def _set_variable(self, var, val):
		if val is None:
			return

		# Everything is a string to pjscript.
		val = str(val)

		# Make sure val isn't screwy.
		# If someone got a newline in, an arbitrary command
		# could be executed.
		if (string.find(val, '\n') != -1 or string.find(val, '\r') != -1):
			report = "Invalid value: '%s'." % val
			# String exceptions are deprecated; raise a real one.
			raise ValueError(report)

		# To store "data" in "x", do "$x data".
		self._do('$%s %s\n' % (var, val))

	# Kyler added this.
	def setpage(self, page=None):
		self._set_variable('page', page)
			
	# Kyler added this.
	def setinit(self, pos):
		(x, y) = pos
		self._set_variable('xinit', x)
		self._set_variable('yinit', y)


	# For commands below, see
	#	http://www.etymon.com/pj/pjscript.html
	# Note that I've handled x, y pairs as position tuples.

	def appendpdf(self, pdfstring):
		self._read_file_command(command='appendpdf', string=pdfstring)

	def deletepage(self, page=None):
		self._set_variable('page', page)
		self._do('deletepage')

	def drawline(self, start, end, linewidth=None):
		(x0, y0) = start
		(x1, y1) = end

		self._set_variable('x0', x0)
		self._set_variable('y0', y0)
		self._set_variable('x1', x1)
		self._set_variable('y1', y1)
		self._set_variable('linewidth', linewidth)
		self._do('drawline')

	def drawtext(self, text, font=None, fontsize=None, page=None, pos=None):
		self._set_variable('text', text)
		self._set_variable('font', font)
		self._set_variable('fontsize', fontsize)

		if pos is not None:
			(x, y) = pos
			self._set_variable('x', x)
			self._set_variable('y', y)
		self._set_variable('page', page)

		self._do('drawtext')

	def initxy(self):
		self._do('initxy')

	def newpdf(self):
		self._do('newpdf')

	def nextxy(self):
		self._do('nextxy')

	def readpdf(self, pdfstring):
		self._read_file_command(command='readpdf', string=pdfstring)

	def setinfo(self, key, text):
		self._set_variable('key', key)
		self._set_variable('text', text)
		self._do('setinfo')

	def writepdf(self):
		return self._write_file_command(command='writepdf')
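
===================================================
A possible alternative to _run (sketch only)
===================================================
The _run method above builds a shell command string and names its script file with tempfile.mktemp. On a newer Python than the one PJ.py targets, the same step could be sketched with tempfile.mkstemp and subprocess, which avoids both the shell and the window between choosing a file name and creating the file. This is only a sketch under those assumptions: run_script and its command parameter are hypothetical names, and the demonstration runs the Python interpreter as a stand-in because pjscript may not be installed.

```python
import os
import subprocess
import sys
import tempfile


def run_script(script_text, command=("pjscript",)):
    """Write script_text to a private temporary file and run command on it.

    `command` is an assumption: the argv prefix of the program to run,
    with the script's file name appended as the last argument.
    """
    # mkstemp creates and opens the file in one step.
    fd, path = tempfile.mkstemp(suffix=".pjs")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(script_text)
        # An argument list is passed directly, with no shell involved.
        argv = list(command) + [path]
        result = subprocess.run(argv)
        if result.returncode != 0:
            raise RuntimeError("Failure with %r." % (argv,))
    finally:
        # Clean up whether or not the command succeeded.
        os.unlink(path)


# Stand-in demonstration: the "script" is a one-line Python program.
run_script("print('ran the script file')\n", command=(sys.executable,))
```

A nonzero exit status becomes a RuntimeError, and the temporary file is removed either way, so there is nothing left over for __del__ to clean up.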