PyPDF2 (forked from pyPdf) is wonderful. I use it a fair bit in my job, mainly for chopping up PDFs and re-assembling the pages in a different order. It does sometimes have difficulty with non-standard PDFs that seem fine in other programs, though, which can be frustrating.
The error I've been battling with today, on some PDFs provided by a client, was:
PyPDF2.utils.PdfReadError: EOF marker not found
I managed to find a workaround using PDFtk to fix the PDF in memory at the first sign of any trouble. It works well so far, so in case anyone else is having similar issues I thought I’d write it up.
So here’s how I was opening PDF files before.
from PyPDF2 import PdfFileReader
from cStringIO import StringIO
input_path = 'c:/test_in.pdf'
with open(input_path, 'rb') as input_file:
    input_buffer = StringIO(input_file.read())
input_pdf = PdfFileReader(input_buffer)
At that point you're free to do whatever you want with input_pdf, provided of course that it loaded without issue. I'm loading the file into a StringIO object first for speed; the program this is from does lots of things with the file, and StringIO made things much faster.
So to work around the EOF problem I added a new decompress_pdf function that gets called if there's a problem parsing the PDF. It takes the data from the StringIO and sends it to a PDFtk process on stdin that simply runs PDFtk's uncompress command on the data. The fixed PDF is read back from stdout and returned as a StringIO, where things will hopefully carry on as if nothing happened.
from PyPDF2 import PdfFileReader, utils
from cStringIO import StringIO
import subprocess
input_path = 'c:/test_in.pdf'
def decompress_pdf(temp_buffer):
    # A StringIO has no real file descriptor, so it can't be handed to Popen
    # as stdin directly; pipe its contents in through communicate() instead.
    process = subprocess.Popen(['pdftk.exe',
                                '-',  # Read from stdin.
                                'output',
                                '-',  # Write to stdout.
                                'uncompress'],
                               stdin=subprocess.PIPE,
                               stdout=subprocess.PIPE,
                               stderr=subprocess.PIPE)
    stdout, stderr = process.communicate(temp_buffer.getvalue())
    return StringIO(stdout)
with open(input_path, 'rb') as input_file:
    input_buffer = StringIO(input_file.read())
try:
    input_pdf = PdfFileReader(input_buffer)
except utils.PdfReadError:
    # Pass the in-memory buffer rather than the file object, which is
    # already closed once the with block has finished.
    input_pdf = PdfFileReader(decompress_pdf(input_buffer))
The problem I was seeing seemed to be caused by invalid characters appearing after the %%EOF marker in the PDF. PDFtk seems better at coping with this and spits out a valid PDF when the uncompress command is used.
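If junk after the final %%EOF marker really is the only thing wrong with a file, trimming it in pure Python might be enough and would avoid starting an external process. The following is only a minimal sketch under that assumption, not what the code above does. The helper name strip_after_eof is made up for illustration, and it is written for Python 3 bytes rather than the Python 2 StringIO used earlier.

```python
EOF_MARKER = b'%%EOF'


def strip_after_eof(data):
    """Return data cut off just after the last %%EOF marker.

    If no marker is found the data is returned unchanged, so the caller
    still has to handle that case (for example by falling back to PDFtk).
    """
    index = data.rfind(EOF_MARKER)
    if index == -1:
        return data
    # Keep the marker itself and finish the file with a single newline.
    return data[:index + len(EOF_MARKER)] + b'\n'
```

The trimmed bytes could then be wrapped in an in-memory buffer and handed to the reader as before. This won't help with other kinds of damage, such as a missing marker or a broken cross-reference table, which is where a repair tool like PDFtk is still the safer bet.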
Of course, more error detection would be good in case parsing still fails, but this worked for me today and made me happy.
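As a rough idea of what that extra error detection could look like, here is a sketch of a wrapper that checks whether the external command started, exited cleanly, and actually produced some output. It assumes the tool signals failure with a non-zero exit status or empty output, which I haven't verified for every PDFtk failure mode. The names run_repair and PdfRepairError are hypothetical, and the code is written for Python 3 (BytesIO instead of cStringIO).

```python
import subprocess
from io import BytesIO


class PdfRepairError(Exception):
    """Raised when the external repair command fails (hypothetical name)."""


def run_repair(command, pdf_bytes):
    """Pipe pdf_bytes through command and return its stdout as a BytesIO.

    command is an argument list, for example
    ['pdftk', '-', 'output', '-', 'uncompress'].
    """
    try:
        process = subprocess.Popen(command,
                                   stdin=subprocess.PIPE,
                                   stdout=subprocess.PIPE,
                                   stderr=subprocess.PIPE)
    except OSError as error:
        # Usually means the executable is missing or not on the PATH.
        raise PdfRepairError('could not start %s: %s' % (command[0], error))
    stdout, stderr = process.communicate(pdf_bytes)
    if process.returncode != 0:
        raise PdfRepairError('%s exited with code %d: %s' % (
            command[0], process.returncode,
            stderr.decode('utf-8', 'replace').strip()))
    if not stdout:
        raise PdfRepairError('%s produced no output' % command[0])
    return BytesIO(stdout)
```

The caller can then catch PdfRepairError separately from the parser's own exception and report which file could not be repaired, instead of failing with a confusing second parse error.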