Using Perl and PDF::API2 to Update PDF Properties and Metadata

What do you do when you have 600 PDF documents with titles in all caps and you need those titles in title case? I dreaded the thought of asking anyone to open each document and edit the titles by hand, not to mention the typos that process might introduce.

For better or worse, here was my solution:

Grab the Perl PDF::API2 module from CPAN
Grab a Perl titlecase script
Write a script that opens the PDF, titlecases the title, and saves the PDF
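
A minimal sketch of those three steps might look like the following. It is not the original script: simple_titlecase is a naive stand-in for whatever titlecase script you download, retitle_pdf and the "-titlecased.pdf" output name are made up for illustration, and the info and saveas calls are the long-standing PDF::API2 method names (newer releases may prefer different ones).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for the downloaded titlecase script:
# lowercase the string, then capitalize the first letter of each word.
sub simple_titlecase {
    my ($title) = @_;
    $title = lc $title;
    $title =~ s/(\S+)/\u$1/g;
    return $title;
}

# Open one PDF, titlecase the title in its info hash, and save a copy.
# Returns the old and new titles, or an empty list if there is no title.
sub retitle_pdf {
    my ($in, $out) = @_;
    require PDF::API2;    # CPAN module, loaded on first use
    my $pdf  = PDF::API2->open($in);
    my %info = $pdf->info();
    return unless defined $info{Title} && length $info{Title};

    my $new = simple_titlecase($info{Title});
    $pdf->info(%info, Title => $new);
    $pdf->saveas($out);    # write a copy and leave the original untouched
    return ($info{Title}, $new);
}

# Usage: perl retitle.pl one.pdf two.pdf ...
for my $in (@ARGV) {
    (my $out = $in) =~ s/(\.pdf)?$/-titlecased.pdf/i;
    my ($old, $new) = retitle_pdf($in, $out);
    print(defined($new) ? "$in: $old -> $new\n" : "$in: no title found\n");
}
```

Saving to a copy rather than overwriting the input keeps the originals around until the new titles have been checked.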

Sounds fast and easy, right? Well, there were a few hitches:

Perl on my work system is jacked, thanks to a bunch of Oracle files for Perl 5.8.3 that interfere with my Perl 5.8.8 installation. Rather than try to sort that out, I decided to use a clean system instead.
The PDF I was using as a test case threw an error, which I could eliminate if I saved it as an older PDF version (1.4 or lower). I spent quite a bit of time trying to figure that out, until I discovered that the real files I needed to update did not produce that error. (In case you ever need to, though, it is simple to use Acrobat Pro to change the PDF version of multiple files in one fell swoop.)
The title is stored in multiple places. I expected to find it in the info hash, but if the title is also stored in the XML-based XMP metadata, you can get some unexpected results. I changed my script to update both.
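
To make the "update both" idea concrete, here is a rough sketch. It assumes the XMP packet stores the title in the usual dc:title / rdf:Alt / rdf:li layout and that the title is plain ASCII, and it edits the XML with a regular expression rather than a real parser. The function names replace_xmp_title and set_title_everywhere are made up for this example; info and xmpMetadata are the older PDF::API2 method names, so check the documentation for your installed version.

```perl
use strict;
use warnings;

# Replace the text of the <rdf:li> inside <dc:title> in an XMP packet.
# Plain string surgery, not XML parsing; input that does not match the
# assumed layout is returned unchanged.
sub replace_xmp_title {
    my ($xmp, $new_title) = @_;
    my $escaped = $new_title;
    $escaped =~ s/&/&amp;/g;
    $escaped =~ s/</&lt;/g;
    $escaped =~ s/>/&gt;/g;
    $xmp =~ s{(<dc:title>\s*<rdf:Alt>\s*<rdf:li\b[^>]*>)[^<]*(</rdf:li>)}{$1$escaped$2};
    return $xmp;
}

# Write the same title to the info hash and, if present, the XMP packet.
# $pdf is assumed to be an object returned by PDF::API2->open.
sub set_title_everywhere {
    my ($pdf, $new_title) = @_;
    $pdf->info($pdf->info(), Title => $new_title);
    my $xmp = $pdf->xmpMetadata();
    if (defined $xmp && $xmp =~ /<dc:title>/) {
        $pdf->xmpMetadata(replace_xmp_title($xmp, $new_title));
    }
    return;
}
```

Call set_title_everywhere before saving the PDF so that both copies of the title end up in the output file.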

The titlecase script I found treats all-uppercase words as acronyms or other terms that belong in all caps, so it left my all-caps titles untouched. The workaround was simple: I lowercased everything before sending it through the titlecase function. Of course, many of the titles contained terms that really should have stayed in all caps: acronyms such as ADR, ERISA, IRS, and NLRB; abbreviations such as D.C. and U.S.; and Roman numerals such as VII. I added the terms I found to the titlecase script so that it would uppercase them again.
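
The lowercase-then-restore idea can be sketched like this. It is a simplified stand-in, not the modified titlecase script itself: it capitalizes every word (a real titlecase script would leave small words like "of" lowercase), and the term list contains only the examples mentioned above. The names %KEEP_UPPER, fix_word, and titlecase_keeping_acronyms are made up for illustration.

```perl
use strict;
use warnings;

# Terms that should stay uppercase, keyed by their lowercase form.
my %KEEP_UPPER = map { lc($_) => $_ } qw(ADR ERISA IRS NLRB D.C. U.S. VII);

# Lowercase one word, then either restore a protected term or capitalize it.
sub fix_word {
    my ($word) = @_;
    my $lower = lc $word;
    # Split off trailing punctuation such as a comma or colon.
    my ($core, $tail) = $lower =~ /^(.*?)([,;:!?)]*)$/;
    return $KEEP_UPPER{$core} . $tail if exists $KEEP_UPPER{$core};
    # Retry without a sentence-ending period ("irs." becomes "IRS.").
    if ($core =~ /^(.+)\.$/ && exists $KEEP_UPPER{$1}) {
        return $KEEP_UPPER{$1} . '.' . $tail;
    }
    return ucfirst($core) . $tail;
}

sub titlecase_keeping_acronyms {
    my ($title) = @_;
    return join ' ', map { fix_word($_) } split ' ', $title;
}

print titlecase_keeping_acronyms('IRS GUIDANCE ON ERISA PLANS IN WASHINGTON, D.C.'), "\n";
# prints: IRS Guidance On ERISA Plans In Washington, D.C.
```

Any term missing from the list is simply capitalized like an ordinary word, which is why reviewing the output and growing the list as you go matters.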

I had the script produce a text file listing each filename, original title, and updated title so that I could check the results for errors. Names like McDonnell were converted to Mcdonnell. On top of that, some of the titles contained OCR errors. But cobbling together a script and fixing those few errors by hand was still less time-consuming, and far less tedious, than manually titlecasing 600 PDFs.
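
A review file like that can be as simple as tab-separated text. The sketch below assumes each result is a hash reference with the keys file, old, and new; those keys and the function name write_title_report are made up for this example.

```perl
use strict;
use warnings;

# Write a tab-separated report: a header line, then one line per PDF
# with the filename, the original title, and the updated title.
sub write_title_report {
    my ($path, @rows) = @_;
    open my $fh, '>', $path or die "Cannot write $path: $!";
    print {$fh} join("\t", qw(file original updated)), "\n";
    for my $row (@rows) {
        print {$fh} join("\t", @{$row}{qw(file old new)}), "\n";
    }
    close $fh or die "Cannot close $path: $!";
    return scalar @rows;    # number of PDFs listed
}
```

Opening the result in a spreadsheet puts the old and new titles side by side, which makes slips like Mcdonnell easier to spot.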
