Using Perl and PDF::API2 to Update PDF Properties and Metadata
What do you do when you have 600 PDF documents with titles in all caps, when you need the titles to be title-cased? I dreaded the thought of asking anyone to open each document and edit the titles by hand, not to mention fearing the typos that process might introduce.
For better or worse, here was my solution:
- Grab the Perl PDF::API2 module from CPAN
- Grab a Perl titlecase script
- Write a script that opens the PDF, titlecases the title, and saves the PDF
Sounds fast and easy, right? Well, there were a few hitches:
- Perl on my work system is jacked, thanks to a bunch of Oracle files for Perl 5.8.3 that interfere with my Perl 5.8.8 installation. Rather than try to sort that out, I decided to use a clean system instead.
- The PDF I was using as a test case threw an error, which I could eliminate if I saved it as an older PDF version (1.4 or lower). I spent quite a bit of time trying to figure that out, until I discovered that the real files I needed to update did not produce that error. (In case you ever need to, though, it is simple to use Acrobat Pro to change the PDF version of multiple files in one fell swoop.)
- The title is stored in multiple places. I expected to find it in the info hash, but if the title is also stored in the XML-based XMP metadata, you can get some unexpected results. I updated my script to update both.
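A minimal sketch of updating the title in both places might look like the following. It is not the original script: `simple_titlecase`, `retitle_xmp`, and `update_pdf_title` are hypothetical names, the XMP handling assumes the title sits in a plain `dc:title` / `rdf:li` element, and the `info`, `xmpMetadata`, and `saveas` calls are the older PDF::API2 method names, so check the documentation for the version you have installed.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Stand-in for a real titlecase function (hypothetical): capitalize each word.
sub simple_titlecase {
    my ($title) = @_;
    return join ' ', map { ucfirst(lc($_)) } split ' ', $title;
}

# Rewrite the first dc:title entry in an XMP packet and leave the rest alone.
# Assumes the title text contains no nested markup.
sub retitle_xmp {
    my ($xmp, $fix) = @_;
    $xmp =~ s{(<dc:title>\s*<rdf:Alt>\s*<rdf:li[^>]*>)(.*?)(</rdf:li>)}{
        my ($open, $title, $close) = ($1, $2, $3);
        $open . $fix->($title) . $close;
    }se;
    return $xmp;
}

# Open a PDF, fix the title in the info hash and in the XMP metadata,
# and write the result to a new file so the original stays untouched.
sub update_pdf_title {
    my ($in, $out) = @_;
    require PDF::API2;    # loaded here so the helpers above work without it
    my $pdf = PDF::API2->open($in);

    my %info = $pdf->info();
    if (defined $info{Title}) {
        $info{Title} = simple_titlecase($info{Title});
        $pdf->info(%info);
    }

    my $xmp = $pdf->xmpMetadata();
    if (defined $xmp && $xmp =~ /<dc:title>/) {
        $pdf->xmpMetadata(retitle_xmp($xmp, \&simple_titlecase));
    }

    $pdf->saveas($out);
}

# Usage: perl retitle.pl in.pdf out.pdf
update_pdf_title(@ARGV) if @ARGV == 2;
```

Writing to a separate output file is a deliberate choice in this sketch: with several hundred documents, it is easier to compare before and after than to recover from an in-place mistake.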
The titlecase script I found treats all-uppercase words as acronyms or other terms that belong in all caps, so it left my all-caps titles untouched. The workaround was simple: I lowercased everything before sending it through the titlecase function. Of course, many of the titles contained acronyms and other terms that should have been left in all caps: ADR, ERISA, IRS, and NLRB; D.C. and U.S.; and Roman numerals, e.g. VII. I added the terms I found to the titlecase script so that it would uppercase them again.
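The lowercase-then-restore idea can be sketched in a few lines of plain Perl. This is an illustration only, not the titlecase script mentioned above: the function name `titlecase_all_caps`, the small-word list, and the punctuation handling are assumptions, and the list of protected terms is just the handful named in this post.

```perl
use strict;
use warnings;

# Terms that must stay in all caps, keyed by their lowercased form.
my @KEEP_UPPER = qw(ADR ERISA IRS NLRB D.C. U.S. VII);
my %UPPER = map { ( lc($_) => $_ ) } @KEEP_UPPER;

# Short words left lowercase unless they start the title (assumed list).
my %SMALL = map { $_ => 1 } qw(a an and at by for in of on or the to);

sub titlecase_all_caps {
    my ($title) = @_;
    my @words = split ' ', lc $title;    # lowercase everything first
    for my $i (0 .. $#words) {
        my $word = $words[$i];
        # Split off trailing punctuation so "IRS," still matches "irs".
        my ($core, $trail) = $word =~ /^(.*?)([,;:!?]*)$/;
        if (exists $UPPER{$core}) {
            $words[$i] = $UPPER{$core} . $trail;    # restore the acronym
        }
        elsif ($i > 0 && $SMALL{$core}) {
            $words[$i] = $word;                     # keep small words lowercase
        }
        else {
            $words[$i] = ucfirst $word;             # ordinary word
        }
    }
    return join ' ', @words;
}

print titlecase_all_caps('IRS GUIDANCE ON ERISA PLANS IN THE U.S.'), "\n";
```

A lookup table like this only knows the terms you give it, so the list has to grow as you find new acronyms in the titles.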
I had the script produce a text file listing the filename, the original title, and the updated title for each PDF so that I could check the results for errors. Names like McDonnell were converted to Mcdonnell. On top of that, some of the titles contained OCR errors. But cobbling together a script and fixing those few errors by hand was still less time-consuming, and far less tedious, than manually titlecasing 600 PDFs.
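For the review file, a tab-separated line per document is enough. The sketch below is an assumption about the format, not the original script: `log_title_change` is a hypothetical helper, the filename and titles are made up, and it writes to an in-memory filehandle only to keep the demo self-contained.

```perl
use strict;
use warnings;

# Write one tab-separated line: filename, original title, updated title.
sub log_title_change {
    my ($fh, $file, $old, $new) = @_;
    print {$fh} join("\t", $file, $old, $new), "\n";
}

# Demo: collect the report in a string instead of a file on disk.
my $report = '';
open my $fh, '>', \$report or die "Cannot open in-memory report: $!";
log_title_change($fh, 'example.pdf', 'MCDONNELL V. IRS', 'Mcdonnell v. IRS');
close $fh;

print $report;
```

In a real run you would open an ordinary file for writing instead, then load it into a spreadsheet or text editor and scan the third column for names and OCR errors that need a manual fix.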