Automated PDF modification
December 17, 2012 1:55 PM

Is there an automated way of placing elements from one PDF file into another? Open to coding this via Python if a relevant module exists.

I have two PDFs, Doc 1 and Doc 2. All pages in both documents are US Letter sized. Doc 1 contains large tables of values (generated in Excel with PDFCreator), one per page; the tables are the only content on those pages. Doc 2 contains pages with my company's border and header info (title, page number, etc.). The Doc 1 tables go into the Doc 2 bordered pages. I would like an automated way of taking each table from Doc 1 and placing it on its own page in Doc 2, without overlapping the existing elements.

I'm decently proficient in Python and think I could code something if modules for handling PDFs or vector graphics exist. I've done nontrivial programming with the xlrd3 module for working with our Excel files. Here is how I think the code might work:

Open Doc 1, Doc 2 as vector images
For each page in Doc 1, Doc 2:
- Get content bounds in Doc 1
- Scale content in Doc 1 to fit in Doc 2 borders
- Insert in Doc 2 at [coordinates]

What Python modules would I need to do this? Or any other approaches would be welcome.

- This task recurs every few weeks at work, and the pages can total over a hundred. I currently handle it by printing Doc 2 (the page borders), then refeeding those pages into the printer and printing Doc 1 on top. This is inconvenient for three reasons: the printer often fails to grab the pages; I have to run back to my computer to issue each job (it's not actually 2 files each time, more like 14 separate pairs of files that must be printed separately); and coworkers occasionally print over my pages.

- I often paste Excel tables directly into Word docs. That fails here because Word mangles the formatting: the tables have lots of merged cells and are landscape-oriented on portrait-oriented pages. I can sort of transpose the tables in Excel and set the text orientation in Word to vertical, but each table requires a ton of cleanup, and there are many of them.

- I can manually combine the docs by opening the PDFs in Inkscape (or other vector illustrating program), but again, 100+ tables. Inkscape opens up one page at a time.
posted by mnemonic to Computers & Internet (4 answers total) 1 user marked this as a favorite
pyPdf is one way of attacking the problem. One word of caution with formats like PDF: there is a specification, but expect every tool that reads or produces PDFs to have its own set of quirks.
posted by mmascolino at 2:18 PM on December 17, 2012 [1 favorite]
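
A minimal sketch of how the pyPdf route could look, under several assumptions. It uses the modern `pypdf` package (a successor to pyPdf whose API differs from the 2012 library). It treats each table's whole page box as the "content bounds" and does not try to detect the real ones. It also assumes Doc 2 has one bordered page per table and that no page carries a rotation flag. The names `fit_transform`, `overlay_tables`, and `frame` are made up for illustration.

```python
def fit_transform(src, dst):
    """Return (scale, tx, ty) that fits box `src` inside box `dst`, centred.

    Boxes are (left, bottom, width, height) in PDF points; the scale is
    uniform so the table keeps its aspect ratio.
    """
    sl, sb, sw, sh = src
    dl, db, dw, dh = dst
    scale = min(dw / sw, dh / sh)
    tx = dl + (dw - sw * scale) / 2 - sl * scale
    ty = db + (dh - sh * scale) / 2 - sb * scale
    return scale, tx, ty


def overlay_tables(tables_path, borders_path, out_path, frame):
    """Draw page N of the tables PDF inside `frame` on page N of the borders PDF.

    `frame` is the free area inside the company border, given as
    (left, bottom, width, height) in points. It has to be measured by hand.
    """
    # Third-party dependency, imported here so fit_transform works without it.
    from pypdf import PdfReader, PdfWriter, Transformation

    tables = PdfReader(tables_path)
    borders = PdfReader(borders_path)
    writer = PdfWriter()

    for table_page, border_page in zip(tables.pages, borders.pages):
        box = table_page.mediabox  # assumption: whole page = table bounds
        src = (float(box.left), float(box.bottom),
               float(box.width), float(box.height))
        scale, tx, ty = fit_transform(src, frame)
        # Scale first, then move the scaled table into the frame.
        ctm = Transformation().scale(scale, scale).translate(tx, ty)
        border_page.merge_transformed_page(table_page, ctm)
        writer.add_page(border_page)

    with open(out_path, "wb") as fh:
        writer.write(fh)
```

A call might look like `overlay_tables("doc1.pdf", "doc2.pdf", "combined.pdf", frame=(36, 72, 540, 648))`, where the file names and frame numbers are placeholders. Because the whole page box is scaled, any white margin around the table shrinks with it; cropping Doc 1 first would give a tighter fit.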

There is a pdf2svg utility, which is open source and pre-packaged in Ubuntu at least; once the pages are SVG, they are just text and can be handled via XML parsing. (Or there's likely a Python SVG-handling library somewhere.)
posted by XMLicious at 2:27 PM on December 17, 2012
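
If the pages have been converted to SVG, the XML-parsing idea could be sketched roughly as below with the standard library's `xml.etree.ElementTree`. This is an illustration under assumptions, not a tested pipeline: it assumes each table SVG has a `viewBox`, nests the table as a child `<svg>` inside the border page, and prefixes the table's ids so they cannot clash with ids in the border file. `place_table`, `prefix_ids`, and the `tbl-` prefix are hypothetical names.

```python
import re
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("", SVG_NS)
ET.register_namespace("xlink", XLINK_NS)

_REF = re.compile(r"#([A-Za-z_][\w.\-]*)")


def prefix_ids(root, prefix):
    """Rename every id in the tree, plus the '#id' references pointing at them."""
    ids = {el.get("id") for el in root.iter() if el.get("id")}

    def rename(match):
        name = match.group(1)
        # Leave things like colour values (#ff0000) alone unless they are ids.
        return "#" + prefix + name if name in ids else match.group(0)

    for el in root.iter():
        for key, value in list(el.attrib.items()):
            if key == "id":
                el.set(key, prefix + value)
            else:
                el.set(key, _REF.sub(rename, value))


def place_table(border_svg, table_svg, frame):
    """Return border_svg with table_svg nested inside `frame`.

    `frame` is (x, y, width, height) in the border page's user units,
    with y measured from the top of the page as usual in SVG.
    """
    border = ET.fromstring(border_svg)
    table = ET.fromstring(table_svg)
    if "viewBox" not in table.attrib:
        raise ValueError("table SVG needs a viewBox to be rescaled")
    prefix_ids(table, "tbl-")
    # A nested <svg> scales its viewBox into x/y/width/height; the default
    # preserveAspectRatio keeps proportions and centres the content.
    for key, value in zip(("x", "y", "width", "height"), frame):
        table.set(key, str(value))
    border.append(table)
    return ET.tostring(border, encoding="unicode")
```

The result would still have to be converted back to PDF with some other tool, which this sketch does not cover.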

With PDFtk I would:
  • Split Doc1 and Doc2 into individual pages
  • With PDFjam or Multivalent, scale the pages of Doc1 to fit Doc2
  • Place the content of the Doc1 page as a stamp on Doc2
  • Merge all the resulting pages into one file.

posted by scruss at 2:33 PM on December 17, 2012
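
A condensed variant of the PDFtk steps above, sketched as a Python wrapper so the 14 or so file pairs can be run in one go. It relies on pdftk's `multistamp` operation, which overlays page N of the stamp file onto page N of the input file, so the split and merge steps fold into a single command per pair. The scaling step is left out here and would still need PDFjam or similar beforehand. The function names and file names are hypothetical, and `pdftk` is assumed to be on the PATH.

```python
import subprocess


def multistamp_command(borders_pdf, tables_pdf, out_pdf):
    """Build the pdftk argv that stamps each table page onto a border page."""
    return ["pdftk", borders_pdf, "multistamp", tables_pdf, "output", out_pdf]


def stamp_all(pairs, run=subprocess.run):
    """Run pdftk once per (borders_pdf, tables_pdf, out_pdf) tuple.

    `run` is injectable so the commands can be inspected without pdftk
    installed; by default it is subprocess.run, which raises on failure
    because check=True is passed.
    """
    commands = [multistamp_command(*pair) for pair in pairs]
    for argv in commands:
        run(argv, check=True)
    return commands
```

For example, `stamp_all([("borders_01.pdf", "tables_01.pdf", "combined_01.pdf")])` would run one pdftk job; the list can hold as many pairs as needed.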

If you use PHP at all, I would suggest TCPDF. You can use images, documents, or another PDF as a template, and then overlay whatever you want onto the template. It supports PDF forms as well. It's basically the workhorse of any PDF-based application I create these days.
posted by thanotopsis at 5:10 PM on December 17, 2012
