Automated PDF modification
December 17, 2012 1:55 PM
Is there an automated way of placing elements from one PDF file into another? Open to coding this via Python if a relevant module exists.
I have two PDFs, Doc 1 and Doc 2. All pages in both documents are US Letter sized. Doc 1 contains large tables of values (generated in Excel with PDFCreator), one per page; the tables are the only content on those pages. Doc 2 contains pages with my company's border and header info (title, page number, etc.). The Doc 1 tables go into the Doc 2 bordered pages. I would like an automated way of taking each table in Doc 1 and placing it on a separate page of Doc 2, without overlapping elements.
I'm decently proficient in Python and think I could code something if some PDF/vector graphics handling modules exist. I've done nontrivial programming with the xlrd3 module for working with our Excel files. How I think the code for this might work:
Open Doc 1, Doc 2 as vector images
For each page in Doc 1, Doc 2:
- Get content bounds in Doc 1
- Scale content in Doc 1 to fit in Doc 2 borders
- Insert in Doc 2 at [coordinates]
What Python modules would I need to do this? Or any other approaches would be welcome.
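The "scale to fit" and "insert at coordinates" steps in the outline above come down to a little arithmetic once the two rectangles are known. Below is a minimal sketch of that arithmetic only, assuming the table's bounding box and the free area inside the border have already been measured in PDF points (1/72 inch, origin at the bottom-left of the page). The names `Box` and `fit_into` are hypothetical, and a PDF library would still be needed to read the pages and apply the result.

```python
from typing import NamedTuple, Tuple


class Box(NamedTuple):
    """Axis-aligned rectangle in PDF points; (x0, y0) is the lower-left corner."""
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def width(self) -> float:
        return self.x1 - self.x0

    @property
    def height(self) -> float:
        return self.y1 - self.y0


def fit_into(content: Box, frame: Box) -> Tuple[float, float, float]:
    """Return (scale, tx, ty) that maps `content` into `frame`.

    The aspect ratio is preserved and the content is centered in the frame.
    A point (x, y) of the content lands at (scale * x + tx, scale * y + ty).
    """
    if content.width <= 0 or content.height <= 0:
        raise ValueError("content box must have positive width and height")
    # Use the smaller ratio so the content fits in both directions.
    scale = min(frame.width / content.width, frame.height / content.height)
    # Center the scaled content, then undo the content's own scaled offset.
    tx = frame.x0 + (frame.width - scale * content.width) / 2 - scale * content.x0
    ty = frame.y0 + (frame.height - scale * content.height) / 2 - scale * content.y0
    return scale, tx, ty


# Hypothetical numbers: a 200 x 100 pt table placed into a 100 x 100 pt area.
table_bounds = Box(0, 0, 200, 100)
border_area = Box(50, 50, 150, 150)
print(fit_into(table_bounds, border_area))
```

The three numbers correspond to a PDF transformation matrix of the form [scale 0 0 scale tx ty], which is the shape most PDF toolkits accept when placing one page onto another.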
Notes
- This task recurs in my work every few weeks, and the pages can total over a hundred. I currently handle it by printing out Doc 2 (page borders), then refeeding the pages into the printer and printing Doc 1 on top. This is inconvenient for three reasons: the printer often fails to grab the pages; I have to run back to my computer to issue each next job (it's not actually 2 files each time, more like 14 separate pairs of files that must be printed separately); and occasionally I get tripped up by coworkers printing over my pages.
- I often paste Excel tables directly into Word docs. This fails here because the formatting gets mangled in Word from having lots of merged cells and landscape oriented tables in portrait orientation pages. I can sort of transpose the tables in Excel, and set the text orientation in Word to vertical, but each table requires a ton of cleanup, and there are many of them.
- I can manually combine the docs by opening the PDFs in Inkscape (or other vector illustrating program), but again, 100+ tables. Inkscape opens up one page at a time.
There is a pdf2svg utility which is open source and pre-packaged in Ubuntu at least; once the pages are SVG, they're just text and can be handled via XML parsing. (Or likely there's a Python SVG-handling library somewhere.)
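As a rough illustration of the XML-parsing route, here is a sketch that reads the page dimensions from an SVG document with Python's standard library. The sample SVG is hand-written for the example, not real pdf2svg output, and it is an assumption that the converted files carry a `viewBox` attribute on the root element. The helper name `svg_view_box` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for one converted page (US Letter is 612 x 792 points).
SAMPLE_SVG = (
    '<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 612 792">'
    '<rect x="50" y="100" width="512" height="592"/>'
    '</svg>'
)


def svg_view_box(svg_text):
    """Return (min_x, min_y, width, height) from the root element's viewBox."""
    root = ET.fromstring(svg_text)
    view_box = root.get("viewBox")
    if view_box is None:
        raise ValueError("root <svg> element has no viewBox attribute")
    # viewBox values may be separated by spaces, commas, or both.
    parts = view_box.replace(",", " ").split()
    if len(parts) != 4:
        raise ValueError("viewBox must contain exactly four numbers")
    min_x, min_y, width, height = (float(p) for p in parts)
    return min_x, min_y, width, height


print(svg_view_box(SAMPLE_SVG))
```

Note that the viewBox gives the page size, not the bounds of the table itself; finding the content bounds would mean walking the child elements, which depends on how the converter writes paths and text.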
posted by XMLicious at 2:27 PM on December 17, 2012
Best answer: With PDFtk I would:
- Split Doc1 and Doc2 into individual pages
- With PDFjam or Multivalent, scale the pages of Doc1 to fit Doc2
- Place the content of the Doc1 page as a stamp on Doc2
- Merge all the resulting pages into one file.
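The split, stamp, and merge steps above could be scripted from Python along these lines. This is a sketch under assumptions rather than a tested pipeline: it only builds the PDFtk argument lists (using PDFtk's `burst`, `stamp`, and `cat` operations), it leaves out the scaling step entirely, and the file names and the `build_commands` helper are hypothetical.

```python
def build_commands(tables_pdf, borders_pdf, page_count, output_pdf="combined.pdf"):
    """Return PDFtk argument lists for: split both files, stamp, then merge.

    Assumes both input files have `page_count` pages and that the table pages
    are already sized to fit inside the border (the scaling step is not done here).
    """
    commands = [
        # Split each document into one file per page.
        ["pdftk", tables_pdf, "burst", "output", "table_%04d.pdf"],
        ["pdftk", borders_pdf, "burst", "output", "border_%04d.pdf"],
    ]
    stamped = []
    for n in range(1, page_count + 1):
        out = f"stamped_{n:04d}.pdf"
        # Put the table page on top of the matching bordered page.
        commands.append(
            ["pdftk", f"border_{n:04d}.pdf", "stamp", f"table_{n:04d}.pdf", "output", out]
        )
        stamped.append(out)
    # Join the stamped pages back into a single file, in page order.
    commands.append(["pdftk", *stamped, "cat", "output", output_pdf])
    return commands


for cmd in build_commands("doc1.pdf", "doc2.pdf", page_count=2):
    print(" ".join(cmd))
```

With PDFtk installed, each list could be passed to `subprocess.run(cmd, check=True)` in order. PDFtk also has a `multistamp` operation that stamps page by page without splitting first, which may shorten this if no per-page scaling is needed.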
posted by scruss at 2:33 PM on December 17, 2012
If you use PHP at all, I would suggest TCPDF. You can use images, documents, or another PDF as a template, and then overlay whatever you want onto the template. It supports PDF forms as well. It's basically the workhorse of any PDF-based application I create these days.
posted by thanotopsis at 5:10 PM on December 17, 2012
posted by mmascolino at 2:18 PM on December 17, 2012 [1 favorite]