Coding Project - How to get started merging PDFs
August 9, 2016 8:15 AM

I'd like to tackle a (I think) small coding project. How do I get started, assuming I know basically nothing? I'd like to have some way of batch merging pdf files based on file names.

For example:

If I have two pdf files:
Customer Name - Order1.pdf
Customer Name - Receipt1.pdf

I'd like to combine the two files into one pdf, named Customer Name - Order1.pdf

I have some software which I can use to do this manually (pdfsam), but it's becoming a hassle to merge a large batch of files manually. I need some way of doing this in one step for a large number of files on a daily/weekly basis.

I've been wanting a small project to learn some basic coding. Is this doable for someone with more or less no experience? How would you get started? PDFsam has a command line interface, if that's helpful. What language would I use? Where would I get started learning how to tackle this?

Thanks,
posted by pilibeen to Computers & Internet (11 answers total) 8 users marked this as a favorite
 
What platform are you planning to develop on?
What workflow do you want the software to have: drop a bunch of file icons on it and have it automatically process 'em all into a standard folder? Work from the command line?
posted by scruss at 8:29 AM on August 9, 2016 [1 favorite]


You could probably do this with a shell script.
posted by mskyle at 8:31 AM on August 9, 2016


Best answer: This is definitely a good beginner's project. You could do it in any number of languages--Python is beginner friendly and powerful (and you could use the PyPDF2 library to merge the PDFs).

You could start by breaking the whole thing up into discrete problems. For example:
1. Get a list of the PDFs you want to process
2. Group the PDFs by filename
3. Merge the resulting groups
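As a rough illustration of steps 1 and 2, here is a minimal Python sketch using only the standard library. It assumes every file follows the naming scheme from the question ("<customer> - Order<number>.pdf" and "<customer> - Receipt<number>.pdf"); the names NAME_PATTERN and group_pdfs are made up for this example. Step 3 is only printed, not performed, so a merge library or tool would still need to be plugged in.

```python
import re
from collections import defaultdict
from pathlib import Path

# Assumed naming scheme, taken from the example in the question:
#   "<customer> - Order<number>.pdf" and "<customer> - Receipt<number>.pdf"
NAME_PATTERN = re.compile(
    r"^(?P<customer>.+) - (?P<kind>Order|Receipt)(?P<number>\d+)\.pdf$"
)


def group_pdfs(filenames):
    """Group file names by (customer, number), with the order listed first."""
    groups = defaultdict(list)
    for filename in filenames:
        match = NAME_PATTERN.match(filename)
        if match is None:
            continue  # not named like an order or receipt, so leave it alone
        groups[(match["customer"], match["number"])].append(filename)
    # Within a group, "...Order" sorts before "...Receipt"
    return {key: sorted(files) for key, files in groups.items()}


if __name__ == "__main__":
    # Step 1: list the PDFs in the current folder
    names = [p.name for p in Path(".").glob("*.pdf")]
    # Step 2: group them; step 3 (the actual merge) is only printed here
    for key, files in sorted(group_pdfs(names).items()):
        if len(files) == 2:
            print("would merge:", files)
```

Keeping the grouping in its own function means it can be tried out on a plain list of file names before any real PDFs are touched.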

(If you just want to get the job done, the fastest way would be just to use pdfsam's command line interface wrapped in a simple shell script, PowerShell if you're using Windows).
posted by dadaclonefly at 8:32 AM on August 9, 2016 [5 favorites]


Best answer: Using a CLI isn't programming, and even a shell script that calls a command line utility isn't exactly programming.

I'd suggest Python for this.

https://www.python.org/downloads/windows/

http://docs.python-guide.org/en/latest/

You may need to google up some basic Python tutorials before these links make immediate sense, if the guide above isn't enough. There's tons of em out there.

http://stackoverflow.com/questions/17104926/pypdf-merging-multiple-pdf-files-into-one-pdf

https://www.boxcontrol.net/merge-pdf-files-with-under-10-lines-in-python.html

https://www.binpress.com/tutorial/manipulating-pdfs-with-python/167

https://pythonhosted.org/PyPDF2/PdfFileMerger.html

https://github.com/mstamy2/PyPDF2
posted by snuffleupagus at 8:32 AM on August 9, 2016 [1 favorite]


I agree with everyone suggesting python. If you're on a Mac or a Linux machine, you probably already have it installed and could get programs running with very little overhead. As in: type your program into a text file and then execute it.

If it was me, I'd probably check out some online python tutorials and then just start trying stuff and hacking on it til it worked. This sounds like a pretty self-contained task so repeatedly iterating and learning as you go would not be terrible.
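To make "type your program into a text file and then execute it" concrete, here is a tiny sketch. It assumes Python 3 is installed; the file name list_pdfs.py and the function name list_pdfs are just placeholders for this example.

```python
# list_pdfs.py (hypothetical file name)
from pathlib import Path


def list_pdfs(folder="."):
    """Return the names of the PDF files in a folder, sorted alphabetically."""
    return sorted(p.name for p in Path(folder).glob("*.pdf"))


if __name__ == "__main__":
    # Print one file name per line for the folder the script is run from
    for name in list_pdfs():
        print(name)
```

Saved in the folder that holds the PDFs, it would be run from a terminal with something like "python3 list_pdfs.py".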
posted by paper chromatographologist at 8:35 AM on August 9, 2016 [1 favorite]


Best answer: "Using a CLI isn't programming, and even a shell script that calls a command line utility isn't exactly programming."

CLI and shell script are absolutely programming --
echo "hello world"
is exactly as "programming" as the canonical first program in any other language. I often write one-line shell loops to do stuff like this, and if I have to do it often, I copy the one-line shell script into a file so I can just type
./run-that-shell-loop-I-like
On linux/unix, I think shell scripting is your best bet. On Windows, it sounds like you could use PowerShell or Cygwin.

If you want to learn a language that even snuffleupagus would agree is a "programming" language, then Python would be a reasonable choice.

Here is a website that looks like it might help you: "Automate the Boring Stuff with Python: Practical Programming for Total Beginners"

Good luck!
posted by spacewrench at 9:09 AM on August 9, 2016 [2 favorites]


Best answer: PDFTK does this nicely, if you know a little bit of bash scripting (and are on mac or linux). I'm assuming that all files are named as you provided (with "-" as the delimiter between name and type, and that all of them end in "Order#.pdf" and "Receipt#.pdf")

Then you could drop into a directory full of these and do something like this:
#loop over every order file - the quotes keep the spaces in the file names intact
for order in *" - Order"*.pdf; do
   [ -e "$order" ] || continue          #nothing matched the pattern
   #get the customer name: everything before " - Order"
   name=${order%% - Order*}
   #get the order number: everything between "Order" and ".pdf"
   number=${order##* - Order}
   number=${number%.pdf}
   receipt="$name - Receipt$number.pdf"
   [ -e "$receipt" ] || continue        #skip orders that have no receipt
   #merge them - the "echo" makes it just print this command so you can check it for sanity. Remove the echo to actually run it.
   echo pdftk "$order" "$receipt" cat output "$name - OrderCombined$number.pdf"
done
This code has not been checked for correctness, but oughta get you mostly there. If you already have pdfsam, you could swap its command in for the pdftk command as appropriate.
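For anyone going the Python route instead, the same "print the command first, run it later" idea could be sketched like this. It assumes pdftk is installed and on the PATH and that Python is 3.8 or newer (for shlex.join); the function names pdftk_merge_command and merge are invented for this example. Like the script above, it has not been run against real PDFs.

```python
import shlex
import subprocess


def pdftk_merge_command(order, receipt, output):
    """Build the argument list for: pdftk ORDER RECEIPT cat output OUTPUT."""
    return ["pdftk", order, receipt, "cat", "output", output]


def merge(order, receipt, output, dry_run=True):
    """Print the command when dry_run is true; otherwise run pdftk."""
    command = pdftk_merge_command(order, receipt, output)
    if dry_run:
        # Quoted so the printed line can be pasted into a shell as-is
        print(shlex.join(command))
        return command
    # check=True raises CalledProcessError if pdftk exits with an error
    subprocess.run(command, check=True)
    return command


if __name__ == "__main__":
    merge(
        "Customer Name - Order1.pdf",
        "Customer Name - Receipt1.pdf",
        "Customer Name - OrderCombined1.pdf",
    )
```

Passing the arguments as a list rather than one long string means file names with spaces need no extra quoting when the command is actually run.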
posted by chrisamiller at 9:35 AM on August 9, 2016


"If you want to learn a language that even snuffleupagus would agree is a "programming" language, then Python would be a reasonable choice."


This is pretty funny, because I'm a mostly incompetent hobby programmer at best (at least, for the time being) and a language tourist. I'm not a dogmatist.

The distinction between shell scripts and programming (especially in a language like Python) may be somewhat illusory, but it seemed to fit the ask. Which was more about 'learning to code,' than the simplest solution to the task of merging PDFs (as I read it.) Otherwise the simplest answer (on Windows) is a simple batch file that runs PDFsam against all the files in a temp directory. But that doesn't teach you much about coding, it teaches you how to automate calling a utility from the command line.

If OP wants this to be an entree to general programming, something platform specific like Bash or PowerShell probably isn't the greatest choice (Cygwin notwithstanding.)
posted by snuffleupagus at 9:50 AM on August 9, 2016


Best answer: pilibeen's question is the basis of so many rewarding programming tasks: I have a manual task that I do a lot and it is boring. If you can use a computer to automate that somehow, then it's programming. All programming languages share the concepts needed here: looping through a bunch of files, selecting files by matching parts of their names, making decisions based on something to do with the file (name, type, size, etc.), making strings for new file names from parts of old file names, checking to see if files exist, doing something if a process fails, and so on. That makes it neither fair nor helpful to be all ‘hurf durf not programmer’ about it. Use whatever you can find; the choice may be limited if it's a locked-down corporate Windows box.
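A few of those concepts (looping through files, matching on names, building a new name from an old one, checking whether a file exists) fit in a handful of lines. Here is a minimal Python sketch, assuming the "<customer> - Order<number>.pdf" naming from the question; the function name orders_missing_receipts is made up for this example.

```python
from pathlib import Path


def orders_missing_receipts(folder="."):
    """Return the order PDFs in folder that have no matching receipt PDF."""
    missing = []
    # Loop through the files whose names match the order pattern
    for order in sorted(Path(folder).glob("* - Order*.pdf")):
        # Make the receipt's name from parts of the order's name
        customer, _, rest = order.name.rpartition(" - Order")
        receipt = order.with_name(f"{customer} - Receipt{rest}")
        # Decide what to do based on whether that file exists
        if not receipt.exists():
            missing.append(order.name)
    return missing


if __name__ == "__main__":
    for name in orders_missing_receipts():
        print("no receipt found for:", name)
```

Running a check like this before merging would show which orders are going to be skipped.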

Once pilibeen has experienced the joy of their first program actually working in no matter how limited a way (and it is a joy), only then is the time to consider whether another tool could do it better. If the only metric is that it must be faster than doing it by hand, then every scripting language will do great. Even Python …
posted by scruss at 10:54 AM on August 9, 2016 [2 favorites]


Best answer: As chrisamiller suggested, the command line version of pdftk will work nicely for this task. I often use pdftk on windows to combine pdf documents. The version that I use is part of the cygwin suite but the pdftk website shows a native windows version. Mac and linux versions are also available for download.

It looks like the command line version is named "pdftk/server". Quoting from the site: "PDFtk Server is our command-line tool for working with PDFs. It is commonly used for client-side scripting or server-side processing of PDFs."

That's not to say that pdfsam will not work or pdftk is (or is not) better than pdfsam but pdftk will definitely do what you need.

As far as the scripting goes, either bash or python would work just fine. I think bash would probably be better for getting something going quickly but python would be better as a "learn some basic coding" project.
posted by metadave at 12:12 PM on August 9, 2016 [1 favorite]


Response by poster: Thanks for all the excellent advice. I'm currently reading Doug's Bash Guide to get me up and running quickly (hopefully) on pdftk.

My plan is to get a script working for now, and take my time working on a Python solution in the future. I see a lot of Coursera courses centered around beginning Python.

Thanks again for all the help!
posted by pilibeen at 3:57 PM on August 14, 2016

