scraping pdf content?
December 11, 2011 7:21 PM

I have about 4,000 PDFs that I need to scrape data from and put into a database.

The PDFs all follow the same layout and read like this (brackets indicate variable text):

[site-name]
[a short description]
Sub-Heading1
[a few paragraphs]
Sub-Heading2
[a few more paragraphs]

I would like to get this information into a database (a text file or spreadsheet is also fine, as I can get it into a database from there). The fields would be everything above, except for the two Sub-Headings.

I've looked around and haven't found a method that appears able to do this; most tools seem to be set up for PDF tables or tabular data.

Any ideas or leads would be appreciated. I have some (read: very little) Python and PHP experience at my disposal. I have both Windows and Mac machines available. I could afford a reasonably priced product to do this, if it would work well.
posted by buttercup to Computers & Internet (8 answers total) 6 users marked this as a favorite
 
Are they all generated from the same source?

PDF files aren't meant to be readable as ASCII, but they kind of are if you're lucky. Try opening a couple of them in a text editor and see what the structure looks like. If the structure makes sense, you might be able to parse the data from there. I think this is better than converting the files to another format, since that might destroy some of the data structure you need for parsing.
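
If you want to triage all 4,000 at once instead of opening them by hand, here's a rough Python sketch (assumption: the files live in a pdfs/ directory, which is just a placeholder; the presence of FlateDecode in the raw bytes means the content streams are zlib-compressed and will look like binary garbage in a text editor):

import glob

# Flag which PDFs might be readable in a plain text editor.
# FlateDecode marks compressed content streams.
for path in glob.glob("pdfs/*.pdf"):
    with open(path, "rb") as f:
        raw = f.read()
    if b"FlateDecode" in raw:
        print(path, "-> compressed streams, a text editor won't help much")
    else:
        print(path, "-> worth a look in a text editor")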
posted by brorfred at 7:30 PM on December 11, 2011


Check out PDFMiner.
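
For what it's worth, a minimal sketch of text extraction (this uses the high-level API from pdfminer.six, the currently maintained packaging of PDFMiner; the library also ships a pdf2txt.py command-line tool that does the same job):

# Minimal sketch: pull all the text out of one PDF as a single string.
from pdfminer.high_level import extract_text

text = extract_text("site.pdf")  # "site.pdf" is a placeholder filename
print(text[:500])                # eyeball the start to see if the layout survived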
posted by djb at 7:31 PM on December 11, 2011


Have you tried one of these?

http://www.pdftoexcelonline.com/

http://www.zamzar.com/
posted by vidur at 7:31 PM on December 11, 2011


I forgot to mention that you can do wonders by converting the PDF to EPS or PS (PostScript). That format is meant to be readable by humans, and you might be able to parse it more easily. There is a Linux/Unix/Mac command called pdf2ps that does this.
posted by brorfred at 7:36 PM on December 11, 2011


Try running strings on one of the PDFs (on your Mac, in Terminal). If you see the text you want, then you can likely parse it out. If not, you might be stuck: some PDFs are images of text, not the text itself.
posted by spaceman_spiff at 7:47 PM on December 11, 2011


You probably won't be able to get anything useful out of a PDF file using strings, since they (usually) store data as compressed streams. You can use pdftotext if you don't mind throwing away all the formatting.

When you get right down to it, the PDF format only defines simple commands like "move to position (x,y)", "draw line", and "draw character A from font B". If you want anything higher-level than that, you'll need to reconstruct it yourself.
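
To make that concrete for this particular layout, here's a rough end-to-end sketch in Python (assumptions: pdftotext from Poppler/Xpdf is installed, the subheadings are literal constant strings, and "Sub-Heading1"/"Sub-Heading2", "pdfs/", and "scraped.csv" are all placeholders for the real names):

import csv
import glob
import subprocess

rows = []
for path in glob.glob("pdfs/*.pdf"):
    # "-" sends the extracted text to stdout instead of a .txt file
    text = subprocess.run(["pdftotext", path, "-"],
                          capture_output=True, text=True).stdout
    # Split on the two constant subheadings to recover the four fields
    head, _, rest = text.partition("Sub-Heading1")
    section1, _, section2 = rest.partition("Sub-Heading2")
    site_name, _, description = head.strip().partition("\n")
    rows.append([site_name.strip(), description.strip(),
                 section1.strip(), section2.strip()])

with open("scraped.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)

From a spreadsheet-friendly CSV like that, getting into a database is the easy part.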
posted by teraflop at 8:41 PM on December 11, 2011


Apache's PDFBox can parse PDF text, provided the text did not originate from an image. Some PDF writers are lazy and just treat the entire document as one big image. If you can't select and highlight the text in the document, you're out of luck. If you can, you can capture the text programmatically through a simple Java program that then transfers the data to a database.

You can also get a premium package that uses high-end optical character recognition (OCR) software to turn image text into selectable text, but that will cost you a license fee.
posted by DetriusXii at 9:17 PM on December 11, 2011 [1 favorite]


I do a lot of jobs like this.

The first issue is the PDF format. Is what you see on screen an image in the PDF, or active text? You can check by trying to highlight and copy. If the text comes out and pastes, even as a jumble, it is active. If you cannot select it, it is an image.
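
A quick way to run that same test over the whole batch without opening each file (a sketch using pdfminer.six, mentioned above; "pdfs/" is a placeholder, and an empty result usually means image-only pages that will need OCR):

import glob
from pdfminer.high_level import extract_text

# Sort the 4,000 PDFs into "active text" vs "probably needs OCR"
for path in glob.glob("pdfs/*.pdf"):
    has_text = bool(extract_text(path).strip())
    print(path, "-> active text" if has_text else "-> image only, needs OCR")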

For the image type you need OCR, but not all OCR packages are equally effective. Some do well with straight documents and some do well with spreadsheets; you just have to try them out. The quality of the scan is also very important. If the scan was not done at a decent resolution, no software is going to correct that, and even if it seems to, you will still have to go back and check everything, because the OCR software will make assumptions about what certain characters are and produce errors. Hopefully, if it is a static image PDF, the original was scanned well.

If the PDF has active text, there are other solutions. There are PDF converters that will not only extract the text but also keep the formatting, turning the PDF into a Word doc, text file, spreadsheet, or any number of other formats. You can also use OCR apps, as they often handle active-text PDFs as well as static image PDFs. The process is not always perfect, but with a little tweaking of the conversion app, you can usually get output that will allow later cleanup.

Here are the three I work with. I think these are all Windows apps, but there may be Mac and Linux flavors too. Each offers a trial period of some sort without requiring any info other than an email address.

Abbyy Fine Reader
NitroPDF
Acrobat Pro X

Nuance also has a set of decent products (Omnipage, PDF Converter Pro), but I have not used them in a while and don't know if there is a trial option.

As far as the online PDF converters go, they are fine for small jobs - maybe a page or two or ten. For a 4k load, forget it as you have to convert each file individually on most sites. Also, there is often a file size limit, so combining a bunch of PDFs into one file first will be out of the question. What you need is a stand-alone app on your drive or accessible from your computer over a local network.

With the trial software noted above, you should be able to get the conversion part of the job done within the demo period. 4k docs is a lot, but the conversion step is a manageable part of the task and can be automated, depending on the software. Your hardware capacity (RAM, CPUs, etc.) will be a big factor here too. This type of work puts a heavy load on the machine while in progress, so perusing MeFi (if it is the same computer) is probably out of the question while the conversion is happening. What you are shooting for with the conversion is to get the PDF data into a format where you can use a more effective editing platform to do the cleanup; the PDF/OCR apps offer only limited capability in this area.

At this point, there are some options that would take more time to go into, and I cannot really comment further without actually seeing the data. For example, one option is to use a macro to combine the converted docs into a single file with Word, but that may not solve the issue of keeping each PDF's data as a separate record later. That might be solved by combining the PDFs into a single document before conversion, which would break out each PDF as a separate record. Then there is the Excel portion of the job, which also depends on how the data is shaped. There is no single through line here, as the process will be defined by and tailored to your PDFs' content, formatting, and other variables.

Basically, there is more than one way to skin a cat here, and this is where the document-specific properties come into play, as you will be seeking commonalities that can be exploited to place delimiters.

The whole thing is a process that is often defined as you go through it as opposed to marking off a checklist of what you want to do before starting.

If you have questions, MeFiMail.
posted by lampshade at 1:24 PM on December 24, 2011

