scraping pdf content?
December 11, 2011 7:21 PM
I have about 4000 PDFs that I need to scrape data from and put into a database.
The PDFs all follow the same layout, like this (brackets indicate variable text):
[site-name]
[a short description]
Sub-Heading1
[a few paragraphs]
Sub-Heading2
[a few more paragraphs]
I would like to get this information into a database (a text file or spreadsheet is also fine, as I can get it into a database from there). The fields would be everything above, except for the two Sub-Headings.
I've looked around but haven't found a method that really appears to be able to do this; most seem to be set up for extracting PDF tables or data.
Any ideas or leads would be appreciated. I have some (i.e., very little) Python and PHP experience at my disposal. I have both Windows and Mac machines available. I could afford a reasonably priced product to do this, if it would work well.
posted by buttercup to computers & internet (8 answers total) 6 users marked this as a favorite
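For reference, the split described in the question could be sketched roughly as below. This assumes the text of each PDF has already been extracted to a plain string by some other tool, and that the two sub-headings appear verbatim on their own lines. The names parse_report, write_csv, and the four field names are made up for illustration, and "Sub-Heading1"/"Sub-Heading2" stand in for whatever the real headings are.

```python
import csv
import io

# Hypothetical column names for the four fields in the question.
FIELDS = ["site_name", "description", "section1", "section2"]


def parse_report(text, heading1="Sub-Heading1", heading2="Sub-Heading2"):
    """Split one document's extracted text into the four fields.

    Assumes the first non-blank line is the site name and that both
    headings occur on lines of their own, in this order.
    """
    lines = [line.strip() for line in text.splitlines()]
    lines = [line for line in lines if line]  # drop blank lines
    i1 = lines.index(heading1)          # ValueError if the heading is missing
    i2 = lines.index(heading2, i1 + 1)  # second heading must follow the first
    return {
        "site_name": lines[0],
        "description": " ".join(lines[1:i1]),
        "section1": "\n".join(lines[i1 + 1:i2]),
        "section2": "\n".join(lines[i2 + 1:]),
    }


def write_csv(records, out):
    """Write parsed records to an open text file object as CSV."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(records)


if __name__ == "__main__":
    sample = (
        "Example Site\n"
        "A short description.\n"
        "Sub-Heading1\n"
        "First paragraph.\n"
        "Sub-Heading2\n"
        "Second paragraph.\n"
    )
    buffer = io.StringIO()
    write_csv([parse_report(sample)], buffer)
    print(buffer.getvalue())
```

A document whose headings are missing or out of order raises ValueError, which makes it easy to set the odd files aside for a manual look rather than loading bad rows.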
PDF files are not really meant to be read as ASCII text, but if you are lucky they partly are. Try opening a couple of them in a text editor and see what the structure looks like. If the structure makes sense, you might be able to parse the data from there. I think this is better than converting the files to another format, since that might destroy some of the data structure that you need for parsing.
posted by brorfred at 7:30 PM on December 11, 2011
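A quick way to try the "open it in a text editor" check across many files is to inspect the raw bytes in a script. The sketch below is only a rough heuristic, not a PDF parser: it counts stream objects, counts how many mention /FlateDecode (compressed streams look like binary noise in an editor), and pulls out literal strings that sit directly in front of the Tj text-showing operator. The function name inspect_pdf_bytes is made up for illustration, and even uncompressed strings may be unreadable if the fonts use a custom encoding.

```python
import re

# A literal string followed by the Tj (show text) operator, e.g. "(Hello) Tj".
# Handles backslash escapes but not nested, unescaped parentheses.
TJ_PATTERN = re.compile(rb"\(((?:\\.|[^\\()])*)\)\s*Tj")


def inspect_pdf_bytes(data):
    """Report how much of a PDF's content is visible without decoding."""
    return {
        # \b keeps "endstream" from being counted as a second "stream"
        "streams": len(re.findall(rb"\bstream\b", data)),
        "flate_filters": data.count(b"/FlateDecode"),
        "visible_text": [m.decode("latin-1") for m in TJ_PATTERN.findall(data)],
    }


if __name__ == "__main__":
    # Tiny hand-written fragment standing in for a real file's bytes;
    # for a real file, use open(path, "rb").read() instead.
    fragment = (
        b"%PDF-1.4\n1 0 obj\n<< /Length 44 >>\nstream\n"
        b"BT /F1 12 Tf (Example Site) Tj ET\n"
        b"endstream\nendobj\n"
    )
    print(inspect_pdf_bytes(fragment))
```

If flate_filters is close to streams and visible_text comes back empty, the text is compressed and parsing the raw file directly is unlikely to work without decompressing the streams first.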