Scraping PDF content?
December 11, 2011 7:21 PM
I have about 4000 PDFs that I need to scrape data from and put into a database.
The PDFs all follow a similar layout and read like this (brackets indicate variable text):
[a short description]
[a few paragraphs]
[a few more paragraphs]
I would like to get this information into a database (a text file or spreadsheet is also fine, as I can get it into a database from there). The fields would be everything above, except for the two Sub-Headings.
I've looked around, but I haven't found a method that appears able to do this; most seem to be set up for PDF tables or data.
Any ideas or leads would be appreciated. I have some (i.e., very little) Python and PHP experience at my disposal. I have both Windows and Mac machines available. I could afford a reasonably priced product to do this if it works well.
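To make the goal concrete, here is a minimal Python sketch of the splitting step only. It assumes the text of each PDF has already been extracted to a plain string by some separate tool (for example, a command-line converter such as pdftotext), and that the two sub-headings are fixed strings that appear in every document. The heading strings and field names below are placeholders, since the real ones are not shown above.

```python
import csv
import io

# Placeholder sub-heading strings (assumption): replace with the real headings.
HEADING_1 = "SUB-HEADING ONE"
HEADING_2 = "SUB-HEADING TWO"

# Hypothetical column names for the output spreadsheet.
FIELDS = ["description", "section_1", "section_2"]


def split_fields(text):
    """Split one document's extracted text into three fields.

    Everything before HEADING_1 is the description; the text between the
    two headings and the text after HEADING_2 are the two body sections.
    Returns None if either heading is missing or they are out of order.
    """
    first = text.find(HEADING_1)
    if first == -1:
        return None
    second = text.find(HEADING_2, first + len(HEADING_1))
    if second == -1:
        return None
    return {
        "description": text[:first].strip(),
        "section_1": text[first + len(HEADING_1):second].strip(),
        "section_2": text[second + len(HEADING_2):].strip(),
    }


def write_rows(rows, out):
    """Write parsed rows as CSV to an already-open text file object."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
```

Documents for which `split_fields` returns None could be set aside for manual review rather than written to the spreadsheet. The resulting CSV file opens directly in a spreadsheet program or can be imported into a database.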