Need magic to convert PDF to Excel
October 23, 2007 11:46 AM
OCRfilter: any software that will take a PDF and convert it to Excel?
I'm looking for something that will take a PDF or a TIFF of an account statement or the like (columns of numbers) and convert it to Excel.
I know nothing's going to be even close to perfect, but even getting a CSV file would at least save the time it takes to re-key every single number into Excel. Cleanup and tweaking can happen after the numbers themselves are in there.
Anyone use anything that does this?
I've had a lot of luck just copying a table from a PDF and pasting it into Excel.
posted by hydropsyche at 12:02 PM on October 23, 2007
It can generate HTML, and I think Excel can take an HTML table as input, so that should work.
posted by zeoslap at 12:02 PM on October 23, 2007
If you already have MS Office, the Office Document Image Writer virtual printer and Image viewer contain fairly decent (Omnipage engine) OCR. Image viewer can read a TIFF directly and apply OCR.
posted by scruss at 12:39 PM on October 23, 2007
Here's how I have done this in the past.
Use pdftotext (aka pdf2text) to convert the page to raw text. Use the -layout flag to keep everything aligned. Then import into OpenOffice Calc, choosing space as your delimiter and the 'Merge delimiters' option.
I've had fairly good results doing this, sometimes combined with a little bit of scripting in Perl or Ruby to tidy things up prior to the import.
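For anyone who wants to script those steps, here is a rough sketch in Python rather than Perl or Ruby. It assumes pdftotext is installed and on your PATH. The function names and the file names in the usage line are made up for illustration. Note one deliberate difference from the Calc import: it splits on runs of two or more spaces instead of every space, which is my own assumption so that a label like "Opening balance" stays in one cell. Columns separated by only a single space will need a different rule.

```python
import csv
import re
import subprocess
import sys


def pdf_to_layout_text(pdf_path):
    """Run pdftotext with -layout and return the text ('-' sends output to stdout)."""
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


def layout_text_to_rows(text):
    """Split each non-blank line into cells on runs of two or more whitespace characters."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if line:
            rows.append(re.split(r"\s{2,}", line))
    return rows


def write_csv(rows, csv_path):
    """Write the rows to a CSV file that Excel or Calc can open."""
    with open(csv_path, "w", newline="") as f:
        csv.writer(f).writerows(rows)


if __name__ == "__main__":
    # Hypothetical usage: python pdf_to_csv.py statement.pdf statement.csv
    write_csv(layout_text_to_rows(pdf_to_layout_text(sys.argv[1])), sys.argv[2])
```

This only helps with PDFs that contain real text. A scanned statement or a TIFF has to go through OCR first.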
posted by chrisamiller at 12:42 PM on October 23, 2007
ABBYY is pretty good. We do this a lot at my job and we usually use OmniPage to OCR it. It helps if you define the columns and rows before you recognize the text.
posted by kenzi23 at 2:37 PM on October 23, 2007
I believe that Able2Extract is what you're looking for.
posted by MrHappyGoLucky at 12:46 PM on October 25, 2007
This thread is closed to new comments.
posted by zeoslap at 12:00 PM on October 23, 2007