How can I convert a jpeg to plain text?
March 16, 2006 6:16 AM

I am a journalist who sometimes needs to send article clips to prospective employers. I have scanned the newspaper articles and saved them as jpeg files. Is there a program that will allow me to convert the jpeg files to plain text? The jpeg files are too large to e-mail, and I really just need to send the text of the articles.
posted by zembla3 to Computers & Internet (15 answers total)
 
You want some OCR (Optical Character Recognition) software.

I'm a bit out of date on this but you'll often find OCR packages bundled with scanners. A quick google tells me that SimpleOCR is a free OCR package that seems fairly good, but I've not tried it.
posted by handee at 6:28 AM on March 16, 2006


Actually it looks like SimpleOCR needs TIFF input or scanner access. So you might need a graphics package to convert from JPEG to TIFF, but that's not a problem.
posted by handee at 6:30 AM on March 16, 2006


In my experience, editors still like hard copies best. Get thee to an old fashioned photocopier!
posted by CunningLinguist at 6:47 AM on March 16, 2006


I've used OmniPage before and it works surprisingly well.
posted by bondcliff at 7:06 AM on March 16, 2006


Ditto the "send hard copies." Otherwise, don't you have the original Word documents of the articles, or is there an online edition of the publication you're working for? Those are easier than OCR, and don't have the potential pitfall of misspellings/broken words that could result from using OCR.

Also, the scanner you used might have built-in OCR. Worth a shot.
posted by limeonaire at 7:22 AM on March 16, 2006


Alternately, you can scan again and scan to TIFF, then use SimpleOCR, if you're really bent on doing it that way.
posted by limeonaire at 7:23 AM on March 16, 2006


Do you really want to send out text files rather than something that looks like it was printed in a publication, with headlines and fonts and columns and such? If you use OCR, you'll just be sending the text.

I've recently been scanning a lot of documents for work --- mostly hardcopy letters that I need to e-mail around to people. When I scan these into PDF documents at a reasonable DPI (e.g. 300), they come out very small. I'd suggest you look at your scan settings. It may be that you're getting something like 2000 DPI jpegs, which will give you unnecessarily huge files.
posted by alms at 7:43 AM on March 16, 2006
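
For a rough sense of why the scan resolution matters so much, here is a minimal Python sketch of the arithmetic. The US letter page size and 24-bit color depth are assumptions for illustration (the thread does not say what the scanner was set to), and the function names are made up. The figures are uncompressed sizes; JPEG or PDF compression shrinks them, but the pixel count still grows with the square of the DPI.

```python
def scan_pixels(width_in, height_in, dpi):
    """Return (width_px, height_px) for a page scanned at the given DPI."""
    return round(width_in * dpi), round(height_in * dpi)


def raw_size_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed size of the scan in bytes, rounded up to a whole byte."""
    width_px, height_px = scan_pixels(width_in, height_in, dpi)
    return (width_px * height_px * bits_per_pixel + 7) // 8


# Assumed page: US letter, 8.5 x 11 inches, scanned in 24-bit color.
PAGE = (8.5, 11)

size_300 = raw_size_bytes(*PAGE, dpi=300, bits_per_pixel=24)    # 2550 x 3300 pixels
size_2000 = raw_size_bytes(*PAGE, dpi=2000, bits_per_pixel=24)  # 17000 x 22000 pixels

# The ratio is (2000 / 300) squared, so roughly 44 times more data.
ratio = size_2000 / size_300
```

Under these assumptions the 300 DPI scan is about 25 MB before compression and the 2000 DPI scan is over 1 GB, which is why lowering the DPI is usually the first thing to try.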


As a journalist myself who's read a lot of people's clips, I second the hard copies. We read a lot of stuff via email, but newspaper clips just show better in dead-tree form.
posted by GaelFC at 7:47 AM on March 16, 2006


To expand on alms' answer--scan your future clips using the "black and white" setting rather than the "grayscale" setting. Your graphic files will be FAR smaller.

You can do this conversion in Photoshop (or any number of free/cheap image editors), too. You may have to play with the sliders to make things crisp and readable.

This is a good option if you find the OCR results are not to your liking.
posted by bcwinters at 8:14 AM on March 16, 2006
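
To illustrate why "black and white" files come out so much smaller than "grayscale" ones, here is a small Python sketch. It assumes 8-bit grayscale pixels (0 is black, 255 is white) and a simple fixed cutoff, which is roughly what the slider in an image editor controls. The function names and the cutoff of 128 are hypothetical choices, not what any particular scanner or editor does internally.

```python
def threshold(gray_rows, cutoff=128):
    """Map 0-255 grayscale values to 1 (ink) or 0 (paper)."""
    return [[1 if value < cutoff else 0 for value in row] for row in gray_rows]


def pack_bits(bilevel_rows):
    """Pack each row of 0/1 pixels into bytes, 8 pixels per byte, MSB first."""
    packed = bytearray()
    for row in bilevel_rows:
        for start in range(0, len(row), 8):
            chunk = row[start:start + 8]
            byte = 0
            for bit in chunk:
                byte = (byte << 1) | bit
            byte <<= 8 - len(chunk)  # pad a short final chunk with zero bits
            packed.append(byte)
    return bytes(packed)


# One row of eight grayscale pixels: light paper with three dark strokes.
gray = [[250, 240, 30, 20, 245, 10, 200, 128]]

bilevel = threshold(gray)      # [[0, 0, 1, 1, 0, 1, 0, 0]]
packed = pack_bits(bilevel)    # a single byte, 0x34

grayscale_bytes = sum(len(row) for row in gray)  # one byte per pixel
bilevel_bytes = len(packed)                      # one bit per pixel
```

Before any compression, the one-bit version needs an eighth of the space. Raising or lowering the cutoff trades faint strokes against background speckle, which is the "play with the sliders" step.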


Does the original newspaper give you auth to do it? The publisher has rights to it, not you, methinks?
posted by vanoakenfold at 10:13 AM on March 16, 2006


Here are two options that I like, addressing your needs from a slightly different perspective.

1) If you nabbed a couple google web recently or have something similar, I'd just post all your excerpts nice and pretty on the webpage and tell prospective employers to just look at the webpage.

2) Alternatively, why not place all the files in an esnips folder? (They're upgrading for the next couple of hours.) You get one GB of space and it works quite well (use the toolbar for bigger and quicker uploads). I've used it on several occasions. I'd just use my own address as the contact to get the link, check to make sure everything is working properly, and then give your prospective employer the link.

Good luck in your job hunt.
posted by bim at 10:47 AM on March 16, 2006


Also, the JPEG format is anathema to OCR. The compression artifacts it creates will only hinder recognition. JPEG is meant for photorealistic images only. Do not use it for line art or scanned text. Scan to a lossless format such as TIFF or PNG if you ever intend to run the image through OCR.
posted by Rhomboid at 1:42 PM on March 16, 2006
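
As a small illustration of what "lossless" means here, the Python sketch below pushes a made-up page of pixels through zlib, which implements the DEFLATE compression that PNG uses, and checks that every pixel comes back unchanged. The page contents and the function name are hypothetical. This does not read or write real PNG or TIFF files, and it does not simulate JPEG; it only shows the round-trip guarantee that a lossy format does not give.

```python
import zlib


def roundtrip_lossless(pixels):
    """Compress and decompress raw pixel bytes; return (restored, compressed_len)."""
    compressed = zlib.compress(pixels, 9)
    restored = zlib.decompress(compressed)
    return restored, len(compressed)


# Hypothetical scanned page: mostly white rows with a few dark "text" rows.
WIDTH = 200
white_row = bytes([255]) * WIDTH
text_row = bytes([255] * 20 + [0] * 160 + [255] * 20)
page = (white_row * 10 + text_row * 3) * 20

restored, compressed_len = roundtrip_lossless(page)

# Lossless: the restored bytes are identical to the original bytes.
identical = restored == page
```

Scanned text has long runs of identical pixels, so lossless compression still shrinks it well while keeping the hard edges that OCR depends on.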


Acrobat (full version) will OCR an image as well. Works great.
posted by madderhatter at 5:39 PM on March 16, 2006


I know of Kleptomania and Capture Text.

Screenshot here
posted by labnol at 2:23 AM on March 17, 2006


Even Abbyy ScanToOffice does a good job of extracting text from images and PDF files.
posted by labnol at 3:30 AM on March 17, 2006

