

Pdf to text conversion
March 2, 2010 2:00 PM

What is the best way of converting .pdf to .doc or .txt?

I've been using LaTeX for writing articles and I sometimes need to convert PDFs to .doc or .txt. I have experimented with LaTeX to RTF, but I don't find it reliable for some TeX commands. I've also played around with Acrobat Pro but would prefer something free. Is there a reliable way of converting PDFs?

Thanks!
posted by a womble is an active kind of sloth to Computers & Internet (10 answers total) 14 users marked this as a favorite
 
Try this free online service: http://www.convertpdftoword.net/Default.aspx
posted by iam2bz2p at 2:08 PM on March 2, 2010


Basically you're trying to OCR something, and in that world, "free" and "reliable" are mutually exclusive.

ABBYY FineReader is generally recognized as the cream of the crop as far as precision OCR is concerned. It's not as effective for rapid, batch OCR, but if you have a small-ish number of documents that may be mangled or otherwise difficult to OCR, it's your best bet.
posted by jckll at 2:18 PM on March 2, 2010 [1 favorite]


I'm not trying to use OCR - these are PDFs with selectable text that I can copy and paste into a text document if I want to.
posted by a womble is an active kind of sloth at 2:21 PM on March 2, 2010


There are many ways to do this. For example, if you want to convert a PDF to an easily editable image file with embedded text, then OpenOffice Draw with the PDF Import extension is probably the way to go.

But it sounds like you just want workable .txt or .doc files. It's annoying, I know, to have to remove all the line breaks that result when you just select text from a .pdf and copy it over to another document.

Personally, I have had a lot of success with calibre, an ebook document manager and converter. I don't know what operating system you're on, but there are versions of calibre for Windows, OS X, and Linux, so I imagine you can find one that works for you. It produces very nice (paragraph-formatted) conversions of .pdfs, .epubs, and other documents.
posted by koeselitz at 2:29 PM on March 2, 2010 [2 favorites]


You could try pdftohtml followed by your favorite html2txt converter. In my experience this only works for very simple PDFs where you don't care a whole lot about formatting. If you want to convert documents that contain tables and oddly placed columns of text, go with a commercial solution and save yourself a lot of headaches. You may not need OCR, but you will need some intelligence in the app to figure out what is really a paragraph vs. table vs. single line of free floating text.
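
A rough sketch of that two-step pipeline, for anyone who wants to script it. It assumes the pdftohtml that ships with poppler-utils is on your PATH; the function names and the list of block-level tags are made up for illustration. The HTML-to-text step uses only the Python standard library and makes no attempt to understand tables or columns, which is exactly the limitation described above.

```python
import subprocess
from html.parser import HTMLParser

# Tags treated as line boundaries (an illustrative choice, not exhaustive).
BLOCK_TAGS = {"p", "br", "div", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}


class TextExtractor(HTMLParser):
    """Collects visible text and starts a new line at block-level tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip = 0  # depth inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip = max(0, self._skip - 1)
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def html_to_text(html):
    """Strip tags, collapse whitespace, and drop empty lines."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    lines = [" ".join(line.split()) for line in "".join(parser.parts).splitlines()]
    return "\n".join(line for line in lines if line)


def pdf_to_text(pdf_path):
    """Run pdftohtml (assumed to be installed) and convert its output to text."""
    result = subprocess.run(
        ["pdftohtml", "-stdout", "-noframes", "-i", pdf_path],
        capture_output=True, text=True, check=True,
    )
    return html_to_text(result.stdout)
```

Something like `print(pdf_to_text("article.pdf"))` would then dump the text, where article.pdf is a placeholder file name.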
posted by benzenedream at 2:37 PM on March 2, 2010


Actually, Adobe Reader has a 'Save as Text' function in the File menu. If, as you say, you can copy and paste the text from the pdf, you should be able to use this command to extract the text contents.
posted by Busy Old Fool at 12:42 AM on March 3, 2010


What about using something like Zamzar? I've had pretty good results using this for conversions - if it's a heavily formatted document, you'll probably lose some of that formatting, but otherwise it should convert okay.
posted by eb98jdb at 8:20 AM on March 3, 2010


This free online service does a bang-up job: http://www.pdfonline.com/pdf2word/index.asp. I've been using it at work almost daily, copying text from our old PDF files and pasting it into new .doc files. It's quick and easy.

*Crappily, the site is under maintenance at the moment, but it was up yesterday and I imagine it will be available again very soon.
posted by kitcat at 10:31 AM on March 3, 2010


Also, you shouldn't have to remove line breaks one by one. Just copy the text over to Notepad or the Mac equivalent first. This will erase most formatting. Then copy again and paste into your new file.
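
If hard line breaks still survive the paste, a few lines of script can join them back into paragraphs. This is only a sketch: the function name is made up, it assumes paragraphs are separated by blank lines, and the de-hyphenation step is a crude heuristic that will also glue together genuinely hyphenated words that happen to fall at a line end.

```python
import re


def unwrap_paragraphs(text):
    """Join hard-wrapped lines; blank lines are kept as paragraph breaks."""
    # Normalize Windows and old Mac line endings first.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    paragraphs = re.split(r"\n\s*\n", text.strip())
    joined = []
    for para in paragraphs:
        # Heuristic: rejoin words split across lines ("conver-\nsion").
        para = re.sub(r"(\w)-\n(\w)", r"\1\2", para)
        # Collapse every remaining run of whitespace, newlines included.
        joined.append(" ".join(para.split()))
    return "\n\n".join(joined)
```

Paste the copied text into a file, read it in, run it through `unwrap_paragraphs`, and write the result back out.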
posted by kitcat at 10:36 AM on March 3, 2010


I've had success using Pandoc to convert LaTeX to RTF and HTML. It's a Haskell library, but you can try it online without installing anything. I'd recommend giving it a go, but depending on the commands you're using, YMMV.

It supports a lot of formats - it can read "markdown and (subsets of) reStructuredText, HTML, and LaTeX, and it can write markdown, reStructuredText, HTML, LaTeX, ConTeXt, PDF, RTF, DocBook XML, OpenDocument XML, ODT, GNU Texinfo, MediaWiki markup, groff man pages, and S5 HTML slide shows" - so it's quite a useful Swiss Army knife for converting between text formats.
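
For a local install, the conversion can be scripted along these lines. Treat it as a sketch: it assumes the pandoc executable is on your PATH, the helper names and the file names paper.tex and paper.rtf are placeholders, and how well your particular LaTeX commands survive is something you'd have to test.

```python
import shutil
import subprocess


def pandoc_command(src, dest, to_format="rtf"):
    """Build the pandoc command line for a LaTeX source file."""
    return [
        "pandoc",
        "--from=latex",
        f"--to={to_format}",
        "--standalone",  # emit a complete document, not a fragment
        f"--output={dest}",
        src,
    ]


def convert(src, dest, to_format="rtf"):
    """Run pandoc; raises if pandoc is missing or the conversion fails."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc not found on PATH")
    subprocess.run(pandoc_command(src, dest, to_format), check=True)
```

Calling `convert("paper.tex", "paper.rtf")` would write the RTF file, and passing `to_format="html"` would produce HTML instead.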
posted by James Scott-Brown at 11:50 AM on March 3, 2010 [1 favorite]


This thread is closed to new comments.