Extracting Text from PDF WITHOUT also extracting images

I'm looking for an easy, free way to extract text only from a very large PDF file and create a word document. [more inside]
posted by firemonkey on Jun 12, 2016 - 11 answers

"Nah, boss; I'm just a speedreader"

Before building my own... does this software exist? I need to search a collection of PDF & Word files for key phrases, and dump the surrounding lines (x-m to x+n characters/lines, where x is the found phrase) into text files. I occasionally need to search a few dozen files for a few dozen data items, which usually have some identifying text nearby. This needs to be automated. Big bonus if it implements OCR, but that's not essential. Freeware, or cheapware, obviously is best. Windows-based is preferable, but I can do Linux.
posted by IAmBroom on Feb 12, 2014 - 9 answers
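
A minimal sketch of the "surrounding text" part of that request, in Python. It assumes the text has already been pulled out of the PDF or Word files by some other tool, and it works in characters rather than lines. The function name `find_contexts` and the `before`/`after` parameters are made up for illustration (they play the role of m and n in the question); this is not an existing program.

```python
import re


def find_contexts(text, phrase, before=40, after=40):
    """Return one snippet per case-insensitive match of `phrase` in `text`.

    Each snippet spans from `before` characters ahead of the match to
    `after` characters past it, clamped to the bounds of the text.
    """
    snippets = []
    # re.escape makes the phrase match literally, even if it contains
    # characters that are special in regular expressions.
    for match in re.finditer(re.escape(phrase), text, flags=re.IGNORECASE):
        start = max(0, match.start() - before)
        end = min(len(text), match.end() + after)
        snippets.append(text[start:end])
    return snippets


if __name__ == "__main__":
    sample = "abc Invoice Total: 42 USD xyz"
    for snippet in find_contexts(sample, "invoice total", before=4, after=4):
        print(snippet)
```

Looping this over a few dozen files and a few dozen phrases, and writing each snippet list to its own text file, would cover the automation side. OCR is left out entirely.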

PDF copy/paste repeats text - help?

When I copy/paste text from this PDF, the copied text is repeated and sometimes includes additional text. Why is this and how can I fix it? [more inside]
posted by fussbudget on Aug 10, 2012 - 8 answers

Convert text with whitespace to csv

perl or sed or awk help for newbie [more inside]
posted by v-tach on Mar 20, 2012 - 26 answers

Readability for PDFs?

I have to read a lot of academic papers. They are usually formatted into 2 columns per page. This is fine if I print them out, but when I read them on a computer screen, it is difficult/annoying for me to scroll up and down - I lose my place easily (I recognize this may be unusual). Is there software out there that will strip the text from a pdf into one column (and maybe put all the figures at the end or somewhere)? Like the Readability extension for text on the web, but for pdfs. Maybe a plugin for Adobe Acrobat? Linux or Windows, please. Thanks.
posted by bluefly on Oct 19, 2011 - 6 answers

When ampersands attack.

Help me extract text from a PDF of a Powerpoint presentation! [more inside]
posted by lovedbymarylane on Sep 18, 2011 - 8 answers

Should I have any longevity concerns about .epub files?

Should I have any longevity concerns about .epub files? [more inside]
posted by Joe Beese on Jan 20, 2011 - 4 answers

What are good apps for converting PDFs and other text files to non-PDF files?

What are good apps for converting PDFs and other text files to non-PDF text files? [more inside]
posted by GlassHeart on Sep 6, 2010 - 12 answers

Extract Text and Images From a PDF

I'd like to extract the text and images from a multi-page pdf to use on the web. [more inside]
posted by backwards guitar on Aug 28, 2008 - 2 answers

What are the alternatives to pandoc?

What are the alternatives to pandoc? I'm looking for tools that will allow me to maintain a large document in a simple plain text format such as markdown and compile it to PDF and HTML. [more inside]
posted by caek on Apr 30, 2008 - 6 answers

PDF text import into OneNote

How can I import PDF files as editable text in Microsoft OneNote? [more inside]
posted by Pantalaimon on Apr 8, 2008 - 4 answers

Sort, OCR, export text from PDFs

I am in need of a server-side Linux or Unix-based software solution that will sort uploaded PDF files that can be PDF-native (that is, created in such a way that the text in the PDF is freely copyable), PDFs with embedded text over images (usually the result of a previous OCR job), and PDF-scanned, which are PDFs containing no text, only scanned images. The PDF-native files and PDFs with embedded text it will extract text from, the PDF-scanned files it will then OCR and export that text. [more inside]
posted by Mo Nickels on Mar 17, 2008 - 4 answers
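
The three-way sort described in that question could be sketched roughly as below, in Python. This is only the decision logic: it assumes some PDF library has already reported, for each page, how many characters of text it holds and whether it carries a full-page image. The names `classify_pdf` and `next_step` and the input format are hypothetical, and the full-page-image test is a heuristic rather than a reliable detector.

```python
def classify_pdf(pages):
    """Classify a PDF from per-page facts.

    `pages` is a list of (text_char_count, has_full_page_image) tuples,
    one per page, gathered beforehand by whatever PDF library is in use.
    """
    has_text = any(chars > 0 for chars, _ in pages)
    has_page_image = any(image for _, image in pages)
    if not has_text:
        # No text layer at all: treat as a scan that needs OCR.
        return "scanned"
    if has_page_image:
        # Text layer sitting over page images, typical of earlier OCR.
        return "embedded-text"
    return "native"


def next_step(kind):
    """Map a category to the action the question asks for."""
    return "ocr" if kind == "scanned" else "extract-text"


if __name__ == "__main__":
    for pages in ([(1200, False)], [(900, True)], [(0, True)]):
        kind = classify_pdf(pages)
        print(kind, "->", next_step(kind))
```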

How to automatically extract graphical content from PDFs?

Are there any software packages or toolkits (preferably open source) available that allow me to automatically extract graphical content (such as pictures, diagrams, graphs, etc.) from batches of PDFs? [more inside]
posted by elbaso on May 9, 2007 - 4 answers

How to convert a Dynatext book on CD-ROM to something else?

How can I convert a Dynatext book (with SGML exporting capabilities) to something that can be viewed on a web server? [more inside]
posted by vincentm on Mar 21, 2007 - 4 answers

Seeking real Java software

How can I find real Java software for a Motorola SLVR L7? I've only ever seen stupid games. I need a PDF reader, a text editor, and a French-English dictionary, or else dictd and a dict client. [more inside]
posted by jeffburdges on Feb 20, 2007 - 5 answers

How can I improve readability of a low-quality .pdf file?

I'm attempting to transcribe some scanned documents. The quality of the .pdf files is low. Is there a way to play around with the images in order to help me make out the mangled words? [more inside]
posted by moira on Jan 15, 2007 - 7 answers

pdf form text field lines limit

I want to limit the number of lines that users can input in a printable PDF form text field so that there is no overflow, scroll bars, or auto-resizing of the text. How can I do that? [more inside]
posted by FidelDonson on Jul 19, 2006 - 2 answers

How can I convert a scanned pdf to searchable text?

I need to convert a scanned PDF to searchable text, without printing it out and scanning it back in using OCR. Also, I'd like a cheap or free solution, since I'm not likely to use it often, if ever again.
posted by nomis on May 6, 2004 - 17 answers

