When ampersands attack.
September 18, 2011 9:10 AM

Help me extract text from a PDF of a PowerPoint presentation!

One of my lecturers posts lecture notes online in PDF format, which she has exported from PowerPoint. Each page consists of 4 slides. I would like to be able to copy the text from these slides for my own notes, but when I do so, I end up with something like:

Australian(Film(and(Radio ( ( (
• Emergence&of&Variety&theatre&from&the&1850s&
• The&Tivoli&Circuit:&Touring&shows,&Vaudeville,&Comic&
• Major&Comic&talent&

I've tried just copying the whole thing into Word and running 'search and replace' for various characters, but it takes a helluva long time, as does getting rid of the odd sentence wrapping and things like 'Tradi6on'. Things get even weirder when there are multiple columns to a slide.

Is there any way I can salvage this text? The slides are saved in nameofslide.pptx.pdf format. Any suggestions (short of asking the lecturer to change her file formatting) would be welcome!
posted by lovedbymarylane to Computers & Internet (8 answers total) 2 users marked this as a favorite
Try running it through the file converter at Zamzar.com.
posted by HLD at 9:12 AM on September 18, 2011

Response by poster: Thanks for the suggestion! Unfortunately, converting it to a .doc just put each word into its own column and replaced the ampersands with underscores. Back to square one.
posted by lovedbymarylane at 9:43 AM on September 18, 2011

What is the URL for the notes?

I convert PDFs all the time, but can't really comment on your issue without seeing the source file. There are a number of methods, but which one works depends on how the original was made.
posted by lampshade at 10:42 AM on September 18, 2011

If you know anyone with Acrobat Pro, they should be able to copy/paste the text from there, provided the text is actually text and not part of a single large image.
posted by Thorzdad at 11:29 AM on September 18, 2011

Save the slides as images (rasterize them), then run them through OCR? This will get the text you can see on-screen only, no hidden stuff.
posted by misterbrandt at 12:53 PM on September 18, 2011

Try sending it to a Gmail account and viewing the attachment as HTML? Or else Google Docs has a converter (upload it to Docs, and make a copy in Google Docs format).
posted by Boobus Tuber at 2:55 PM on September 18, 2011

It might be easier to just ask your instructor for her original PowerPoint files, or for her to export the text in the presentation to a doc. Simple to do if she's willing.
posted by SuperSquirrel at 4:37 PM on September 18, 2011

Response by poster: Lampshade: thanks for the offer, but the files are only accessible with a current student's login information.

Thorzdad: unfortunately I don't!

Boobus Tuber: Google Docs got rid of the ampersands, but still left one word to a line and the weird 'tradi6on' thing, which is not ideal.

misterbrandt: I'm guessing even if OCR works, I'd have to input the images slide-by-slide because of the column issue, which is just as time-consuming as search-and-replace.

Thanks guys - I was hoping there was some quick fix I didn't know about, but I guess I'll have to speak to my lecturer about it!
posted by lovedbymarylane at 12:30 AM on September 19, 2011
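
For anyone landing here with the same problem: the manual search-and-replace described in the question can be scripted. This is only a rough sketch, based on the sample pasted above. It assumes that '&' and '(' are both stand-ins for spaces in this particular export, and that a '6' between two letters is a dropped "ti" (as in 'Tradi6on'). The function name clean_slide_text is made up for illustration. It will also eat any real ampersands or opening parentheses, and it does nothing about odd line wrapping or multi-column slides.

```python
import re


def clean_slide_text(raw: str) -> str:
    """Undo the character substitutions seen in the pasted slide text."""
    cleaned_lines = []
    for line in raw.splitlines():
        # Assumption: '&' and '(' both stand in for spaces in this export.
        line = re.sub(r"[&(]", " ", line)
        # Assumption: a '6' between two letters is a dropped "ti".
        line = re.sub(r"(?<=[A-Za-z])6(?=[A-Za-z])", "ti", line)
        # Collapse runs of spaces and trim both ends of the line.
        line = re.sub(r" {2,}", " ", line).strip()
        cleaned_lines.append(line)
    return "\n".join(cleaned_lines)


if __name__ == "__main__":
    sample = (
        "Australian(Film(and(Radio ( ( (\n"
        "• Emergence&of&Variety&theatre&from&the&1850s&\n"
        "• Major&Comic&talent&\n"
        "Tradi6on"
    )
    print(clean_slide_text(sample))
```

Paste the copied text into a file or string, run it through the function once, and check a few slides by eye before trusting the rest, since other PDFs may use different stand-in characters.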
