Problems extracting plain text from a pdf.
June 18, 2004 5:59 AM

Problems getting the text from a pdf . . . (mo' inside)

I'm trying to get the text from this pdf: (http://www.9-11commission.gov/hearings/hearing12/staff_statement_15.pdf)

When you save it as a .txt or .rtf or copy and paste all the text into a Word doc or something, it comes out as complete gobbledygook. Could it be like encrypted or security-protected or something? If yes, why? Thanks a stack!

posted by lazywhinerkid to Computers & Internet (10 answers total)

All the PDF tools I ran on it claim it's not encrypted, but I'm blowed if I can get anything to extract the text from it either. Very odd.
posted by reklaw at 6:53 AM on June 18, 2004

curious. i downloaded it to a unix machine and ran "pdftotext" on it, and also got rubbish out the end:

2YHUYLHZ RI WKH (QHP\ 6 WDII 6WDWHPHQW 1R  0HPEHUV RI WKH &RPPLVVLRQ ZLWK \RXU K...

since conversion from pdf to text is not easy (afaik, based on postscript, a pdf file is a program that describes how to make an image, so getting text from that either involves generating the image and using character recognition or, more likely, guessing how the text is encoded in the program) i would guess that the problem is not encryption (what's the point of that anyway - you can read it on the screen?) but simply that the way the pdf was generated is incompatible with the way it's converted to text. in other words it's likely just a technology mismatch.
posted by andrew cooke at 6:54 AM on June 18, 2004

Very strange. You could always print it out and scan it back in, I guess.
posted by transient at 6:58 AM on June 18, 2004

from the pdftotext man page:

BUGS
       Some  PDF  files contain fonts whose encodings have been mangled beyond
       recognition.  There is no way (short of OCR) to extract text from these
       files.

posted by andrew cooke at 7:01 AM on June 18, 2004

This online PDF parser understands it - but renders it as an image

http://view.samurajdata.se/
posted by magullo at 7:06 AM on June 18, 2004

the problem isn't displaying the contents, but getting them as plain text.
posted by andrew cooke at 7:18 AM on June 18, 2004

Statements 14 & 16 copy & paste just fine so I think it's corrupted...probably a font thing. Happens occasionally with PDFs.

Unless you e-mail the office & get the statement re-PDFed & re-posted the only way is OCRing which should be pretty easy if you have access to the software & scanner.

[On preview] Ah...as andrew cooke says...
posted by i_cola at 7:30 AM on June 18, 2004

Thanks to some guy's blog, here's the plain text. He has plain text from other 9/11 commission stuff, too.
posted by zsazsa at 7:43 AM on June 18, 2004

Wow -- cheers everyone!
posted by lazywhinerkid at 10:38 AM on June 18, 2004

Let's see.
2YHUYLHZ RI WKH (QHP\ 6WDII 6WDWHPHQW 1R
Overview of the Enemy Staff Statement No
Hmmm...

pdftotext staff_statement_15.pdf - | tr 'D-]$-=' 'a-zA-Z'

gets most of it, but it's missing numerals and most punctuation, and still has some junk characters in it. Maybe better than OCR, though. If you didn't have a plaintext version already.
posted by undecided at 9:09 PM on June 18, 2004
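
In the sample quoted in this thread, each garbled letter sits exactly 29 code points below the letter it stands for ('2' for 'O', 'Y' for 'v', and so on). Under the same offset, digits and most punctuation would fall below the printable ASCII range, which may be why they go missing. Here is a minimal Python sketch of that idea, assuming the offset holds for the whole file (it is inferred only from the short sample here). The names infer_shift and decode are hypothetical helpers for illustration, not part of pdftotext.

```python
def infer_shift(garbled: str, plain: str) -> int:
    """Return the one code-point offset that turns `garbled` into `plain`.

    Raises ValueError if the pair is not explained by a single constant offset.
    """
    if len(garbled) != len(plain):
        raise ValueError("strings differ in length")
    shifts = set()
    for g, p in zip(garbled, plain):
        if g.isspace() and p.isspace():
            continue  # whitespace appears unshifted in the sample, so skip it
        shifts.add(ord(p) - ord(g))
    if len(shifts) != 1:
        raise ValueError("no single constant shift fits this pair")
    return shifts.pop()


def decode(text: str, shift: int) -> str:
    """Shift each character by `shift`, keeping the result only if it is an ASCII letter.

    Anything that would not land on a letter (spaces, stray junk) is left as-is,
    so numerals and punctuation are not recovered by this sketch.
    """
    out = []
    for ch in text:
        code = ord(ch) + shift
        if 0 <= code < 0x110000:
            candidate = chr(code)
            if candidate.isascii() and candidate.isalpha():
                out.append(candidate)
                continue
        out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    shift = infer_shift("2YHUYLHZ", "Overview")
    print(shift)
    print(decode("2YHUYLHZ RI WKH (QHP\\ 6WDII 6WDWHPHQW 1R", shift))
```

Usage would be to pipe the pdftotext output through decode with the inferred offset. Like the tr one-liner, this only recovers letters, so it is a rough fallback rather than a replacement for a clean plain-text copy.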


Ask MetaFilter
