Problems extracting plain text from a pdf.
June 18, 2004 5:59 AM

Problems getting the text from a pdf . . . (mo' inside)

I'm trying to get the text from this pdf: (http://www.9-11commission.gov/hearings/hearing12/staff_statement_15.pdf)

When you save it as a .txt or .rtf or copy and paste all the text into a Word doc or something, it comes out as complete gobbledygook. Could it be like encrypted or security-protected or something? If yes, why? Thanks a stack!
posted by lazywhinerkid to Computers & Internet (10 answers total)
 
All the PDF tools I ran on it claim it's not encrypted, but I'm blowed if I can get anything to extract the text from it either. Very odd.
posted by reklaw at 6:53 AM on June 18, 2004


curious. i downloaded it to a unix machine and ran "pdftotext" on it, and also got rubbish out the end:
2YHUYLHZ RI WKH (QHP\ 6 WDII 6WDWHPHQW 1R  0HPEHUV RI WKH &RPPLVVLRQ ZLWK \RXU K...
since conversion from pdf to text is not easy (afaik, based on postscript, a pdf file is a program that describes how to make an image, so getting text from that either involves generating the image and using character recognition or, more likely, guessing how the text is encoded in the program) i would guess that the problem is not encryption (what's the point of that anyway - you can read it on the screen?) but simply that the way the pdf was generated is incompatible with the way it's converted to text. in other words it's likely just a technology mismatch.
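
to make that kind of mismatch concrete, here's a toy sketch in python. it is not real pdf parsing, and i'm not claiming this is how this particular file was built: the helper names and the "codes start at 65" choice are made up for illustration. the idea is that a generator can give each character its own private code inside an embedded font, so the viewer draws the right glyphs while an extractor that assumes ordinary ascii codes reads nonsense.

```python
# Toy model of a re-encoded embedded font. All names here are hypothetical.

def build_subset_font(text):
    """Give each distinct character the next free code, in order of first use."""
    table = {}
    for ch in text:
        if ch not in table:
            table[ch] = len(table) + 65  # start at 65 ('A') so the naive output is printable
    return table

def encode(text, table):
    """The codes the generator would write into the page content."""
    return [table[ch] for ch in text]

def render(codes, table):
    """A viewer resolves codes through the font's own table, so the page looks right."""
    reverse = {code: ch for ch, code in table.items()}
    return "".join(reverse[c] for c in codes)

def naive_extract(codes):
    """An extractor that treats the codes as plain ASCII gets gibberish."""
    return "".join(chr(c) for c in codes)

text = "staff statement"
table = build_subset_font(text)
codes = encode(text, table)
print(render(codes, table))   # staff statement
print(naive_extract(codes))   # ABCDDEABCBFGFHB
```

without the font's table (or something equivalent stored in the file), nothing in the codes themselves tells the extractor what the letters were.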
posted by andrew cooke at 6:54 AM on June 18, 2004


Very strange. You could always print it out and scan it back in, I guess.
posted by transient at 6:58 AM on June 18, 2004


from the pdftotext man page:
BUGS
Some PDF files contain fonts whose encodings have been mangled beyond
recognition. There is no way (short of OCR) to extract text from these
files.

posted by andrew cooke at 7:01 AM on June 18, 2004


This online PDF parser understands it - but renders it as an image

http://view.samurajdata.se/
posted by magullo at 7:06 AM on June 18, 2004


the problem isn't displaying the contents, but getting them as plain text.
posted by andrew cooke at 7:18 AM on June 18, 2004


Statements 14 & 16 copy & paste just fine so I think it's corrupted...probably a font thing. Happens occasionally with PDFs.

Unless you e-mail the office & get the statement re-PDFed & re-posted the only way is OCRing which should be pretty easy if you have access to the software & scanner.

[On preview] Ah...as andrew cooke says...
posted by i_cola at 7:30 AM on June 18, 2004


Thanks to some guy's blog, here's the plain text. He has plain text from other 9/11 commission stuff, too.
posted by zsazsa at 7:43 AM on June 18, 2004


Response by poster: Wow -- cheers everyone!
posted by lazywhinerkid at 10:38 AM on June 18, 2004


Let's see.

2YHUYLHZ RI WKH (QHP\ 6WDII 6WDWHPHQW 1R
Overview of the Enemy Staff Statement No

Hmmm...

pdftotext staff_statement_15.pdf - | tr 'D-]$-=' 'a-zA-Z'

gets most of it, but it's missing numerals and most punctuation, and still has some junk characters in it. Maybe better than OCR, though. If you didn't have a plaintext version already.
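
The same idea as a rough Python sketch, in case tr isn't handy. Going by the sample above, every garbled character sits 29 code points below the real one ('2' is 'O', 'Y' is 'v'), and spaces come through untouched. The function name and the "?" placeholder are just my choices, and treating punctuation as following the same offset is a guess from that one line, not something I've checked against the whole file.

```python
SHIFT = 29  # inferred from the sample: ord('O') - ord('2') == 29

def unshift(garbled, shift=SHIFT):
    """Undo the fixed offset on every non-whitespace character."""
    out = []
    for ch in garbled:
        if ch.isspace():
            out.append(ch)  # whitespace appears unshifted in the pdftotext output
            continue
        code = ord(ch) + shift
        # Keep results that land in printable ASCII; mark anything else with '?'.
        out.append(chr(code) if 0x20 < code < 0x7F else "?")
    return "".join(out)

print(unshift(r"2YHUYLHZ RI WKH (QHP\ 6WDII 6WDWHPHQW 1R"))
# Overview of the Enemy Staff Statement No
```

It can't bring back anything that never made it into the pdftotext output in the first place, so the numerals are probably still gone.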
posted by undecided at 9:09 PM on June 18, 2004

