Problems extracting plain text from a pdf.
June 18, 2004 5:59 AM
Problems getting the text from a pdf . . . (mo' inside)
I'm trying to get the text from this pdf: (http://www.9-11commission.gov/hearings/hearing12/staff_statement_15.pdf)
When you save it as a .txt or .rtf or copy and paste all the text into a Word doc or something, it comes out as complete gobbledygook. Could it be like encrypted or security-protected or something? If yes, why? Thanks a stack!
curious. i downloaded it to a unix machine and ran "pdftotext" on it, and also got rubbish out the end:
2YHUYLHZ RI WKH (QHP\ 6 WDII 6WDWHPHQW 1R 0HPEHUV RI WKH &RPPLVVLRQ ZLWK \RXU K...
since conversion from pdf to text is not easy (afaik, based on postscript, a pdf file is a program that describes how to make an image, so getting text from that either involves generating the image and using character recognition or, more likely, guessing how the text is encoded in the program) i would guess that the problem is not encryption (what's the point of that anyway - you can read it on the screen?) but simply that the way the pdf was generated is incompatible with the way it's converted to text. in other words it's likely just a technology mismatch.
posted by andrew cooke at 6:54 AM on June 18, 2004
Very strange. You could always print it out and scan it back in, I guess.
posted by transient at 6:58 AM on June 18, 2004
from the pdftotext man page:
BUGS
Some PDF files contain fonts whose encodings have been mangled beyond
recognition. There is no way (short of OCR) to extract text from these
files.
posted by andrew cooke at 7:01 AM on June 18, 2004
This online PDF parser understands it - but renders it as an image
http://view.samurajdata.se/
posted by magullo at 7:06 AM on June 18, 2004
the problem isn't displaying the contents, but getting them as plain text.
posted by andrew cooke at 7:18 AM on June 18, 2004
Statements 14 & 16 copy & paste just fine so I think it's corrupted...probably a font thing. Happens occasionally with PDFs.
Unless you e-mail the office & get the statement re-PDFed & re-posted the only way is OCRing which should be pretty easy if you have access to the software & scanner.
[On preview] Ah...as andrew cooke says...
posted by i_cola at 7:30 AM on June 18, 2004
Thanks to some guy's blog, here's the plain text. He has plain text from other 9/11 commission stuff, too.
posted by zsazsa at 7:43 AM on June 18, 2004
Let's see.
2YHUYLHZ RI WKH (QHP\ 6WDII 6WDWHPHQW 1R
Overview of the Enemy Staff Statement No
Hmmm...
pdftotext staff_statement_15.pdf - | tr 'D-]$-=' 'a-zA-Z'
gets most of it, but it's missing numerals and most punctuation, and still has some junk characters in it. Maybe better than OCR, though. If you didn't have a plaintext version already.
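For anyone without tr to hand, the same substitution can be sketched in Python. This is only an illustration, assuming every letter in this particular file is shifted down by a constant 29 code points (inferred from the sample above, where '2' decodes to 'O' and 'Y' to 'v'). The names OFFSET and decode are made up for the example and are not part of pdftotext or any library.

```python
# Assumption: the mangled font maps each letter 29 code points too low,
# e.g. ord('O') - ord('2') == 29 and ord('v') - ord('Y') == 29.
OFFSET = 29


def decode(garbled: str, offset: int = OFFSET) -> str:
    """Shift each character up by `offset`, keeping it only if the result is an ASCII letter."""
    out = []
    for ch in garbled:
        code = ord(ch) + offset
        if code < 128 and chr(code).isalpha():
            out.append(chr(code))
        else:
            # Spaces and anything that does not land on a letter are left untouched.
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    print(decode("2YHUYLHZ RI WKH (QHP\\ 6WDII 6WDWHPHQW 1R"))
```

Like the tr pipeline, this only recovers letters, so it would need to be fed the pdftotext output and would still leave numerals and punctuation unrecovered.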
posted by undecided at 9:09 PM on June 18, 2004
This thread is closed to new comments.
posted by reklaw at 6:53 AM on June 18, 2004