Convert PDF With Greek Characters To Plain Text File?
June 6, 2004 8:33 PM

Trying to convert a PDF file to plain text, but the result is hieroglyphics.

The catch is, the PDF file (which is not secured/password-protected, in case anyone wonders) contains Greek text. Now, that shouldn't be a catch, because I've got the most common Greek fonts installed.

But.

I've tried Adobe's online PDF-to-HTML tool, I've tried Acrobat's (v5.0) "Save as RTF" feature, and I've selected the text manually in Acrobat and pasted it into Word, all with the same ugly result: hieroglyphics appearing instead of actual text (screenshot).

This screenshot says that a font called "TimesNewRomanPSMT" is used (in that same screenshot, you can see some of the text I'm trying to copy). I've Googled it, and all three sites that were supposed to carry a Greek version of it are offline. (In case someone asks, I do have "Times New Roman" --and as I said, all the popular fonts in their Greek encodings-- installed, but it doesn't do the trick unfortunately.)

First of all, is this a font issue, as I suspect, or not? And most importantly, can someone help me solve this problem? I'm stuck here.

Thanks, people.
posted by kchristidis to Computers & Internet (11 answers total)
 
That PSMT font is part of Acrobat's resource fonts (it's in my Acrobat Reader's Contents folder, under MacOS/Resource/Font). Are you sure you don't have it? I think that's not the problem. I can email it to you, though, for you to see.
posted by amberglow at 8:44 PM on June 6, 2004


Hmmm... what does PDF to HTML produce? Acrobat should know more about its fonts than anything else, and HTML should be a good format to start from. What's the encoding of the HTML result?
posted by costas at 9:05 PM on June 6, 2004


Response by poster: Amberglow, my "C:\Program Files\Adobe\Acrobat 5.0\Resource\Font" directory doesn't seem to have that font. All it's got is some awkwardly named fonts (full of underscores, etc.), so it may be hidden somewhere there, but I can't tell. If it's not a hassle for you, my e-mail is in my profile page.

Costas (hello there, compatriot), that's what I had in mind too when I tried the PDF-to-HTML service. The results were not encouraging, though: hieroglyphics, and no "encoding" attribute in the resulting HTML file.
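
A missing charset declaration is something that can be checked mechanically. What follows is a minimal Python sketch, not a known fix for this file: it assumes the converted HTML holds the Greek text as raw bytes in a legacy single-byte code page (Windows-1253 or ISO 8859-7), which nothing in this thread confirms. The function names are hypothetical. If the PDF's embedded font uses its own private glyph mapping, re-decoding of this kind will not recover the text.

```python
import re

# Legacy single-byte Greek code pages to try (an assumption, see above).
GREEK_CODECS = ("cp1253", "iso8859_7")


def declared_charset(html_bytes):
    """Return the charset named in the HTML (e.g. in a <meta> tag), or None."""
    match = re.search(
        rb'charset\s*=\s*["\']?([A-Za-z0-9_\-]+)', html_bytes, re.IGNORECASE
    )
    return match.group(1).decode("ascii") if match else None


def greek_ratio(text):
    """Fraction of alphabetic characters in the Greek and Coptic Unicode block."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    greek = sum(1 for ch in letters if "\u0370" <= ch <= "\u03ff")
    return greek / len(letters)


def best_greek_decoding(raw):
    """Decode raw bytes with each candidate codec; keep the most Greek-looking one.

    Returns (codec_name, decoded_text, ratio), or (None, "", 0.0) if no
    candidate produced any Greek letters.
    """
    best = (None, "", 0.0)
    for codec in GREEK_CODECS:
        try:
            text = raw.decode(codec)
        except UnicodeDecodeError:
            continue  # bytes not valid in this code page
        ratio = greek_ratio(text)
        if ratio > best[2]:
            best = (codec, text, ratio)
    return best
```

To try it, read the converted file in binary mode (for example `open("converted.html", "rb").read()`, where the file name is a placeholder) and pass the bytes to both functions. A ratio near zero for every candidate would suggest the problem lies in the PDF's font mapping rather than in the code page.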

Here is the resulting HTML file (433kB), by the way, and here's the original PDF file (880kB). (I want to remove the background image from the original PDF file, because it makes it harder to read when printed, that's what the fuss is all about.)

[Also, because my host's going to kill me: if you're gonna try the online conversion tool, I'd appreciate it if you could use the "e-mail" option (and not point to the PDF file on my site).]
posted by kchristidis at 10:12 PM on June 6, 2004


Best answer: If all you want is the image removed, use the touch up object tool. It'll take you some time as you'll have to do it on each page, but you can remove the bg image that way as well. Touch up object is the black arrow on the editing menu.
posted by Salmonberry at 11:02 PM on June 6, 2004


I'm sending them now...maybe it'll help.
posted by amberglow at 5:16 AM on June 7, 2004


Response by poster: Amberglow, I received your e-mail; unfortunately, my PC cannot recognize the "*.hqx" format, it's obviously a Mac-only thing. Thank you very much for your time though, I appreciate it.

Salmonberry, that works! Thanks for the tip, I went ahead and removed the images from each page (boy, was it tiring though; the image consisted of ~20 tiny images, so I had to resort to cutting the text above the image, removing the image, and then pasting the text back to its original place again). So, the job is done; thank you very much, Salmonberry!

The original question still goes though (only now the answer to it is obviously not as crucial as before): why do I get hieroglyphics when copying and pasting the text to Word?

I may need to use this function in other cases in the future, so it'll be good to know.
posted by kchristidis at 12:13 PM on June 7, 2004


Best answer: kchris, I did some playing with the med.pdf. What you could do (if you need to in future) is set your view so that you select the bg image without touching the text. Hold down the shift key as you select and it'll just join everything together, so you can remove it all with one big delete.

I don't know what would cause the font problem. You might want to try asking over at www.planetpdf.com and see if they can help.
posted by Salmonberry at 12:29 PM on June 7, 2004


When pasting text copied from a PDF into Word, instead of using the shortcut [ctrl] V or regular "paste" from the Edit toolbar, select "Paste Special" and choose one of the several options they give you. If it doesn't work, undo and choose another of the Paste Special options.

usually selecting "unformatted text" works for me.
posted by nyoki at 12:49 PM on June 7, 2004


Response by poster: Nyoki, I just tried all three of the options available [Formatted Text (RTF), Unformatted Text, Unformatted Unicode Text], but none worked (they all gave me the same result as in that screenshot I had linked to).

Salmonberry, wow, that sounds like a great way to do it, but I can't get it to work (yet); "set your view so that you select the bg image without touching the text" - how do you do that exactly?
posted by kchristidis at 1:51 PM on June 7, 2004


Best answer: I just mean magnify it so that you work with smaller surface areas. I viewed the document at 200% and then went to work.

So you first select the top.
Carefully select in between the type and words (if you select type, you get the type box instead), and hold down shift while you do this.
Then, delete, and voila!
posted by Salmonberry at 7:16 PM on June 7, 2004


Response by poster: OK, thanks for the tip (and the screenshots)! I appreciate all of your help on this, Salmonberry.
posted by kchristidis at 12:05 PM on June 8, 2004

