PDF character encoding problem
September 21, 2010 8:34 AM
I have a series of text-based PDF documents that I need to export to Word, but the character encoding is messed up whether I use Reader's save to text or simply copy-and-paste. Help!
There are 5 files total, in Portuguese, with about 20,000 words of text that I need to translate. The translation process is much quicker and cleaner if I can export the original text to a Word file and use my usual translation software tools. However, whether I use Adobe Reader's Save to Text option or simply copy and paste text into a Word file, the character encoding comes out completely garbled.
Again, these are Acrobat-created, primarily text documents (with a handful of embedded images) rather than scanned image PDFs.
When opening the exported .txt files in Word, I have tried every possible encoding in MS Word's file conversion library (UTF-8, etc.), to no avail.
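For anyone who wants to repeat that encoding check outside Word, here is a minimal Python sketch. It decodes the same exported bytes with a handful of candidate encodings and ranks the results by how many lowercase Portuguese accented letters appear. The function name `rank_encodings`, the candidate list, and the scoring rule are all assumptions made for illustration, not anything Reader or Word does. If no candidate yields readable Portuguese, the problem probably lies in the extraction step rather than in the encoding choice.

```python
# Hypothetical helper: try several encodings on the same bytes and rank them.
# The candidate list and the scoring heuristic are illustrative assumptions.
CANDIDATES = ["utf-8", "cp1252", "latin-1", "mac_roman", "utf-16"]

# Lowercase only: uppercase "Ã" is typical of UTF-8 text misread as cp1252,
# so counting it would reward the wrong decoding.
PORTUGUESE_MARKS = set("áàâãéêíóôõúç")


def rank_encodings(raw: bytes, candidates=CANDIDATES):
    """Return (score, encoding, text) tuples, best score first."""
    results = []
    for name in candidates:
        try:
            text = raw.decode(name)
        except UnicodeDecodeError:
            continue  # these bytes are not valid in this encoding
        score = sum(ch in PORTUGUESE_MARKS for ch in text)
        results.append((score, name, text))
    # sort() is stable, so ties keep the order of the candidate list
    results.sort(key=lambda r: r[0], reverse=True)
    return results


# Usage (the file name is a placeholder):
#     with open("export.txt", "rb") as f:
#         for score, name, text in rank_encodings(f.read()):
#             print(score, name, text[:60])
```

The score is only a crude hint. The decoded previews still need to be read by eye.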
There are no security restrictions on the PDF files against copying or page extraction.
Any suggestions or ideas will be greatly appreciated!
posted by drlith to computers & internet (9 answers total)
Do you select 'Format: Text (plain)' and then go to 'Settings...' and choose an encoding there?
posted by mazola at 9:03 AM on September 21, 2010