PDF character encoding problem
September 21, 2010 8:34 AM

I have a series of text-based PDF documents that I need to export to Word, but the character encoding is messed up whether I use Reader's save to text or simply copy-and-paste. Help!

There are 5 files total, in Portuguese, about 20,000 words' worth of text, that I need to translate. The translation process is much quicker and cleaner if I can export the original text to a Word file and use my usual translation software tools. However, whether I use Adobe Reader's Save as Text option or simply copy and paste text into a Word file, the character encoding comes out completely garbled.

Again, these are Acrobat-created, primarily text documents (with a handful of embedded images) rather than scanned image PDFs.

When opening the exported .txt files in Word, I have tried every encoding offered in MS Word's file conversion dialog (UTF-8, etc.), to no avail.
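
For anyone who wants to run the same trial-and-error outside Word, here is a minimal Python sketch. It decodes the raw bytes of an exported file with a few candidate encodings and ranks the results by how many Portuguese accented letters appear. The function name, the candidate list, and the scoring rule are all assumptions made for illustration, not anything Reader or Word provides. If the garbling comes from the PDF's own font mapping rather than from the text file's encoding, no candidate will look right, which is itself a useful diagnosis.

```python
# Hypothetical helper for illustration; candidate list and scoring are assumptions.
CANDIDATES = ["utf-8", "cp1252", "mac_roman"]
PORTUGUESE_MARKS = set("ãõáéíóúâêôàçÃÕÁÉÍÓÚÂÊÔÀÇ")


def try_encodings(raw, candidates=CANDIDATES):
    """Decode raw bytes with each candidate and rank by Portuguese accent count."""
    results = []
    for name in candidates:
        try:
            text = raw.decode(name)
        except UnicodeDecodeError:
            # These bytes are not valid in this encoding; skip it.
            continue
        score = sum(ch in PORTUGUESE_MARKS for ch in text)
        results.append((score, name, text))
    # Highest score first; ties keep the order of the candidate list.
    results.sort(key=lambda item: item[0], reverse=True)
    return results


if __name__ == "__main__":
    sample = "tradução".encode("cp1252")  # stand-in for bytes read from a .txt file
    for score, name, text in try_encodings(sample):
        print(score, name, repr(text))
```

With a real file, the bytes would come from something like `open("export.txt", "rb").read()`, where `export.txt` is a placeholder name.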

There are no security restrictions on the PDF files against copying or page extraction.

Any suggestions or ideas will be greatly appreciated!

posted by drlith to Computers & Internet (9 answers total)

How do you export?

Do you select 'Format: Text (plain)' and then go to 'Settings...' and choose an encoding there?
posted by mazola at 9:03 AM on September 21, 2010

Character encoding in PDF can be completely arbitrary, so it depends entirely on what was used to prepare your documents. A short sample would be helpful.
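
To illustrate what "arbitrary" means here, the following is a toy Python model, not a real PDF parser. It assumes a font whose character codes were assigned by the authoring tool (the `toy_font_encoding` table is invented for this example). An extractor that lacks the font's mapping and guesses a standard encoding gets garbage, while one that has the mapping recovers the text.

```python
# Toy model only: the codes below are hypothetical, and real PDFs keep this
# kind of mapping inside font and encoding dictionaries.
toy_font_encoding = {1: "t", 2: "a", 3: "ç", 4: "ã", 5: "o"}


def encode_with_font(text, encoding):
    """Turn text into the character codes the toy font would store."""
    reverse = {ch: code for code, ch in encoding.items()}
    return bytes(reverse[ch] for ch in text)


def naive_extract(codes):
    """Guess that the codes are Latin-1, as a mapping-unaware tool might."""
    return codes.decode("latin-1")


def extract_with_mapping(codes, encoding):
    """Use the font's own code-to-character table to recover the text."""
    return "".join(encoding[code] for code in codes)


if __name__ == "__main__":
    stored = encode_with_font("ação", toy_font_encoding)
    print(repr(naive_extract(stored)))                 # control characters, not text
    print(extract_with_mapping(stored, toy_font_encoding))
```

In this model, re-opening the garbled output with a different text encoding cannot help, because the information needed to recover the letters lives in the font mapping, not in the exported file.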
posted by scruss at 9:12 AM on September 21, 2010

I am selecting "Save as Text" from the File menu. I only have Adobe Reader, not Acrobat. I don't see any way to change the encoding when exporting to text.
posted by drlith at 9:12 AM on September 21, 2010

The standard, or at least best, way to export PDFs to Word is to use software like ABBYY FineReader. But it's a $300 program.

Fortunately I have it on my computer.

If you want to email me the documents in question (assuming they're not confidential) I can convert them to Word documents.
posted by dfriedman at 9:17 AM on September 21, 2010

Sorry, my email address is available on my profile.
posted by dfriedman at 9:17 AM on September 21, 2010

If dfriedman's doesn't work, I have other converters for PDF to Word. Same offer applies.
posted by lampshade at 9:30 AM on September 21, 2010

Try Infix, it converts things really well.
posted by Brent Parker at 11:46 AM on September 21, 2010

How much formatting do you need to preserve? What if you just selected it with the text select tool, and then copy/pasted it?
posted by Galaxor Nebulon at 12:32 PM on September 21, 2010

Thanks for the offers and suggestions, everyone.

I've actually got a license for Readiris Pro, but after a recent hard drive failure I hadn't jumped through the hoops needed to reinstall and relicense it. I would have gone that route sooner if not for the fact that when I've previously had the "won't copy/paste and won't export" problem with other files, Readiris had been unable to convert them.*

But dfriedman's suggestion made me think that maybe I should go ahead and jump through the hoops, and after I reinstalled/relicensed Readiris, I was able to get the software to convert the file into a workable Word document with only minor glitches.

*In retrospect, I think the previous times I've had this problem, it has been due to file security issues rather than encoding issues.

(For what it's worth, I've been generally very happy with Readiris and use it fairly regularly in my work, and its $80 price tag is a lot more attractive than ABBYY FineReader's $300 one.)
posted by drlith at 1:39 PM on September 21, 2010

Ask MetaFilter
