How to copy/paste Cyrillic characters from PDF?
January 19, 2013 7:23 PM

For years, I've accepted as fact that Russian-language PDFs don't play nice with other programs: if you try to paste copied text into Word or Notepad, you get gibberish characters. (I work on two laptops, one with Windows Vista and Acrobat, the other with Windows 7 and Reader.) Are there any clever workarounds or programs I should know about?

Googling this issue only confused me more. Someone on one forum mentioned installing "freeware" Cyrillic fonts, but searching for that led me to some sketchy sites with (admittedly cool-looking) skateboarder-style graphic fonts, and I can't imagine those would help...
posted by lily_bart to Computers & Internet (7 answers total) 1 user marked this as a favorite
If you go to a Russian-language web site in Internet Explorer and copy and paste from it into Word or Notepad, do you get gibberish?
posted by XMLicious at 8:02 PM on January 19, 2013

Can you link to a PDF that is giving you trouble?

Most commonly used fonts contain the full Cyrillic alphabet. You might just be having encoding problems.
posted by hyperbovine at 8:03 PM on January 19, 2013

Just as a general answer to your question, a clever program for working with Unicode is BabelPad.

If you install the latest version, open it, and go to Tools → Font Analysis..., then in the top right of the dialog, under "List All Unicode Blocks Covered by this Font", there is a dropdown where you can pick any of the fonts on your system. If you pick one, "Cyrillic" will show up in the list beneath if the font covers it.
posted by XMLicious at 8:25 PM on January 19, 2013 [1 favorite]

Response by poster: XMLicious, I don't have any problem using Cyrillic fonts elsewhere, it's just trying to copy/paste from PDF.

hyperbovine, they're client files so unfortunately I can't share them, but your question made me look for other PDFs to try. I found this one at random, and I can successfully paste it into Word! So does this mean it's an encoding issue on their end?
posted by lily_bart at 8:53 PM on January 19, 2013

It means that the PDFs you're having trouble with are probably in a pre-Unicode 8-bit Cyrillic encoding like CP-1251.

So you'd just need a tool to convert from that to Unicode; I'm noticing that at the end of the above Wikipedia article there's a link to something called the Universal Cyrillic Decoder.
posted by XMLicious at 9:11 PM on January 19, 2013 [1 favorite]
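
For anyone who would rather script the conversion described above than use a web tool, here is a minimal Python sketch. It assumes the gibberish arose because CP-1251 bytes were displayed as Western European text (CP-1252); that is only a guess about what happened to these particular PDFs, and the function name `repair_mojibake` is made up for illustration.

```python
def repair_mojibake(garbled, wrong="cp1252", actual="cp1251"):
    """Undo a wrong decoding step.

    Assumption: the text was really encoded as `actual` but was
    displayed as if it were `wrong`. Re-encoding with `wrong` recovers
    the original bytes, which are then decoded with `actual`.
    """
    return garbled.encode(wrong).decode(actual)


if __name__ == "__main__":
    # Simulate the problem: CP-1251 bytes shown as CP-1252 text.
    garbled = "Привет".encode("cp1251").decode("cp1252")
    print(garbled)                   # Western accented letters
    print(repair_mojibake(garbled))  # back to Cyrillic
```

If the pasted text contains characters that do not exist in CP-1252, the `encode` step raises `UnicodeEncodeError`, which is a hint that a different pair of encodings (or a different cause entirely) is involved.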

Something I've done when trying to brute-force encoding problems is to copy and paste into a plain old text file (i.e. use Notepad, not Word), and then open that file in a browser. Some browsers will auto-detect some of the funky encodings; if not, they have a menu option somewhere that lets you select from a list of encodings. Then just keep trying different encodings until you find the right one. You should then probably be able to copy and paste from the browser into Word and have it work.
posted by zengargoyle at 11:38 PM on January 19, 2013
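
The same try-every-encoding approach can be sketched in Python instead of a browser menu. This is only an illustration: the candidate list and the function name `try_encodings` are assumptions, and you still have to eyeball the results to see which one reads as Russian.

```python
# Candidate encodings to try; an assumed list of common Cyrillic ones.
CANDIDATES = ["utf-8", "cp1251", "koi8_r", "cp866", "iso8859_5", "mac_cyrillic"]


def try_encodings(raw, candidates=CANDIDATES):
    """Decode the same bytes with each candidate encoding.

    Returns a dict mapping encoding name to decoded text, skipping
    encodings that cannot decode the bytes at all.
    """
    results = {}
    for name in candidates:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            continue  # these bytes are not valid in this encoding
    return results


if __name__ == "__main__":
    # Stand-in for the bytes of a saved text file (here: KOI8-R).
    raw = "Привет".encode("koi8_r")
    for name, text in try_encodings(raw).items():
        print(name, "->", text)
```

For a real file, the bytes would come from something like `open(path, "rb").read()`, where `path` is wherever you saved the pasted text.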

Art. Lebedev Studio's Decoder automatically converts weird old encodings (cp1251, koi8-r) to UTF-8.

Also, tell your client they should upgrade their software and stop using old encodings.
posted by floatboth at 6:53 AM on January 20, 2013
