Remove Japanese from dual-language PDF?
July 5, 2007 1:44 PM

How to strip Kanji characters from .pdf?

(Asking for a friend) I have a large number of PDFs containing both English and Japanese text. Is there an automated way to take out the Japanese writing (kanji), leaving only the English? Some kind of awesome regexp I can plug into Acrobat?

I have access to XP and OSX, and the entire CS3 suite on both, so surely something there must be able to help.
posted by djgh to Computers & Internet (5 answers total)
Depending on what your friend wants to do with the English text next, pdftotext may do what they want. See the paragraph starting with x86, DOS/Win32.
posted by Aidan Kehoe at 1:53 PM on July 5, 2007

If you open a PDF in Illustrator, you can probably edit it. As long as the characters are vector art, you should be able to select the ones you want gone & hit delete.
posted by miss lynnster at 2:29 PM on July 5, 2007

Response by poster: Aidan - I think he wants to leave the PDF as it is after having removed the Japanese.

miss lynnster - I believe that's what's happening at the moment, but he has 300 pages so would like to try and automate the process as much as possible.
posted by djgh at 2:51 PM on July 5, 2007

I had a similar problem trying to extract text in Czech from PDFs. I found that pdftotext corrupted all the non-ASCII characters in Czech in unpredictable ways (sometimes it would turn š into an 's', sometimes into a ç, etc.).

One thing to try: convert the whole thing to text, then find the blocks of 'corrupt' text (which is what the kanji will appear as) and regex them away. Defining what 'corrupt' means is the tricky part. If you found that all kanji translate into a certain range of ASCII characters, and that range didn't intersect the range of English, you could use a regex character class to strip out those blocks, e.g. [\002-\030]+. But if some kanji translate into regular ASCII characters, then I don't know what to do.
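A minimal Python sketch of that idea, for anyone who wants to try it. The 'corrupt' range here is only the octal range \002-\030 mentioned above (0x02 to 0x18 in hex), which is a guess and not something verified against real pdftotext output. One catch: that range also contains tab, newline and carriage return, so the pattern skips them to avoid eating the line breaks in the English text. The function name strip_corrupt_runs is made up for illustration.

```python
import re

# Assumed "corrupt" range: control characters 0x02-0x18 (octal \002-\030),
# minus tab (0x09), newline (0x0a) and carriage return (0x0d), which
# ordinary English text also uses. Inspect real output before trusting this.
CORRUPT_RUN = re.compile(r"[\x02-\x08\x0b\x0c\x0e-\x18]+")


def strip_corrupt_runs(text: str) -> str:
    """Delete every run of characters that falls in the assumed range."""
    return CORRUPT_RUN.sub("", text)


if __name__ == "__main__":
    sample = "Hello \x03\x10\x15 world"
    print(repr(strip_corrupt_runs(sample)))
```

Matching only runs of corrupt characters, with no wildcard in between, keeps any English that sits between two corrupt blocks on the same line.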

That's the best I can think of. If you find an elegant solution to this problem, I'd love to hear it.

If there were a pdftounicode, this would be easy. Anyone know of one?
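For what it's worth, here is a rough Python sketch of the 'easy' case, assuming the text has already been extracted as proper Unicode (some pdftotext builds accept an -enc UTF-8 option, but check that against the installed version). The list of blocks is a reasonable guess at what counts as Japanese, not an exhaustive one, and strip_japanese is a hypothetical name. Like the approach above, it produces cleaned text, not an edited PDF.

```python
import re

# Unicode blocks commonly used for Japanese text. This selection is an
# assumption for illustration; rarer kanji live in other CJK extension blocks.
JAPANESE = re.compile(
    "["
    "\u3000-\u303f"  # CJK symbols and punctuation
    "\u3040-\u309f"  # hiragana
    "\u30a0-\u30ff"  # katakana
    "\u4e00-\u9fff"  # CJK unified ideographs (most kanji)
    "\uff00-\uffef"  # halfwidth and fullwidth forms
    "]+"
)


def strip_japanese(text: str) -> str:
    """Remove runs of characters from the blocks listed above."""
    return JAPANESE.sub("", text)


if __name__ == "__main__":
    print(repr(strip_japanese("Price 価格: 100 円")))
```

The spaces around the removed runs are left alone, so a whitespace cleanup pass afterwards may be wanted.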
posted by molybdenum at 5:06 PM on July 5, 2007

On re-read: it sounds like you want to strip the Japanese out of the PDF in place. My solution would result in a separate text file. If you don't mind pasting that text file back into Distiller or something, it could work. If you need to make the change in-place, then ignore my earlier comment. :)
posted by molybdenum at 5:09 PM on July 5, 2007
