Remove Japanese from dual-language PDF?
July 5, 2007 1:44 PM Subscribe
How to strip kanji characters from a PDF?
(Asking for a friend) I have a large number of PDFs containing both English and Japanese writing. Is there an automated way to take out the Japanese kanji, leaving only the English? Some kind of awesome regexp I can plug into Acrobat?
I have access to XP and OSX, and the entire CS3 suite on both, so surely something there must be able to help.
If you open a PDF in Illustrator, you can probably edit it. As long as the characters are vector art, you should be able to select the ones you want gone & hit delete.
posted by miss lynnster at 2:29 PM on July 5, 2007
Response by poster: Aidan - I think he wants to keep the PDF as a PDF, just with the Japanese removed.
miss lynnster - I believe that's what's happening at the moment, but he has 300 pages, so he'd like to automate the process as much as possible.
posted by djgh at 2:51 PM on July 5, 2007
I had a similar problem trying to extract Czech text from PDFs. I found that pdftotext corrupted all the non-ASCII characters in unpredictable ways (sometimes it would turn š into an 's', sometimes into a ç, etc.).
One thing to try: convert the whole thing to text, then find the blocks of 'corrupt' text (which is what the kanji will appear as) and regex them away. Defining 'corrupt' is the tricky part. If all the kanji came out as a certain range of ASCII characters, and that range didn't overlap with normal English text, you could strip those blocks with a regex group, e.g. ([\002-\030].*[\002-\030]). But if some kanji come out as regular ASCII characters, then I don't know what to do.
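If your pdftotext has the -enc UTF-8 flag (xpdf's does), you can sidestep the corrupt-byte guessing entirely and match the real Japanese Unicode ranges instead. A rough, untested Python sketch of the idea (the filenames are placeholders):

    import re
    import subprocess

    # Assumes pdftotext is on your PATH; -enc UTF-8 asks it for real
    # Unicode output instead of a mangled 8-bit encoding.
    subprocess.run(["pdftotext", "-enc", "UTF-8", "input.pdf", "dump.txt"],
                   check=True)

    with open("dump.txt", encoding="utf-8") as f:
        text = f.read()

    # Strip the Japanese script blocks: CJK punctuation, hiragana and
    # katakana (U+3000-U+30FF), the CJK unified ideographs
    # (U+4E00-U+9FFF), and the fullwidth forms (U+FF00-U+FFEF).
    english_only = re.sub(r"[\u3000-\u30ff\u4e00-\u9fff\uff00-\uffef]+",
                          "", text)

    with open("english.txt", "w", encoding="utf-8") as f:
        f.write(english_only)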
That's the best I can think of. If you find an elegant solution to this problem, I'd love to hear it.
If there were a pdftounicode, this would be easy. Anyone know of one?
posted by molybdenum at 5:06 PM on July 5, 2007
On re-read: it sounds like you want to strip the Japanese out of the PDF in place. My solution would result in a separate text file. If you don't mind pasting that text file back into Distiller or something, it could work. If you need to make the change in place, then ignore my earlier comment. :)
posted by molybdenum at 5:09 PM on July 5, 2007