<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
     xmlns:admin="http://webns.net/mvcb/"
     xmlns:content="http://purl.org/rss/1.0/modules/content/"
     xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
	<channel> 

      <title>Comments on: Remove Japanese from dual-language PDF?</title>
      <link>http://ask.metafilter.com/66309/Remove-Japanese-from-duallanguage-PDF/</link>
      <description>Comments on Ask MetaFilter post Remove Japanese from dual-language PDF?</description>
	  	  <pubDate>Thu, 05 Jul 2007 13:53:49 -0800</pubDate>
      <lastBuildDate>Thu, 05 Jul 2007 13:53:49 -0800</lastBuildDate>
      <language>en-us</language>
	  <docs>http://blogs.law.harvard.edu/tech/rss</docs>
	  <ttl>60</ttl>

<item>
  	<title>Question: Remove Japanese from dual-language PDF?</title>
  	<link>http://ask.metafilter.com/66309/Remove-Japanese-from-duallanguage-PDF</link>	
  	<description>How can I strip kanji characters from a PDF? &lt;br /&gt;&lt;br /&gt; (Asking for a friend.) I have a large number of PDFs containing both English and Japanese writing. Is there an automated way to take out the Japanese writing, leaving only the English? Some kind of awesome regexp I can plug into Acrobat?&lt;br&gt;
&lt;br&gt;
I have access to XP and OS X, and the entire CS3 suite on both, so surely something there must be able to help.</description>
  	<guid isPermaLink="false">post:ask.metafilter.com,2008:site.66309</guid>
  	<pubDate>Thu, 05 Jul 2007 13:44:39 -0800</pubDate>
  	<dc:creator>djgh</dc:creator>
	
	<category>kanji</category>
	
	<category>english</category>
	
	<category>japanese</category>
	
	<category>pdf</category>
	
	<category>adobe</category>
	
	<category>acrobat</category>
	
	<category>automate</category>
	
</item>
<item>
  	<title>By: Aidan Kehoe</title>
  	<link>http://ask.metafilter.com/66309/Remove-Japanese-from-duallanguage-PDF#995334</link>	
  	<description>Depending on what your friend wants to do with the English text next, &lt;a href=&quot;http://www.foolabs.com/xpdf/download.html&quot;&gt;pdftotext&lt;/a&gt; may do what they want. See the paragraph starting with &lt;b&gt;x86, DOS/Win32.&lt;/b&gt;</description>
  	<guid isPermaLink="false">comment:ask.metafilter.com,2008:site.66309-995334</guid>
  	<pubDate>Thu, 05 Jul 2007 13:53:49 -0800</pubDate>
  	<dc:creator>Aidan Kehoe</dc:creator>
</item>
<item>
  	<title>By: miss lynnster</title>
  	<link>http://ask.metafilter.com/66309/Remove-Japanese-from-duallanguage-PDF#995371</link>	
  	<description>If you open a PDF in Illustrator, you can probably edit it. As long as the characters are vector art, you should be able to select the ones you want gone &amp;amp; hit delete.</description>
  	<guid isPermaLink="false">comment:ask.metafilter.com,2008:site.66309-995371</guid>
  	<pubDate>Thu, 05 Jul 2007 14:29:02 -0800</pubDate>
  	<dc:creator>miss lynnster</dc:creator>
</item>
<item>
  	<title>By: djgh</title>
  	<link>http://ask.metafilter.com/66309/Remove-Japanese-from-duallanguage-PDF#995393</link>	
  	<description>&lt;strong&gt;Aidan&lt;/strong&gt; - I think he wants to leave the PDF as it is after having removed the Japanese.&lt;br&gt;
&lt;br&gt;
&lt;strong&gt;miss lynnster&lt;/strong&gt; - I believe that&apos;s what&apos;s happening at the moment, but he has 300 pages so would like to try and automate the process as much as possible.</description>
  	<guid isPermaLink="false">comment:ask.metafilter.com,2008:site.66309-995393</guid>
  	<pubDate>Thu, 05 Jul 2007 14:51:10 -0800</pubDate>
  	<dc:creator>djgh</dc:creator>
</item>
<item>
  	<title>By: molybdenum</title>
  	<link>http://ask.metafilter.com/66309/Remove-Japanese-from-duallanguage-PDF#995509</link>	
  	<description>I had a similar problem trying to extract Czech text from PDFs.  I found that pdftotext corrupted all the non-ASCII Czech characters in unpredictable ways (sometimes it would turn a character into an &apos;s&apos;, sometimes into a &#xe7;, etc.).&lt;br&gt;
&lt;br&gt;
One thing to try: convert the whole thing to text, then find the blocks of &apos;corrupt&apos; text (which is what the kanji will appear as) and regex them away.  Defining what &apos;corrupt&apos; means is the tricky part.  If you found that all kanji translate into a certain range of character codes, and that range didn&apos;t intersect the range used by English, you could use a regex character class to strip out those runs, e.g. [\002-\030]+ (a greedy pattern like [\002-\030].*[\002-\030] would also eat any English sitting between two corrupt runs).  But if some kanji translate into regular ASCII characters, then I don&apos;t know what to do.&lt;br&gt;
&lt;br&gt;
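A rough sketch of the convert-then-regex idea above, in Python.  It assumes the extraction step hands you proper Unicode text rather than corrupted bytes (pdftotext has an -enc UTF-8 option that may help with that), and the character ranges and the strip_japanese name are my own picks for illustration, so adjust them to whatever actually shows up in the files.  Note that it also throws away fullwidth Latin letters, which may or may not be what you want.

```python
import re

# Assumed Unicode blocks that cover most Japanese text; adjust as needed.
JAPANESE_RUN = re.compile(
    "["
    "\u3000-\u303f"   # CJK symbols and punctuation (includes the ideographic space)
    "\u3040-\u309f"   # hiragana
    "\u30a0-\u30ff"   # katakana
    "\u3400-\u4dbf"   # CJK unified ideographs, extension A
    "\u4e00-\u9fff"   # CJK unified ideographs (the common kanji)
    "\uff00-\uffef"   # halfwidth and fullwidth forms
    "]+"
)


def strip_japanese(text: str) -> str:
    """Remove runs of Japanese characters, then collapse the leftover spaces."""
    without_japanese = JAPANESE_RUN.sub("", text)
    # Removing a run between two spaces leaves a double space behind.
    return re.sub(r"[ \t]{2,}", " ", without_japanese).strip()
```

For instance, strip_japanese("Hello 世界 world") gives "Hello world".  This only cleans up the extracted text; it does nothing to the PDF itself.&lt;br&gt;
&lt;br&gt;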
That&apos;s the best I can think of.  If you find an elegant solution to this problem, I&apos;d love to hear it.&lt;br&gt;
&lt;br&gt;
If there were a pdftounicode, this would be easy.  Anyone know of one?</description>
  	<guid isPermaLink="false">comment:ask.metafilter.com,2008:site.66309-995509</guid>
  	<pubDate>Thu, 05 Jul 2007 17:06:44 -0800</pubDate>
  	<dc:creator>molybdenum</dc:creator>
</item>
<item>
  	<title>By: molybdenum</title>
  	<link>http://ask.metafilter.com/66309/Remove-Japanese-from-duallanguage-PDF#995510</link>	
  	<description>On re-read: it sounds like you want to strip the Japanese out of the PDF in place.  My solution would result in a separate text file.  If you don&apos;t mind pasting that text file back into Distiller or something, it could work.  If you need to make the change in-place, then ignore my earlier comment. :)</description>
  	<guid isPermaLink="false">comment:ask.metafilter.com,2008:site.66309-995510</guid>
  	<pubDate>Thu, 05 Jul 2007 17:09:42 -0800</pubDate>
  	<dc:creator>molybdenum</dc:creator>
</item>

    </channel>
</rss>
