<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
     xmlns:admin="http://webns.net/mvcb/"
     xmlns:content="http://purl.org/rss/1.0/modules/content/"
     xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
	<channel> 

	<title>Comments on: In-place replacement of charset barf in static html files?</title>
	<link>http://ask.metafilter.com/107654/Inplace-replacement-of-charset-barf-in-static-html-files/</link>
	<description>Comments on Ask MetaFilter post In-place replacement of charset barf in static html files?</description>
	<pubDate>Mon, 24 Nov 2008 19:57:44 -0800</pubDate>
	<lastBuildDate>Mon, 24 Nov 2008 19:57:44 -0800</lastBuildDate>
	<language>en-us</language>
	<docs>http://blogs.law.harvard.edu/tech/rss</docs>
	<ttl>60</ttl>

	<item>
		<title>Question: In-place replacement of charset barf in static html files?</title>
		<link>http://ask.metafilter.com/107654/Inplace-replacement-of-charset-barf-in-static-html-files</link>	
		<description>How do I replace non-printable barf from charset mangling with sed/awk or perl? I have a collection of flat HTML files which at some point in the past got corrupted charset-wise. You can see an example broken file &lt;a href=&quot;http://www.ibiblio.org/ipa/poems/wilbur/a_fable.php&quot;&gt;here&lt;/a&gt;. Apache serves them up as UTF-8 in a clearly broken way, but dropping in a .htaccess to force iso-8859-1 doesn&apos;t help (see &lt;a href=&quot;http://www.ibiblio.org/ipa/poems/wilbur/iso-8859-1/a_fable.php&quot;&gt;here&lt;/a&gt;) and ditto windows-1252 (see &lt;a href=&quot;http://www.ibiblio.org/ipa/poems/wilbur/windows-1252/a_fable.php&quot;&gt;here&lt;/a&gt;). When I open the files in vim or less, I see &quot;&amp;lt;89&amp;gt;&quot; as if it were one char for what should be &apos; (right curly quotation mark). I don&apos;t know how to replace that in a programmatic way since it&apos;s not a literal bracket-eight-nine-bracket. Halp?</description>
		<guid isPermaLink="false">post:ask.metafilter.com,2008:site.107654</guid>
		<pubDate>Mon, 24 Nov 2008 19:39:09 -0800</pubDate>
		<dc:creator>tarheelcoxn</dc:creator>
		
			<category>charset</category>
		
			<category>mangling</category>
		
			<category>resolved</category>
		
	</item> <item>
		<title>By: 31d1</title>
		<link>http://ask.metafilter.com/107654/Inplace-replacement-of-charset-barf-in-static-html-files#1551563</link>	
		<description>tr -cd &apos;\11\12\40-\176&apos;</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2008:site.107654-1551563</guid>
		<pubDate>Mon, 24 Nov 2008 19:57:44 -0800</pubDate>
		<dc:creator>31d1</dc:creator>
	</item><item>
		<title>By: tarheelcoxn</title>
		<link>http://ask.metafilter.com/107654/Inplace-replacement-of-charset-barf-in-static-html-files#1551564</link>	
		<description>31d1 can you unpack that a bit for me? I have another string I&apos;d like to replace, for example: &amp;lt;D6&amp;gt; with &#8212; (em dash). I&apos;d like a generalized answer if possible. Thanks in advance!</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2008:site.107654-1551564</guid>
		<pubDate>Mon, 24 Nov 2008 20:01:08 -0800</pubDate>
		<dc:creator>tarheelcoxn</dc:creator>
	</item><item>
		<title>By: 31d1</title>
		<link>http://ask.metafilter.com/107654/Inplace-replacement-of-charset-barf-in-static-html-files#1551569</link>	
		<description>That just removes high ASCII (every byte above 127). Replacement is a different beast, but you can probably use a similar notation in sed if you figure out what the \xx is for the byte you want to swap, the em-dash one for example. Try apt-get install ascii for some conversion charts maybe. I&apos;ve only really needed to just strip the barf out myself, and I guard that little tr snippet like gold, but I haven&apos;t had to get deeper than that.</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2008:site.107654-1551569</guid>
		<pubDate>Mon, 24 Nov 2008 20:05:58 -0800</pubDate>
		<dc:creator>31d1</dc:creator>
	</item><item>
		<title>By: 31d1</title>
		<link>http://ask.metafilter.com/107654/Inplace-replacement-of-charset-barf-in-static-html-files#1551572</link>	
		<description>As far as unpacking that: -d means delete and -c means complement, so it&apos;s saying delete every byte except \11 (tab), \12 (newline), and \40-\176 (space through tilde, i.e. printable ASCII). The numbers are octal.</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2008:site.107654-1551572</guid>
		<pubDate>Mon, 24 Nov 2008 20:11:11 -0800</pubDate>
		<dc:creator>31d1</dc:creator>
	</item><item>
		<title>By: inkyz</title>
		<link>http://ask.metafilter.com/107654/Inplace-replacement-of-charset-barf-in-static-html-files#1551598</link>	
		<description>tr is short for &apos;translate&apos;, so you can use it to replace characters as well as delete them. For instance, to replace all the &amp;lt;89&amp;gt;s with apostrophes, you&apos;d do&lt;br&gt;
&lt;br&gt;
cat a_fable.php | tr &apos;\211&apos; &quot;&apos;&quot; &amp;gt; a_fable.php.fixed&lt;br&gt;
&lt;br&gt;
(had to use double quotes in the second case because it&apos;s enclosing a single quote, of course)&lt;br&gt;
&lt;br&gt;
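For reference, the \211 above is just hex 89 written in octal. A minimal Python sketch of that conversion and of the same one-byte swap follows; the sample bytes are made up, and it assumes 0x89 always stands for the apostrophe.&lt;br&gt;
&lt;br&gt;
&lt;pre&gt;
```python
# Hypothetical sample bytes; 0x89 stands in for the mangled apostrophe.
sample = b"The fox\x89s tale"

# tr takes octal escapes, so hex 89 has to be written as \211.
octal_form = format(0x89, "o")

# bytes.maketrans builds a one-to-one byte table, much like tr SET1 SET2.
table = bytes.maketrans(b"\x89", b"'")
fixed = sample.translate(table)

print(octal_form)
print(fixed)
```
&lt;/pre&gt;&lt;br&gt;
&lt;br&gt;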
But tr doesn&apos;t handle hex, so you&apos;d have to hand-convert the numbers to their octal equivalents. I&apos;d suggest using sed instead:&lt;br&gt;
&lt;br&gt;
cat a_fable.php | sed &quot;s,\x89,&apos;,g&quot; &amp;gt; a_fable.php.fixed</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2008:site.107654-1551598</guid>
		<pubDate>Mon, 24 Nov 2008 20:27:03 -0800</pubDate>
		<dc:creator>inkyz</dc:creator>
	</item><item>
		<title>By: enn</title>
		<link>http://ask.metafilter.com/107654/Inplace-replacement-of-charset-barf-in-static-html-files#1551608</link>	
		<description>A quick Googling doesn&apos;t give any hints as to what encoding that might be &amp;mdash; if it did, I&apos;d use &lt;code&gt;iconv&lt;/code&gt; rather than replacing each character by hand. But using &lt;code&gt;sed&lt;/code&gt;:&lt;br&gt;
&lt;br&gt;
&lt;code&gt;sed &apos;s/\x89/\&amp;amp;rsquo;/g&apos; a_fable.php &amp;gt; a_fable.php.corrected&lt;/code&gt;&lt;br&gt;
&lt;br&gt;
should do the trick, if you want to use HTML entities &amp;mdash; of course you can replace \&amp;amp;rsquo; with a literal UTF-8 right single quote or what have you.&lt;br&gt;
&lt;br&gt;
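For the literal UTF-8 route, here is a minimal Python sketch of the same substitution. The sample bytes are made up, and it assumes every 0x89 in the file really is a mangled right single quote.&lt;br&gt;
&lt;br&gt;
&lt;pre&gt;
```python
# Hypothetical sample; assumes each 0x89 byte stands for a right single quote.
sample = b"the fox\x89s tale, the crow\x89s cheese"

# U+2019 is three bytes in UTF-8 (e2 80 99), so the output grows.
right_quote = "\u2019".encode("utf-8")

# bytes.replace swaps every occurrence, like the g flag in the sed command.
fixed = sample.replace(b"\x89", right_quote)

print(fixed.decode("utf-8"))
```
&lt;/pre&gt;&lt;br&gt;
Working on bytes sidesteps the question of how well the tool handles multi-byte characters.&lt;br&gt;
&lt;br&gt;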
On preview: inkyz has it.</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2008:site.107654-1551608</guid>
		<pubDate>Mon, 24 Nov 2008 20:32:38 -0800</pubDate>
		<dc:creator>enn</dc:creator>
	</item><item>
		<title>By: tarheelcoxn</title>
		<link>http://ask.metafilter.com/107654/Inplace-replacement-of-charset-barf-in-static-html-files#1551634</link>	
		<description>Thanks so much to both of you. Based on your feedback I ran two lines and I think things are mostly fixed. Two lines were:&lt;br&gt;
&lt;br&gt;
&lt;pre&gt;&lt;br&gt;
  find . -type f -exec sed -i.old &quot;s,\x89,&apos;,g&quot; {} \;&lt;br&gt;
&lt;/pre&gt;&lt;br&gt;
and:&lt;br&gt;
&lt;pre&gt;&lt;br&gt;
  find . -type f -exec sed -i.broken &quot;s,\xD6,&#8212;,g&quot; {} \;&lt;br&gt;
&lt;/pre&gt;&lt;br&gt;
more digging...</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2008:site.107654-1551634</guid>
		<pubDate>Mon, 24 Nov 2008 20:51:14 -0800</pubDate>
		<dc:creator>tarheelcoxn</dc:creator>
	</item><item>
		<title>By: tarheelcoxn</title>
		<link>http://ask.metafilter.com/107654/Inplace-replacement-of-charset-barf-in-static-html-files#1551644</link>	
		<description>eek! I&apos;m dumb. DO NOT use &quot;-type f&quot; on its own, because sed will happily break your .gif, .png, and other binary files. whoops! Had to restore those from backups.</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2008:site.107654-1551644</guid>
		<pubDate>Mon, 24 Nov 2008 21:06:14 -0800</pubDate>
		<dc:creator>tarheelcoxn</dc:creator>
	</item><item>
		<title>By: inkyz</title>
		<link>http://ask.metafilter.com/107654/Inplace-replacement-of-charset-barf-in-static-html-files#1551653</link>	
		<description>I assume you&apos;ve figured this out, but you can do -type f -name &apos;*.php&apos; to get files (i.e., not directories) ending in &apos;.php&apos;.</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2008:site.107654-1551653</guid>
		<pubDate>Mon, 24 Nov 2008 21:23:52 -0800</pubDate>
		<dc:creator>inkyz</dc:creator>
	</item><item>
		<title>By: Kadin2048</title>
		<link>http://ask.metafilter.com/107654/Inplace-replacement-of-charset-barf-in-static-html-files#1551691</link>	
		<description>At first when I was reading this I thought I was sure of the problem: the difference between Windows Latin 1 (code page 1252) and ISO Latin 1, which are *not* the same.  That can cause corruption of a very specific bunch of characters, because Windows Latin 1 puts stuff in values that are nonprinting control codes in ISO Latin 1.  Specifically it&apos;s characters in the range 128 - 159.  Converting these &quot;gremlins&quot; to Unicode is a &lt;a href=&quot;http://effbot.org/zone/unicode-gremlins.htm&quot;&gt;well-known problem&lt;/a&gt; with many solutions available in your choice of modern programming/scripting languages.&lt;br&gt;
&lt;br&gt;
Unfortunately, your mention of hex D6 threw that theory out the window, since D6 is 214, outside the problematic Windows/ISO Latin-1 range.&lt;br&gt;
&lt;br&gt;
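To make the range concrete, a small Python sketch using the standard cp1252 and latin-1 codecs. The byte values are the two from this thread plus 0x92 for comparison; nothing here identifies the real source encoding.&lt;br&gt;
&lt;br&gt;
&lt;pre&gt;
```python
# Bytes in 0x80-0x9F are printable in Windows-1252 but C1 controls in ISO Latin-1.
in_range = bytes([0x89, 0x92])
print(in_range.decode("cp1252"))         # per mille sign, right single quote
print(repr(in_range.decode("latin-1")))  # two control characters

# 0xD6 is 214, above that range, and decodes identically in both charsets.
d6 = bytes([0xD6])
print(0xD6, d6.decode("cp1252") == d6.decode("latin-1"))

# cp1252 does not map 0x89 to a right quote or 0xD6 to an em dash.
print(in_range.decode("cp1252")[0] == "\u2019", d6.decode("cp1252") == "\u2014")
```
&lt;/pre&gt;&lt;br&gt;
&lt;br&gt;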
If you want to do replacement rather than just stripping, it&apos;s pretty important to identify the character set that the files are currently in, or at least the problematic values are from.  Otherwise you&apos;ll end up having to go through the files by hand, identifying each value and picking the character you actually want to replace it with.  (That&apos;s a valid solution but it&apos;ll be time consuming, and I got the feeling you were looking for a solution that could be automated.)&lt;br&gt;
&lt;br&gt;
I couldn&apos;t find a charset that had 0xD6 mapping to the em dash, though, if that was an actual example from one of your files...that doesn&apos;t bode well for doing this automatically.  Unless you are absolutely sure that 0xD6 &lt;i&gt;always&lt;/i&gt; will map to &#8212;, in which case you could build up a pseudo-charset of your own...but it would be odd to have a consistent mapping that doesn&apos;t match an existing character set.&lt;br&gt;
&lt;br&gt;
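If the mapping does turn out to be consistent, a home-made table could be as small as the Python sketch below. The two entries are the only ones mentioned in this thread; the names GREMLINS and fix_bytes, and the choice to read every other byte as Latin-1, are my assumptions and not something established about these files.&lt;br&gt;
&lt;br&gt;
&lt;pre&gt;
```python
# Hand-built pseudo-charset: byte value -> intended character.
# Only these two mappings come from the thread; extend as more turn up.
GREMLINS = {0x89: "\u2019", 0xD6: "\u2014"}


def fix_bytes(raw: bytes) -> bytes:
    """Return raw re-encoded as UTF-8, applying GREMLINS first."""
    # latin-1 maps each byte to the code point of the same number, so
    # decoding never fails and str.translate can use the table directly.
    # Assumption: bytes outside the table really are Latin-1.
    text = raw.decode("latin-1").translate(GREMLINS)
    return text.encode("utf-8")


sample = b"the crow\x89s cheese \xd6 gone"
print(fix_bytes(sample).decode("utf-8"))
```
&lt;/pre&gt;&lt;br&gt;
Plain ASCII passes through unchanged, so running it over text that has no gremlins is harmless.&lt;br&gt;
&lt;br&gt;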
Anyway, if you do make up a translation table, the Python script I linked to earlier ought to provide an example of a method to find and replace characters with appropriate Unicode values.  You could also use sed if you wanted to do it one character at a time, but I&apos;m not sure what sed&apos;s Unicode / non-ASCII support is like, which is an issue since you&apos;d probably want to replace some characters with multi-byte UTF-8 equivalents, and that could get ugly if there&apos;s no built-in Unicode support.</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2008:site.107654-1551691</guid>
		<pubDate>Mon, 24 Nov 2008 21:57:33 -0800</pubDate>
		<dc:creator>Kadin2048</dc:creator>
	</item><item>
		<title>By: tarheelcoxn</title>
		<link>http://ask.metafilter.com/107654/Inplace-replacement-of-charset-barf-in-static-html-files#1551725</link>	
		<description>oof. it&apos;s amazing how many different kinds of broken there were in that archive. &lt;a href=&quot;http://www.ibiblio.org/ipa/poems/pinsky/impossible_to_tell.php&quot;&gt;This&lt;/a&gt; seems to have been the last one needing repair. I owe both 31d1 and inkyz beers if you&apos;re ever in Carrboro, NC. mefimail, twitter, gmail, etc.&lt;br&gt;
&lt;br&gt;
Now a miracle will happen and nobody else will notice what I just noticed about all the .ram audio links from the &apos;90s....</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2008:site.107654-1551725</guid>
		<pubDate>Mon, 24 Nov 2008 22:30:56 -0800</pubDate>
		<dc:creator>tarheelcoxn</dc:creator>
	</item>
	</channel>
</rss>
