In-place replacement of charset barf in static html files?
November 24, 2008 7:39 PM

How do I replace non-printable barf from charset mangling with sed/awk or perl? I have a collection of flat HTML files which at some point in the past got corrupted charset-wise. You can see an example broken file here. Apache serves them up as UTF-8 in a clearly broken way, but dropping in a .htaccess to force ISO-8859-1 doesn't help (see here), and ditto for windows-1252 (see here). When I open the files in vim or less, I see "<89>" displayed as if it were a single character where there should be a ’ (right curly quotation mark). I don't know how to replace that programmatically, since it's not a literal bracket-eight-nine-bracket. Halp?
posted by tarheelcoxn to Computers & Internet (11 answers total)
tr -cd '\11\12\40-\176'
posted by 31d1 at 7:57 PM on November 24, 2008

31d1, can you unpack that a bit for me? I have another string I'd like to replace, for example <D6> with — (em dash). I'd like a generalized answer if possible. Thanks in advance!
posted by tarheelcoxn at 8:01 PM on November 24, 2008

That just removes high ASCII; replacement is a different beast. You can probably use a similar notation in sed once you figure out what the \xx code is for, say, the em dash. Try apt-get install ascii for some conversion charts, maybe. I've only ever needed to strip the barf out myself, and I guard that little tr snippet like gold, but I haven't had to go deeper than that.
posted by 31d1 at 8:05 PM on November 24, 2008

As for unpacking that: -d means delete and -c means complement, so it says delete everything except \11 (tab), \12 (newline), and \40-\176 (space through tilde, i.e. the printable ASCII characters). The numbers are octal.
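
A minimal sketch of that tr call wrapped in a function, in case it helps to try it on a known input first. The name strip_to_ascii is made up for illustration, and LC_ALL=C is an added assumption so that tr treats the input as raw bytes regardless of locale.

```shell
# strip_to_ascii is a hypothetical helper name, not something from the thread.
# It keeps tab (\11), newline (\12) and printable ASCII from space (\40)
# through tilde (\176), and deletes every other byte.
# LC_ALL=C (an assumption) makes tr work on bytes rather than locale characters.
strip_to_ascii() {
    LC_ALL=C tr -cd '\11\12\40-\176'
}

# Usage: the stray byte \211 (hex 89) is dropped, the rest passes through.
#   printf 'don\211t panic\n' | strip_to_ascii
```

Note that this only strips; the apostrophe that the byte stood for is lost, which is why the later answers switch to replacement.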
posted by 31d1 at 8:11 PM on November 24, 2008

tr is short for 'translate', so you can use it to replace characters as well as delete them. For instance, to replace all the <89>s with apostrophes, you'd do

tr '\211' "'" < a_fable.php > a_fable.php.fixed

(the second argument is in double quotes because it encloses a single quote, of course)

But tr doesn't handle hex, so you'd have to hand-convert the numbers to their octal equivalents. I'd suggest using sed instead:

sed "s,\x89,',g" a_fable.php > a_fable.php.fixed
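
If you would rather stay with tr, the hex-to-octal conversion does not have to be done by hand: printf can do the arithmetic. A small sketch, where hex_to_octal_escape is a hypothetical helper name:

```shell
# hex_to_octal_escape is a hypothetical helper, not part of any standard tool.
# It takes a hex byte value as vim/less display it (e.g. 89 for "<89>")
# and prints the backslash-octal escape that tr understands.
hex_to_octal_escape() {
    printf '\\%o\n' "0x$1"
}

# Usage:
#   hex_to_octal_escape 89   # prints \211
#   hex_to_octal_escape D6   # prints \326
```

The arithmetic behind the first one: hex 89 is 8*16 + 9 = 137, and 137 is 2*64 + 1*8 + 1, hence octal 211.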
posted by inkyz at 8:27 PM on November 24, 2008

A quick Googling doesn't give any hints as to what encoding that might be — if it did, I'd use iconv rather than replacing each character by hand. But using sed:

sed 's/\x89/\&rsquo;/g' a_fable.php > a_fable.php.corrected

should do the trick, if you want to use HTML entities — of course you can replace \&rsquo; with a literal UTF-8 right single quote or what have you.

On preview: inkyz has it.
posted by enn at 8:32 PM on November 24, 2008

Thanks so much to both of you. Based on your feedback I ran two lines and I think things are mostly fixed. Two lines were:
  find . -type f -exec sed -i.old "s,\x89,’," {} \;
  find . -type f -exec sed -i.broken "s,\xD6,—," {} \;
more digging...
posted by tarheelcoxn at 8:51 PM on November 24, 2008

eek! I'm dumb. DO NOT use "-type f" on its own, because sed will happily rewrite your .gif, .png, and other binary files too. whoops! Had to restore those from backups.
posted by tarheelcoxn at 9:06 PM on November 24, 2008

I assume you've figured this out, but you can do -type f -name '*.php' to get only regular files (i.e., not directories) whose names end in '.php'.
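
Put together with the earlier sed command, a restricted version might look roughly like this. It is a sketch only: fix_php_quotes is a made-up name, the &rsquo; entity is borrowed from enn's answer, and it assumes a sed that accepts -i with an attached backup suffix (GNU and BSD sed do; plain POSIX sed does not).

```shell
# fix_php_quotes is a hypothetical helper; pass it the directory to walk.
# Only regular files named *.php are edited, so images and other binaries
# are left alone. Each edited file gets a .old backup next to it.
fix_php_quotes() {
    bad=$(printf '\211')    # the stray byte that vim shows as <89>
    # LC_ALL=C (an assumption) makes sed match raw bytes, not locale characters.
    LC_ALL=C find "$1" -type f -name '*.php' \
        -exec sed -i.old "s/$bad/\\&rsquo;/g" {} +
}

# Usage:
#   fix_php_quotes .
```

Running it on a copy of the tree first is cheap insurance, given what happened with the images.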
posted by inkyz at 9:23 PM on November 24, 2008

At first when I was reading this I thought I was sure of the problem: the difference between Windows Latin 1 (code page 1252) and ISO Latin 1, which are *not* the same. That can cause corruption of a very specific bunch of characters, because Windows Latin 1 puts printable characters at values that are nonprinting control codes in ISO Latin 1. Specifically, it's the characters in the range 128-159 (hex 80-9F). Converting these "gremlins" to Unicode is a well-known problem with many solutions available in your choice of modern programming/scripting languages.

Unfortunately, your mention of hex D6 threw that theory out the window, since D6 is 214, outside the problematic Windows/ISO Latin-1 range.

If you want to do replacement rather than just stripping, it's pretty important to identify the character set that the files are currently in, or at least the problematic values are from. Otherwise you'll end up having to go through the files by hand, identifying each value and picking the character you actually want to replace it with. (That's a valid solution but it'll be time consuming, and I got the feeling you were looking for a solution that could be automated.)
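
For comparison, this is roughly what the automated route looks like once the source charset is known. The sketch uses WINDOWS-1252 purely as a stand-in; that is an assumption for illustration, and as noted it does not seem to match the bytes reported in this thread. to_utf8 is a hypothetical wrapper name.

```shell
# to_utf8 is a hypothetical wrapper around iconv.
# The first argument is the source charset; WINDOWS-1252 is only a stand-in
# default here, NOT a diagnosis of the files discussed in this thread.
to_utf8() {
    iconv -f "${1:-WINDOWS-1252}" -t UTF-8
}

# Usage: in code page 1252 the right single quote is byte \222 (hex 92),
# which comes out as the three UTF-8 bytes e2 80 99.
#   printf 'it\222s\n' | to_utf8 | od -An -tx1
```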

I couldn't find a charset that had 0xD6 mapping to the em dash, though, if that was an actual example from one of your files...that doesn't bode well for doing this automatically. Unless you are absolutely sure that 0xD6 always will map to —, in which case you could build up a pseudo-charset of your own...but it would be odd to have a consistent mapping that doesn't match an existing character set.

Anyway, if you do make up a translation table, the Python script I linked to earlier ought to provide an example of a method to find and replace characters with appropriate Unicode values. You could also use sed if you wanted to do it one character at a time, but I'm not sure what sed's Unicode / non-ASCII support is like. That matters because you'd probably want to replace some characters with multi-byte UTF-8 equivalents, and that could get ugly if there's no built-in Unicode support.
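
One way to sidestep the Unicode-support question is to treat everything as bytes: build both the bad byte and its UTF-8 replacement with printf and let sed run in the C locale. A sketch under those assumptions, using only the two mappings mentioned in this thread (fix_bytes and the variable names are made up):

```shell
# fix_bytes is a hypothetical filter implementing a two-entry translation table:
#   \211 (hex 89) -> right single quote, U+2019, UTF-8 bytes e2 80 99
#   \326 (hex D6) -> em dash,            U+2014, UTF-8 bytes e2 80 94
# The mappings come from the asker's examples, not from any known charset.
fix_bytes() {
    bad_quote=$(printf '\211'); good_quote=$(printf '\342\200\231')
    bad_dash=$(printf '\326');  good_dash=$(printf '\342\200\224')
    # LC_ALL=C makes sed operate on single bytes, so no Unicode support is needed.
    LC_ALL=C sed -e "s/$bad_quote/$good_quote/g" \
                 -e "s/$bad_dash/$good_dash/g"
}

# Usage:
#   fix_bytes < a_fable.php > a_fable.php.fixed
```

More table entries are just more -e expressions, as long as the replacement bytes contain no /, & or backslash.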
posted by Kadin2048 at 9:57 PM on November 24, 2008

oof. it's amazing how many different kinds of broken there were in that archive. This seems to have been the last one needing repair. I owe both 31d1 and inkyz beers if you're ever in Carrboro, NC. mefimail, twitter, gmail, etc.

Now a miracle will happen and nobody else will notice what I just noticed about all the .ram audio links from the '90s....
posted by tarheelcoxn at 10:30 PM on November 24, 2008
