A Grave Problem with HTML Entities.
April 20, 2006 6:40 PM
Is there a Unix/Mac OS X utility that can batch-fix badly coded HTML entities?
So I'm trying to convert 7 years of HTML files that were hand-coded and hand-managed in Pagemill 3.0 for Windows. They're a mess, but that's okay for the most part. I'm converting them from HTML-esque to XHTML 4.0. They will eventually get redone in a custom XML schema.
HTML Tidy is generally doing a bang-up job, but I'm having trouble with a bunch of files that have poorly implemented HTML entities.
The problem is that Pagemill took characters from the Windows character set and created numeric entities out of their byte values.
HTML Tidy will convert them to Unicode equivalents. The problem is that it doesn't do it correctly, or I should say it doesn't recognize the problem. It will turn & #148; into & igrave; instead of & ldquo; (I put spaces in there to stop the posting process from converting them.)
These improperly coded entities tend to choke or confuse every other utility I've tried to throw at 'em (recode, html2text.py). Any ideas on how to fix these things short of learning Perl overnight?
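To make the problem concrete, here is a minimal Python sketch of the remapping involved. It assumes the numeric entities in the 128-159 range really are Windows-1252 byte values, as described above. The function name `fix_cp1252_refs` is made up for illustration, and this is not something HTML Tidy or recode provides.

```python
import re

# Matches decimal numeric character references such as &#147;
_NUMERIC_REF = re.compile(r"&#(\d+);")


def fix_cp1252_refs(html):
    """Rewrite &#128;..&#159; as the Unicode code points that the same
    byte values stand for in Windows-1252 (an assumption about the input)."""

    def repl(match):
        code = int(match.group(1))
        if 128 <= code <= 159:
            try:
                char = bytes([code]).decode("cp1252")
            except UnicodeDecodeError:
                # A few values (129, 141, 143, 144, 157) are unassigned
                # in Windows-1252; leave those references untouched.
                return match.group(0)
            return "&#%d;" % ord(char)
        # Anything outside the Windows-specific range is left alone.
        return match.group(0)

    return _NUMERIC_REF.sub(repl, html)


if __name__ == "__main__":
    print(fix_cp1252_refs("&#147;quoted&#148; &#150; caf&#233;"))
```

In Windows-1252, 147 and 148 are the left and right double quotation marks, so the sample line comes out as `&#8220;quoted&#8221; &#8211; caf&#233;`. Running something like this over each file before HTML Tidy should give Tidy valid Unicode references to work with. It only handles decimal references, so hexadecimal ones would need an extra pattern.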
posted by Charlie Bucket to computers & internet (9 comments total)
posted by kcm at 6:52 PM on April 20, 2006