A Grave Problem with HTML Entities.
April 20, 2006 6:40 PM

Is there a Unix/Mac OS X utility that can batch-fix badly coded HTML entities?

So I'm trying to convert 7 years of HTML files from hand-coded and hand-managed on Pagemill 3.0 for Windows. They're a mess, but that's okay for the most part. I'm converting them from HTML-esque to XHTML 4.0. They will eventually get redone in a custom XML schema.

HTML Tidy is generally doing a bang-up job, but I'm having trouble with a bunch of files that have poorly implemented HTML entities.

The problem is that Pagemill took the Windows character set and created numeric entities out of them.

HTML Tidy will convert them to Unicode equivalents. The problem is that it doesn't do so correctly, or I should say it doesn't recognize the problem. It will turn & #148; into & igrave; instead of & ldquo; (I put spaces in there to stop the posting process from converting them.)

These improperly coded entities tend to choke or confuse every other utility I've tried to throw at 'em (recode, html2text.py). Any ideas how to fix these things short of learning Perl overnight?
posted by Charlie Bucket to Computers & Internet (9 answers total)
I'd have to say that, given a map data structure of old -> new entities, you could probably learn enough Perl to do it yourself in two lines more quickly than you'd find a utility that works for your specific case. If HTML Tidy and friends don't handle it, other tools probably won't either, since those are already pretty robust. My vote is for learning a skill and doing it yourself. :)
posted by kcm at 6:52 PM on April 20, 2006

perl -pe 'BEGIN { %new = ("&#148;" => "&rdquo;"); }  # example entry; add more old => new pairs here
  s/(&#\d+;)/exists $new{$1} ? $new{$1} : $1/ge' file.html

it really can be quite simple if you get ahold of perl-fu.
posted by kcm at 6:54 PM on April 20, 2006

Probably they confuse other utilities because they're not really valid; code point 148 in the Windows-1252 character set is indeed a right double quote, and thus would become the 'rdquo' named entity, but the HTML spec says that the numeric reference for that character is 8221 (or x201D). Your best bet will probably be to get a table of the Windows-1252 code points between 128 and 159 (which are the most commonly problematic ones, and one such table is here) and write your own script to translate them to their standard equivalents.
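
For anyone who would rather not type the table by hand, here is a rough sketch of that translation in Python rather than Perl. It assumes the files only use decimal references such as &#148; (no hex ones), and it leans on Python's built-in cp1252 codec in place of a hand-written table. The name fix_cp1252_refs is made up for illustration.

```python
import re

# Matches decimal numeric character references such as &#148;
NUMERIC_REF = re.compile(r"&#(\d+);")


def fix_cp1252_refs(html):
    """Rewrite &#128;..&#159; as the Unicode references Windows-1252 intended."""

    def repl(match):
        code = int(match.group(1))
        if 128 <= code <= 159:
            try:
                # Treat the number as a Windows-1252 byte and decode it.
                char = bytes([code]).decode("cp1252")
            except UnicodeDecodeError:
                # A few bytes (e.g. 129) are undefined in Windows-1252; leave them alone.
                return match.group(0)
            return "&#%d;" % ord(char)
        # Anything outside 128-159 is already a valid reference.
        return match.group(0)

    return NUMERIC_REF.sub(repl, html)


print(fix_cp1252_refs("He said &#147;hi&#148; &amp; left &#233;"))
# prints: He said &#8220;hi&#8221; &amp; left &#233;
```

Emitting numeric references (8220, 8221) rather than named ones sidesteps the question of which named entities the target doctype defines.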

<pedant>there is no such thing as XHTML 4.0</pedant>
posted by ubernostrum at 8:29 PM on April 20, 2006

If you have a list of them and if it isn't very long, it would be pretty easy to fix them using "sed" along with a shell script under UNIX.
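
The same list-driven idea, sketched in Python instead of sed so it is easy to try out: each pair behaves like one literal s/old/new/g command. The REPLACEMENTS list is a hypothetical starting point covering only the common Windows-1252 quotes and dashes, and apply_replacements is an illustrative name, not part of any tool.

```python
# Hypothetical, hand-written list; extend it from a Windows-1252 table as needed.
REPLACEMENTS = [
    ("&#145;", "&lsquo;"),  # left single quote
    ("&#146;", "&rsquo;"),  # right single quote
    ("&#147;", "&ldquo;"),  # left double quote
    ("&#148;", "&rdquo;"),  # right double quote
    ("&#150;", "&ndash;"),  # en dash
    ("&#151;", "&mdash;"),  # em dash
]


def apply_replacements(text):
    """Apply each literal old -> new substitution in order."""
    for old, new in REPLACEMENTS:
        text = text.replace(old, new)
    return text


print(apply_replacements("&#147;quoted&#148; &#150; done"))
# prints: &ldquo;quoted&rdquo; &ndash; done
```

Because every pattern includes the trailing semicolon, a short entry such as &#14; cannot accidentally match inside &#148;.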

In more complex cases you could write a "lex" script, but that's black magic. (It's also extremely powerful. The first "Jive" translator, way back when, was a lex script.)

Those are both standard UNIX utilities which have been part of every UNIX implementation I've ever used going all the way back to 1979, but I don't know if Apple included them in OSX. My guess would be "yes" for sed and "no" for lex, though, since lex is the front end for yacc, and yacc is obsolete. (Aren't UNIX names fun?)
posted by Steven C. Den Beste at 10:11 PM on April 20, 2006

@Steven C. Den Beste:

(On my Mac Mini Running 10.4.6):

ellism-4:~ mgellis$ lex --version
lex version 2.5.4

The man page for lex goes to flex, so I don't know what is up there. yacc is also installed.
posted by mge at 10:37 PM on April 20, 2006

Thanks for the help, guys, even though I posted after 5pm and all. I went with a sed script, something lightweight & easy to grasp, since I was going to use sed anyway to scrub out a lot of extraneous meta tags. When I'm done I'll run it through the sed -> perl translator to show me how else I might've done it.

Sorry about the XHTML 4.0 biz, brain fart. Playing with GNU recode all afternoon must've gotten to me.

FYI, no lex/yacc in OS X.
posted by Charlie Bucket at 10:40 PM on April 20, 2006

for future reference - flex and bison are the gnu equivalents of lex and yacc, and are probably available for osx.
posted by andrew cooke at 9:31 AM on April 21, 2006

FYI, no lex/yacc in OS X.

FWIW, Darwinports includes something listed as "byacc - Berkeley yacc".
posted by AmbroseChapel at 5:49 PM on April 21, 2006

Charlie, all 4 of them are on my MacBook Pro (running Mac OS X 10.4.6):

Kims-MBP:~ kgani$ which lex yacc flex bison
Kims-MBP:~ kgani$

You may have to install developer tools for them to be there, though.
posted by KimG at 7:38 AM on April 22, 2006
