

Accented characters in HTML.
April 9, 2004 12:19 PM

I have a Unicode (UTF-8) xhtml page whose content includes various accented French characters (which I copied-and-pasted from another document). I need to change that page’s encoding to Western European (ISO-8859-1). Changing the declared encoding in the meta tag isn’t hard – but I don’t know how to convert the actual characters in the source code.
posted by kickingtheground to Computers & Internet (13 answers total)
 
Here.
posted by gleuschk at 12:37 PM on April 9, 2004


HomeSite has a nifty "Replace special character" feature that does it for you.
posted by heather at 12:37 PM on April 9, 2004


Or just use this one.
posted by rafter at 12:38 PM on April 9, 2004


Assuming you actually want to convert the file encoding, rather than just inserting HTML entities, try Recode, a free, open-source command-line tool that converts between character sets. You'll want something along the lines of

recode UTF-8..ISO-8859-1 mydocument.xhtml

I think.
posted by chrismear at 12:46 PM on April 9, 2004
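
For anyone without Recode to hand, the same byte-level conversion can be sketched in Python. This is only an illustration of the idea, not what Recode does internally; the function names `utf8_to_latin1` and `convert_file` are made up for this example. It deliberately fails on any character that ISO-8859-1 cannot represent rather than silently dropping it.

```python
def utf8_to_latin1(data: bytes) -> bytes:
    """Re-encode UTF-8 bytes as ISO-8859-1 (hypothetical helper name)."""
    text = data.decode("utf-8")
    # The default error mode is "strict": a character outside ISO-8859-1
    # (the euro sign, for instance) raises UnicodeEncodeError.
    return text.encode("iso-8859-1")


def convert_file(src_path: str, dst_path: str) -> None:
    """Read a UTF-8 file and write an ISO-8859-1 copy (hypothetical helper name)."""
    with open(src_path, "rb") as src:
        converted = utf8_to_latin1(src.read())
    with open(dst_path, "wb") as dst:
        dst.write(converted)
```

Something like `convert_file("mydocument.xhtml", "mydocument-latin1.xhtml")` would then produce the converted copy; the declared encoding in the meta tag still has to be changed by hand.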


Inserting the HTML entities is a lot wiser though, unless you're certain all the accented characters you're using are in ISO 8859-1.
posted by fvw at 2:33 PM on April 9, 2004


Sure, if you're deploying to unknown, potentially old OSes and web browsers, HTML entities are certainly less likely to crash and burn if the system doesn't understand those characters.

But of course, turning all the accented characters into HTML entities makes it a complete pain in the arse to edit the document. And, in fact, for reasonably modern web browsers and OS's (and all XML software that claims to comply with the standard), keeping it in UTF-8 is the standard solution, as this is the default character encoding for XML.

The wisest course of action depends a lot on what environment and application you're working on. kickingtheground specifically asked for a way to convert UTF-8 encoding into ISO-8859-1 encoding, so I'm guessing that he/she knows what he's/she's doing.
posted by chrismear at 2:57 PM on April 9, 2004


Actually, we wanted to change the HTML page's encoding to ISO-8859-1, which includes inserting entities, just dropping the characters entirely and not changing to content at all but just accepting the changed meaning (I don't think there's such a thing as an invalid ISO-8859-1 character, as opposed to UTF-8).

</PedantryFilter>
posted by fvw at 3:15 PM on April 9, 2004


chrismear: Recode looks like exactly what I'm looking for. Thanks.
posted by kickingtheground at 3:40 PM on April 9, 2004


Actually, there are about 32 invalid ISO-8859-x characters. They're used by Windows systems (Codepage Somethingorother) although Windows will tell you that it's ISO-8859-1. Lies, lies.
posted by hattifattener at 5:53 PM on April 9, 2004


Oh yes, so there are, 128-159; how odd. Thanks hatti.
posted by fvw at 7:20 PM on April 9, 2004
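
The difference in the 128-159 range can be seen with a small Python sketch. It assumes the Windows codepage being referred to is Windows-1252 (Python's `cp1252` codec), which the thread does not name; the function name is made up for this example. Python's `iso-8859-1` codec maps those bytes to the C1 control code points, while `cp1252` maps most of them to printable characters such as curly quotes.

```python
def latin1_and_cp1252_views(byte_value: int) -> tuple[str, str]:
    """Decode a single byte under both encodings (hypothetical helper name)."""
    raw = bytes([byte_value])
    as_latin1 = raw.decode("iso-8859-1")
    # A few bytes in this range are undefined in cp1252, so use "replace"
    # to get U+FFFD for those instead of an exception.
    as_cp1252 = raw.decode("cp1252", errors="replace")
    return as_latin1, as_cp1252


# 0x93 is a C1 control in ISO-8859-1 but a left curly quote in cp1252;
# 0xE9 is "é" in both.
for value in (0x93, 0xE9):
    latin1_char, cp1252_char = latin1_and_cp1252_views(value)
    print(hex(value), repr(latin1_char), repr(cp1252_char))
```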


Okay, now I'm intrigued. I thought I understood this stuff. But I really didn't get your 'PedantryFilter' post, fvw.

"We wanted to change the HTML page's encoding to ISO-8859-1..." Agreed.

"...which includes inserting entities..." Why?

"...just dropping the characters entirely and not changing to content at all but just accepting the changed meaning..." Wuh? Was it supposed to be "not changing the content at all"?

Are you saying that we ought to change the <meta ... charset='XXX'> declaration, and not change the actual encoding of the file, but just replace out-of-range characters by their appropriate Unicode character entities?
posted by chrismear at 3:10 AM on April 10, 2004


Nope, I was saying that "changing the page's encoding to ISO-8859-1" could be done in a lot of ways. Any action that results in the page being ISO-8859-1 would have "changed the page's encoding to ISO-8859-1". Hence I argued that both plain converting each character from UTF-8 to ISO-8859-1 and replacing all characters in the file (that aren't already ISO-8859-1 with the same meaning) with HTML named or numbered character entities would be correct solutions to kickingtheground's question.
posted by fvw at 12:53 PM on April 10, 2004
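
The entity-based route described here can be sketched in Python with the standard `xmlcharrefreplace` error handler, which emits numbered (not named) character references. The function names are made up for this example. The first variant keeps everything ISO-8859-1 can hold as a raw byte and only escapes the rest; the second escapes every non-ASCII character, which is closer to the "replace special characters" tools mentioned earlier in the thread.

```python
def to_latin1_with_entities(data: bytes) -> bytes:
    """UTF-8 in, ISO-8859-1 out; unrepresentable characters become &#NNNN;."""
    text = data.decode("utf-8")
    return text.encode("iso-8859-1", errors="xmlcharrefreplace")


def to_ascii_with_entities(data: bytes) -> bytes:
    """UTF-8 in, pure ASCII out; every non-ASCII character becomes &#NNNN;."""
    text = data.decode("utf-8")
    return text.encode("ascii", errors="xmlcharrefreplace")


sample = "café €".encode("utf-8")
print(to_latin1_with_entities(sample))  # é stays a single byte, € is escaped
print(to_ascii_with_entities(sample))   # both é and € are escaped
```

Either output is a valid ISO-8859-1 document body, since pure ASCII is a subset of ISO-8859-1.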


Ahhhhh, I'm with you now. Thanks for clearing that up for my sleep-addled mind. ;)
posted by chrismear at 2:38 PM on April 10, 2004

