Which special character codes?
April 12, 2011 3:06 AM

What code standard should I be using for special characters in html?

The website I work on seems to use a mish-mash of coding standards for displaying special characters. For example, we are currently using both &#174; and &reg; for the registered trade mark symbol. I'm fairly sure the former is ASCII as I've picked up a lot of these codes over the years, but I'm not sure of the terminology for the other (I've heard people use Unicode, HTML and ISO all to refer to what looks, to me, like the same thing).

Is one set of codes generally held up to be best practice in web design? Does it matter that we are using different sets? And does it make any difference that we promise backward compatibility (including with IE6)?
posted by londonmark to Computers & Internet (4 answers total) 7 users marked this as a favorite
 
Best answer: This is a complicated topic. The main thing you need to know is that you can choose any encoding that you like, but whatever you choose you have to inform the browser of that choice. You do that either through the HTTP Content-Type header or with a META tag in the HEAD element.
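As a rough illustration of declaring the encoding in both places, here is a minimal Python sketch. The function name build_page and the sample text are made up for this example; the point is only that the header, the META tag and the actual bytes all agree on one charset.

```python
from typing import Dict, Tuple


def build_page(body_text: str, charset: str = "utf-8") -> Tuple[Dict[str, str], bytes]:
    """Return (headers, body bytes) that agree on a single encoding.

    Hypothetical helper for illustration; body_text is assumed to be
    already HTML-escaped.
    """
    # The HTTP header advertises the encoding...
    headers = {"Content-Type": f"text/html; charset={charset}"}
    # ...and the META tag in the HEAD element repeats the same claim.
    markup = (
        "<html><head>"
        f'<meta http-equiv="Content-Type" content="text/html; charset={charset}">'
        "<title>Demo</title></head>"
        f"<body><p>{body_text}</p></body></html>"
    )
    # Encode the bytes with the very charset that was advertised.
    return headers, markup.encode(charset)


headers, body = build_page("Acme\u00ae widgets")
```

If the charset argument changes, all three places change together, which is the property that matters.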

These are examples of encodings:
  • UTF-8
  • Code Page 1252 aka cp1252 aka Windows-1252
  • latin-1 aka ISO-8859-1 aka "Western"
  • GB2312
  • Shift JIS
An encoding is a set of rules that tell you how to map character numbers to sequences of bytes. Because these different mappings exist, it is crucial that the sequences of bytes match the advertised encoding, otherwise you get mojibake.
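To make that concrete, here is a small Python sketch (Python is just a convenient way to poke at bytes; nothing here is specific to your site). It encodes the registered trade mark symbol under three encodings and then decodes UTF-8 bytes with the wrong encoding to produce mojibake.

```python
text = "\u00ae"  # the registered trade mark symbol, code point 174

# The same character becomes different byte sequences per encoding.
utf8_bytes = text.encode("utf-8")      # two bytes: 0xC2 0xAE
latin1_bytes = text.encode("latin-1")  # one byte: 0xAE
cp1252_bytes = text.encode("cp1252")   # one byte: 0xAE

# Decoding with the advertised (correct) encoding round-trips cleanly.
round_trip = utf8_bytes.decode("utf-8")

# Decoding UTF-8 bytes as cp1252 gives mojibake: each byte is
# interpreted as its own character.
garbled = utf8_bytes.decode("cp1252")

print(utf8_bytes, latin1_bytes, cp1252_bytes, repr(garbled))
```

The stray "Â" in front of the symbol is the classic sign of UTF-8 bytes being read as cp1252 or latin-1.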

Unicode is a broad term that describes many concepts, one of which is a number of different encodings (UTF-8, UTF-16, etc.). Unicode was invented as a way to describe every possible grapheme for every language, and Unicode encodings allow this -- they are universal. Encodings like cp1252 and latin-1 are not Unicode and only let you display a certain subset of characters.

When you talk about things like &#174; or &reg;, those are not encodings; they are HTML entities, either numeric or named. They are not ASCII. The HTML entity &#0032; means the character at Unicode code point 32 in decimal. The HTML entity &#x0020; means the character at Unicode code point 20 in hex, which is commonly written as U+0020. Some characters also have named entities, but not all. Wikipedia has a list.
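If it helps, Python's standard html module can be used to check what an entity refers to. This is only a quick sketch for experimenting, not something your site needs to run.

```python
import html
from html.entities import name2codepoint

# Decimal and hexadecimal numeric entities both name a code point.
space_from_decimal = html.unescape("&#0032;")  # code point 32 (decimal)
space_from_hex = html.unescape("&#x0020;")     # code point 0x20 (hex)

# A named entity is an alias for a code point.
reg_code_point = name2codepoint["reg"]

# The conventional U+XXXX notation is the code point in hex.
reg_notation = f"U+{reg_code_point:04X}"

print(repr(space_from_decimal), repr(space_from_hex), reg_code_point, reg_notation)
```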

So that means for any given character there are four ways that it can be represented in a document:
  1. As its native representation in whatever encoding the document is in
  2. As a decimal numeric HTML entity
  3. As a hexadecimal numeric HTML entity
  4. As a named HTML entity
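The four forms in the list above can be generated for any character with a short Python sketch. The function name representations is hypothetical, and the named form is only present when the standard library's HTML 4 entity table has a name for that code point.

```python
import html
from html.entities import codepoint2name


def representations(ch: str) -> dict:
    """Return the ways a single character can appear in an HTML document."""
    code_point = ord(ch)
    forms = {
        "native": ch,                    # raw character, in the document's encoding
        "decimal": f"&#{code_point};",   # decimal numeric entity
        "hex": f"&#x{code_point:X};",    # hexadecimal numeric entity
    }
    name = codepoint2name.get(code_point)
    if name is not None:                 # not every character has a name
        forms["named"] = f"&{name};"
    return forms


forms = representations("\u00ae")
# Every form resolves back to the same character once entities are decoded.
decoded = {html.unescape(value) for value in forms.values()}
```

For a character with no named entity, such as a CJK ideograph, the dictionary simply has three entries instead of four.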
The problems that people have with encoding can be summed up in the following cases:
  • When the document's encoding is not explicitly stated and the browser has to guess
  • When the actual encoding does not match what was presented in the Content-Type header or META tag
  • When the document contains a mixture of encodings
Using HTML entities is one way to avoid the encoding problem, but it can still bite you if your site accepts user data in any way, e.g. through form fields. If that is the case you have to be vigilant to properly detect what encoding the user's data is coming in as and properly convert it to the encoding that you've decided to use internally.

For example, if you run a blog that's publishing in UTF-8 and someone submits a comment on a Windows system that was copied and pasted from a Word file that did "smart quotes", you'll probably receive that comment encoded in CP1252. If you just add that to the database verbatim then when you go to render your page to users you'll be publishing a page that claims to be encoded in UTF-8 but which has some content encoded in CP1252. This is how you get mojibake.

The right way to handle this is to recognize that the user's form data came in with a different encoding and convert to UTF-8 as soon as possible. You can't just accept user data verbatim without sanitization and verification. Web browsers have many heuristics to deal with this situation of mixed encodings or mismatched encodings, so it may seem like everything is fine even when it isn't. Running your generated HTML through a validator will catch these problems, however.
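Here is one possible sketch of the "convert as soon as possible" step in Python. The helper name to_utf8 and the choice of cp1252 as the fallback are assumptions for illustration; a real site would ideally use the encoding declared by the request rather than guessing. The last lines show the other approach mentioned above, escaping everything outside ASCII as numeric entities.

```python
def to_utf8(raw: bytes, fallback: str = "cp1252") -> bytes:
    """Normalize incoming form bytes to UTF-8 (hypothetical helper).

    Heuristic: bytes that are already valid UTF-8 pass through unchanged;
    anything else is assumed to be in the fallback encoding.
    """
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Bytes cp1252 leaves undefined become U+FFFD instead of raising.
        text = raw.decode(fallback, errors="replace")
    return text.encode("utf-8")


# A comment with Word-style smart quotes, as cp1252 bytes.
incoming = b"\x93Hi\x94"
normalized = to_utf8(incoming)

# Alternative: sidestep the page encoding by emitting numeric entities
# for every character outside ASCII.
escaped = normalized.decode("utf-8").encode("ascii", "xmlcharrefreplace")
```

The guess can be wrong for short inputs that happen to be valid in both encodings, so treat this as a fallback rather than a substitute for knowing the real encoding.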

This short essay by Joel Sposky goes into a little more detail.
posted by Rhomboid at 4:46 AM on April 12, 2011 [20 favorites]


(er, Spolsky)
posted by Rhomboid at 4:46 AM on April 12, 2011


They are Character entity references; an old SGML thing.
posted by scruss at 4:48 AM on April 12, 2011


Response by poster: Thanks Rhomboid, that's amazingly helpful, if a little over my head in parts! I know exactly where I need to start now.
posted by londonmark at 5:41 AM on April 12, 2011

