What the Â???
September 10, 2007 5:50 PM

Why does this character --- Â --- randomly and occasionally rear its ugly head on my Wordpress blog?

The blog in question is located at http://www.mikecade.com/blog1
posted by iced_borsch to Computers & Internet (9 answers total)
 
Are you copying and pasting from another editor into the text field of your blog? Microsoft Word, in particular, is notorious for using non-standard characters, which then show up oddly when displayed in HTML.
posted by chrisamiller at 5:57 PM on September 10, 2007


Response by poster: Generally no. I tend to type right into the entry field of my blogging software...
posted by iced_borsch at 6:14 PM on September 10, 2007


What platform are you on, and what platform is hosting the web server? Sometimes if your computer and the server are set to different locales, little bugs like that can appear.

Also, are you editing it in IE?
posted by TheNewWazoo at 6:16 PM on September 10, 2007


This sort of thing often happens if you're writing in a different character encoding (or copying and pasting content in a certain encoding) and then outputting it in an environment where another encoding is used. Unfortunately a lot of software is too stupid to fix this due to shortsightedness of the creators (and it's really not that hard for software to figure out most character encoding issues). That said, I'm not seeing this on your page, and you have the right META tag for the page being in UTF-8. Possibly you're pasting in stuff from ISO-8859-1 or another common encoding..?
posted by wackybrit at 6:21 PM on September 10, 2007


Response by poster: Also, are you editing it in IE?

IE and occasionally Safari...
posted by iced_borsch at 6:26 PM on September 10, 2007


Best answer: From looking at the raw bytes of your web page I know exactly what the problem is: your page contains the byte sequence (C2A0), which is presumably intended to be the UTF-8 encoding of the "no-break space" character (U+00A0). However, your page is being interpreted by some browsers as if it were in the standard ISO 8859-1 character set, not UTF-8. In ISO 8859-1 the byte sequence (C2A0) is a Â character (U+00C2) followed by a "no-break space" character (U+00A0).
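For anyone who wants to reproduce the byte values described above, here is a minimal sketch, assuming Python 3. The variable names are only illustrative; `str.encode` and `bytes.decode` are the standard built-in methods.

```python
# U+00A0 is the no-break space discussed above.
nbsp = "\u00a0"

# Encoding it as UTF-8 produces the two bytes C2 A0.
utf8_bytes = nbsp.encode("utf-8")
print(utf8_bytes.hex())  # c2a0

# Decoding those same two bytes as ISO 8859-1 produces two characters:
# U+00C2 (the stray A-circumflex) followed by U+00A0 (the no-break space).
misread = utf8_bytes.decode("iso-8859-1")
print([f"U+{ord(ch):04X}" for ch in misread])  # ['U+00C2', 'U+00A0']
```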

Why is your page being interpreted as if it were encoded in the ISO 8859-1 character set, despite your UTF-8 Content-Type declaration? I suspect the cause is the broken HTML at the start of your page. If you examine the source of your page you'll notice that you have a <div> tag and a block of ads at the top of your page, even before your DOCTYPE declaration or <html> tag. If you remove that block of ads (or move it to a syntactically correct location), I think your problem will go away.
posted by RichardP at 6:30 PM on September 10, 2007


I had managed to write half of what was a very detailed explanation of the difference between ISO-8859-1 and UTF-8. I lost it in a freak keyboard accident and am not going to repeat it.

RichardP is nearly right. The character Â is 0xC2 in ISO-8859-1. It is also the first byte of the two-byte encoding that UTF-8 uses to express some characters. I don't see this error on your blog right now (perhaps you fixed it), but I'll try and give an explanation of how this works.

UTF-8 uses a single byte to express all characters with values between 0 and 127. Any character from ISO-8859-1 that has a value between 128 and 255 requires two bytes to encode it. Anything between 128 and 191 (inclusive) will have a first byte of 0xC2, and if you read it as ISO-8859-1 instead of UTF-8, you will see a Â.
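The ranges above can be checked with a short sketch, assuming Python 3. `utf8_lead_byte` is a hypothetical helper written for this illustration, not a library function. It also shows what the paragraph leaves out: values from 192 to 255 get a first byte of 0xC3, which reads as Ã rather than Â in ISO-8859-1.

```python
def utf8_lead_byte(code_point: int) -> int:
    """Return the first byte of the UTF-8 encoding of a code point."""
    return chr(code_point).encode("utf-8")[0]

# Values 0-127 encode to a single byte.
one_byte = all(len(chr(cp).encode("utf-8")) == 1 for cp in range(0, 128))

# Values 128-255 encode to exactly two bytes.
two_bytes = all(len(chr(cp).encode("utf-8")) == 2 for cp in range(128, 256))

# Collect the distinct first bytes seen in each half of the upper range.
lead_low = {utf8_lead_byte(cp) for cp in range(128, 192)}
lead_high = {utf8_lead_byte(cp) for cp in range(192, 256)}

print(one_byte, two_bytes)                # True True
print(sorted(hex(b) for b in lead_low))   # ['0xc2']
print(sorted(hex(b) for b in lead_high))  # ['0xc3']
```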

What does this really mean?

This is a Yen: "¥"
This is a Yen, if you read it wrong: "Â¥"

The ¥ itself still shows up correctly, because in ISO-8859-1 a yen is 0xA5, and in UTF-8 a yen is 0xC2A5: the second byte is the same. So the character looks okay but there's junk at the start.

Nonbreaking space is 0xA0 in ISO-8859-1, and when it comes up in UTF-8 it is 0xC2A0. This will look like "Â ". I used the Yen example above because it's easier to see something rather than just a space.
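The yen example can be reproduced, and undone, with a small sketch, assuming Python 3. The repair step only works when the text was misread exactly once and nothing else altered it in between, so treat it as an illustration rather than a general fix.

```python
yen = "\u00a5"  # the yen sign

# Correct UTF-8 bytes, then misread as ISO-8859-1: one character becomes two.
garbled = yen.encode("utf-8").decode("iso-8859-1")
print([hex(ord(ch)) for ch in garbled])  # ['0xc2', '0xa5']

# Reversing the misreading: turn the characters back into the original
# bytes, then decode those bytes as the UTF-8 they always were.
repaired = garbled.encode("iso-8859-1").decode("utf-8")
print(repaired == yen)  # True
```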
posted by Jerub at 6:53 PM on September 10, 2007


That's why God (or the HTML committee) gave us ampersand-semicolon encodings for special characters.
&nbsp;
is unambiguously a non-breaking space for all browsers irrespective of the doctype.
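As a rough sketch of why this works, assuming Python 3: the entity is plain ASCII, so its bytes are the same in UTF-8 and ISO-8859-1. `to_ascii_html` is a hypothetical helper for this illustration; it relies on the standard `xmlcharrefreplace` error handler, which emits numeric references such as `&#160;` rather than named ones such as `&nbsp;`, and it does not escape `&`, `<`, or `>`.

```python
import html

# The named entity resolves to U+00A0 once the HTML is parsed.
print(html.unescape("&nbsp;") == "\u00a0")  # True

def to_ascii_html(text: str) -> str:
    """Replace every non-ASCII character with a numeric character reference."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")

# A no-break space followed by a yen sign, written using only ASCII bytes.
print(to_ascii_html("100\u00a0\u00a5"))  # 100&#160;&#165;
```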
posted by Steven C. Den Beste at 8:42 PM on September 10, 2007


Best answer: If you do continue to use special characters without, as Steven C. Den Beste suggests, ampersand-encoding them, you may want to explicitly state the character encoding by way of a meta tag in the <head> of your document or template, something like:

<meta http-equiv="content-type" content="text/html; charset=UTF-8" />

Incidentally, for other developers who make use of this, I have observed an infrequently-reported (but documented!) "issue" with IE6+7 whereby this declaration is not used unless it occurs in the first 256 bytes of the page.*

*I learned this from revising a client's site on which the content-type meta tag was preceded by a <meta> tag with a huge list of "keywords" (ugh)... so IE was blind to this declaration and rendered the page using ISO-8859-1. Making the "content-type" the first of the <meta> tags solved the problem, in this case.
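One way to check a page for this situation is sketched below, assuming Python 3. `charset_declared_early` and its regular expression are hypothetical, written only for this illustration. The 256-byte default is taken from the observation above, not from a specification, and the pattern is a rough match rather than a real HTML parser.

```python
import re
from typing import Optional, Tuple

# Rough pattern for a charset declaration inside a <meta> tag (an assumption,
# not a full HTML parser).
CHARSET_RE = re.compile(
    rb"<meta[^>]*charset\s*=\s*[\"']?\s*([A-Za-z0-9_.:-]+)", re.IGNORECASE
)

def charset_declared_early(
    page: bytes, window: int = 256
) -> Optional[Tuple[str, int, bool]]:
    """Return (charset, end_offset, within_window) for the first charset
    declaration found in a <meta> tag, or None if there is none."""
    match = CHARSET_RE.search(page)
    if match is None:
        return None
    end = match.end()
    return match.group(1).decode("ascii"), end, end <= window

meta = b'<meta http-equiv="content-type" content="text/html; charset=UTF-8" />'
keywords = b'<meta name="keywords" content="' + b"keyword, " * 40 + b'" />'

early = b"<html><head>" + meta + keywords + b"</head></html>"
late = b"<html><head>" + keywords + meta + b"</head></html>"

print(charset_declared_early(early)[2])  # True
print(charset_declared_early(late)[2])   # False
```

In the `late` page the keywords tag alone is longer than 256 bytes, which mirrors the client site described above.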
posted by myrrh at 2:25 PM on September 11, 2007

