Apostrophe Displays as â€™
May 31, 2011 5:32 AM

I get several weekly newsletters by email, formatted as HTML pages. Special characters, such as (I assume) a curly apostrophe, display as junk characters. For example, the apostrophe shows as â€™, and other characters have similar problems: the three-dot ellipsis becomes â€¦, and so on.

Is there a preset in Firefox 4 or Windows 7 to translate these properly?
posted by KRS to Computers & Internet (5 answers total)
 
First, it's handy to know a bit about UTF-8. Skip this if you're already familiar.

UTF-8 uses one or more 8-bit bytes to store a single character, unlike ASCII and friends, which use only one byte per character. It is more space-efficient than its cousins (UTF-16, UTF-32) when most of the characters can be encoded as a single byte, as is the case with most English text, with the added benefit that you can still store any character under the sun should you need to. The most significant bits of each byte show its role: a byte starting with 0 is a complete single-byte character, a lead byte starting with 110, 1110 or 11110 announces a sequence of two, three or four bytes, and each continuation byte starts with 10. This is why UTF-8 decoded with the wrong encoding shows up as weird characters: each byte of a multibyte sequence gets displayed as a character of its own.
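
To make the bit patterns concrete, here is a minimal Python sketch that prints the UTF-8 bytes of a character in binary. The helper name utf8_bits is made up for this illustration.

```python
def utf8_bits(ch):
    """Return each UTF-8 byte of ch as an 8-character binary string."""
    return [format(b, "08b") for b in ch.encode("utf-8")]

# Plain ASCII letter: a single byte whose high bit is 0.
print(utf8_bits("A"))        # ['01000001']

# U+2019, the curly apostrophe: a lead byte starting with 1110,
# followed by two continuation bytes starting with 10.
print(utf8_bits("\u2019"))   # ['11100010', '10000000', '10011001']
```

The three bytes of the curly apostrophe are 0xE2 0x80 0x99, which is exactly what gets misread when the wrong encoding is assumed.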

UTF-8 is backwards-compatible with ASCII — all characters up to 127 are identical in both encodings. This at least makes English text legible if the UTF-8 is interpreted incorrectly as ASCII or ISO 8859 character sets. However, it's these incorrect interpretations that cause the odd characters to appear.
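
A rough Python sketch of that misinterpretation, assuming the wrong guess is Windows-1252 (the encoding browsers commonly substitute for ISO 8859-1). The function name misread is hypothetical.

```python
def misread(text, wrong="cp1252"):
    """Encode text as UTF-8, then decode the bytes with the wrong encoding.

    errors="replace" is used because a few byte values have no character
    assigned in Windows-1252 and would otherwise raise an exception.
    """
    return text.encode("utf-8").decode(wrong, errors="replace")

print(misread("it's"))      # ASCII survives unchanged: it's
print(misread("\u2019"))    # curly apostrophe becomes: â€™
print(misread("\u2026"))    # ellipsis becomes: â€¦
```

Each character of junk corresponds to one byte of the original three-byte sequence.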

Unfortunately, PHP's core string handling functions don't support UTF-8 natively (the native Unicode support planned for version 6 was never released), but that doesn't mean you can't work with it -- you just have to be a bit careful, for example by using the mbstring extension's mb_strlen() where character counts matter. Let's take strlen() for example: with plain ASCII text, strlen() returns the number of characters in a string. It does this by counting the number of bytes used to hold the data. It doesn't know about (and cannot detect) UTF-8 and will blindly count the number of bytes, not the actual number of characters. Hence, the presence of any multibyte characters in your string will give you an incorrect length.
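
The same byte-versus-character gap can be shown in Python, used here only as an illustration and not as PHP code. Python's len() on a str counts characters, while the length of the encoded bytes is what a byte-counting function like strlen() would report. The helper byte_length is a made-up name.

```python
def byte_length(text):
    """Number of bytes the text occupies when stored as UTF-8."""
    return len(text.encode("utf-8"))

word = "Don\u2019t"           # "Don't" with a curly apostrophe

print(len(word))              # 5 characters
print(byte_length(word))      # 7 bytes: the apostrophe alone takes 3
```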

A problem you will inevitably face is when a user takes advantage of another application to create some text which gets pasted into your HTML form and submitted. Microsoft Word, for example, uses Unicode internally and converts characters like quotes and dashes into "smart quotes" and em- and en-dashes automatically. These are typographically correct, but the symbols lie outside the ASCII character set so when copied and pasted, the text is sent as UTF-8 and you end up with multibyte characters all over the place. If you store this text and later send it back to a browser without informing it that you are sending UTF-8, extra characters will appear.
posted by fozzie33 at 6:04 AM on May 31, 2011 [2 favorites]


It may or may not be fixable. You mention Firefox, so are we to assume this is webmail, and not a standalone MUA? You can try forcing the encoding to UTF-8, but there's a chance that will not work due to the nature of webmail. Unfortunately, email is a wasteland of encoding horrors. For example, if the sender uses UTF-8 but labels it as iso8859-1, and your webmail provider presents the page in UTF-8, then they're going to translate 8859-1 into UTF-8. That 'sets' the problem, because now it's technically not an encoding error any more: the page is faithfully encoding mojibake in UTF-8. The problem is only correctable by changing the encoding setting if the advertised page encoding does not match the content, i.e. that conversion has not happened yet.
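
Once the mojibake has been 'set' like this, no browser setting will undo it, but the conversion can sometimes be reversed in code. A hedged Python sketch, assuming the wrong decoding step used Windows-1252 and that you have the text available outside the webmail page; the function name repair is hypothetical.

```python
def repair(text):
    """Try to undo one round of UTF-8 bytes misread as Windows-1252.

    If the text cannot be round-tripped (for example because it was
    never damaged in the first place), return it unchanged.
    """
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text

damaged = "Don\u00e2\u20ac\u2122t"   # displays as Donâ€™t
print(repair(damaged))                # Don’t
print(repair("plain text"))           # unchanged: plain text
```

This is a heuristic, not a guarantee: text damaged by a different wrong encoding, or damaged more than once, will not come back with a single pass.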
posted by Rhomboid at 6:27 AM on May 31, 2011


If your browser isn't picking up the correct encoding, set it to UTF-8 yourself: in Firefox, use View/Character encoding; in IE, right-click/encoding.

That should sort out the body content, but if, as Rhomboid says, it's happening because the encoding of the mail doesn't match the webmail's encoding, you may find that it does strange things to the rest of the page.
posted by monkey closet at 6:30 AM on May 31, 2011


Response by poster: Thanks to all. Rhomboid, you're right that this is webmail.

I went into the View/Character encoding menu and it was already set to UTF-8. I can live with it.
posted by KRS at 12:57 PM on June 2, 2011


If you set up a standalone program like Thunderbird or Outlook/Outlook Express with POP/IMAP access you should be able to bypass the re-encoding that the webmail provider is doing for display and force it to the correct one. Also, if the person sending these emails really is sending out UTF-8 but labeling it as iso8859-1 in the MIME headers, then you should email them and tell them they're doing it wrong.
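
For anyone who wants to check whether a message really is mislabeled, here is a small sketch using Python's standard email package. The message below is invented for the example: its header claims iso-8859-1 while the body bytes are actually UTF-8.

```python
from email import message_from_bytes

# A made-up message: the header says iso-8859-1, the body is UTF-8.
raw = (
    b"MIME-Version: 1.0\r\n"
    b"Content-Type: text/html; charset=iso-8859-1\r\n"
    b"Content-Transfer-Encoding: 8bit\r\n"
    b"\r\n"
    + "<p>Don\u2019t</p>".encode("utf-8")
)

msg = message_from_bytes(raw)
declared = msg.get_content_charset()    # charset named in the header
body = msg.get_payload(decode=True)     # raw body bytes

print(declared)                 # iso-8859-1
print(body.decode(declared))    # garbled: trusts the wrong label
print(body.decode("utf-8"))     # correct: ignores the label
```

If decoding with the declared charset gives junk and decoding as UTF-8 gives clean text, the sender's label is the thing that is wrong.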
posted by Rhomboid at 1:38 PM on June 2, 2011

