HTML Question mark bullets
February 4, 2010 7:34 PM
Why does my HTML page render with small diamond-shaped question-mark bullet characters replacing many spaces in the text? They're like a plague of rats in my page.
Those probably aren't spaces, they're some exotic character that just happens to be invisible in your usual viewing method. Browser? Platform? Link to offending page?
posted by rokusan at 7:37 PM on February 4, 2010
My guess is that you used the abomination Microsoft Word at some point in your workflow. It'll look just perfect using standard Windows text fields and editors... until it comes time to render it with a browser.
I can't count how many localization translations I've had ruined because somebody ignored my express instructions not to use MS Word at any point in their translation process.
posted by Netzapper at 7:50 PM on February 4, 2010
This will happen sometimes when you have the wrong character encoding. In Firefox 3.0 you can adjust the character encoding with View > Character Encoding. Common character encodings are ISO-8859-1 and UTF-8. But you should really provide browser, platform, and URL if you want more help with this.
posted by grouse at 7:51 PM on February 4, 2010
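As a rough illustration of grouse's point (this sketch is not from the thread, and the sample word is made up), the same text produces different bytes in ISO-8859-1 and UTF-8, and reading one as the other garbles it:

```python
# Sample text with one non-ASCII character (U+00E9, e with acute accent).
text = "caf\u00e9"

latin1_bytes = text.encode("iso-8859-1")  # one byte per character
utf8_bytes = text.encode("utf-8")         # the accented letter takes two bytes

# UTF-8 bytes read as ISO-8859-1: every byte maps to some character,
# so the accented letter shows up as two wrong characters.
misread_as_latin1 = utf8_bytes.decode("iso-8859-1")

# ISO-8859-1 bytes read as UTF-8: the lone 0xE9 byte is invalid, so it is
# swapped for U+FFFD, the replacement character.
misread_as_utf8 = latin1_bytes.decode("utf-8", errors="replace")

print(latin1_bytes, utf8_bytes)
print(repr(misread_as_latin1), repr(misread_as_utf8))
```

Switching View > Character Encoding in the browser amounts to choosing which of these decodings gets applied to the same bytes.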
That's the "Unicode replacement character", which is rendered when a page is sent with headers or meta tags that tell the browser it is encoded in UTF-8, but the actual page data includes bytes that are not legal in UTF-8. Since ASCII data is all legal UTF-8, the culprit is usually the Windows CP1252 encoding, which uses values like 0x92 for the curly apostrophe or 0x96 for the en dash.
You can add a meta tag to declare the Windows encoding to the HEAD element of the page, like this:
<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
but that may not work for users who aren't using Windows. You can recode your page into UTF-8 like this:
perl -MEncode -ne 'print Encode::encode("utf-8", Encode::decode("cp1252", $_))' <> recoded.html
posted by nicwolff at 8:29 PM on February 4, 2010 [1 favorite]
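To make nicwolff's explanation concrete, here is a minimal Python sketch (not from the thread; the sample bytes are invented) of how a CP1252 byte such as 0x92 becomes the replacement character when the data is treated as UTF-8:

```python
# Bytes as Word might save "it's" with a curly apostrophe in Windows CP1252.
raw = b"it\x92s"

# Decoding with the encoding the bytes were written in recovers the text.
as_cp1252 = raw.decode("cp1252")

# Decoding as UTF-8 cannot interpret 0x92; errors="replace" substitutes
# U+FFFD, which browsers draw as a diamond with a question mark.
as_utf8 = raw.decode("utf-8", errors="replace")

print(repr(as_cp1252))
print(repr(as_utf8))
```

The bytes never change between the two calls. Only the encoding the reader assumes is different.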
Whoops, forgot to escape the "<". And, the Encode module exports those functions by default, so you can just do:
perl -MEncode -ne 'print encode("utf-8", decode("cp1252", $_))' < original.html > recoded.html
posted by nicwolff at 8:34 PM on February 4, 2010 [1 favorite]
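For anyone without Perl handy, the same recoding can be sketched in Python. This is only an approximate equivalent of the one-liner above, not something posted in the thread. The function names are made up, and the file names are the ones from nicwolff's command.

```python
from pathlib import Path


def recode_cp1252_to_utf8(data: bytes) -> bytes:
    """Interpret the bytes as CP1252 text and re-encode that text as UTF-8."""
    return data.decode("cp1252").encode("utf-8")


def recode_file(src: str, dst: str) -> None:
    """Read src as raw bytes, recode it, and write the result to dst."""
    Path(dst).write_bytes(recode_cp1252_to_utf8(Path(src).read_bytes()))


# Example call, assuming original.html exists in the current directory:
# recode_file("original.html", "recoded.html")
```

One caveat: Python's cp1252 codec raises UnicodeDecodeError on the few byte values that CP1252 leaves undefined (0x81, for example), so a file containing those would need separate handling.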
(If you are on a Windows computer without Perl installed, you can get it here.)
posted by nicwolff at 9:12 PM on February 4, 2010
nicwolff is on the right track. This has very little to do with MS Word, and everything to do with a mismatch in character encodings. That is, the browser thinks you're giving it one character encoding (possibly because of a meta-tag or server configuration), whereas your source file is saved using a different encoding.
More details on the authoring workflow would help us pinpoint the problem.
posted by Jacen Solo at 10:18 PM on February 4, 2010
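One way to locate such a mismatch, sketched here in Python as an illustration rather than anything suggested in the thread, is to try decoding the file with the encoding the browser is told to use and report where that first fails. The function name is hypothetical.

```python
from typing import Optional


def first_invalid_offset(data: bytes, declared: str = "utf-8") -> Optional[int]:
    """Return the byte offset of the first sequence that is not valid in the
    declared encoding, or None if the data decodes cleanly."""
    try:
        data.decode(declared)
    except UnicodeDecodeError as exc:
        return exc.start
    return None


print(first_invalid_offset(b"it\x92s"))       # CP1252 apostrophe, declared as UTF-8
print(first_invalid_offset(b"plain ascii"))   # ASCII is valid UTF-8
```

Opening the file at the reported offset in a hex-capable editor should show which character the authoring tool saved in the other encoding.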
Easiest way to get rid of that is just take the text and paste it into a flat file editor (notepad or any sensible code editor). Some editors (TextPad for instance) have an option for plain text only. Copy it back out and use the new text instead. Next time don't use Word.
posted by sophist at 11:13 PM on February 4, 2010
nicwolff is on the right track. This has very little to do with MS Word, and everything to do with a mismatch in character encodings. That is, the browser thinks you're giving it one character encoding (possibly because of a meta-tag or server configuration), whereas your source file is saved using a different encoding.
Well, yes and no. It's an encoding mis-match, but on top of that the Windows CP1252 encoding assumes the presence of a character set that is not universally installed except on Microsoft Windows, so merely sending the right header will just show many of your site's visitors "?" characters in place of the replacement characters.
The right thing to do is convert the text into either ISO-8859-1 (which is the default encoding for "text/html") or, better, UTF-8.
Easiest way to get rid of that is just take the text and paste it into a flat file editor (notepad or any sensible code editor). Some editors (TextPad for instance) have an option for plain text only. Copy it back out and use the new text instead. Next time don't use Word.
Not using Word is always good advice, but he doesn't have to strip out all his non-ASCII characters, some of which may be important — just save them as either ISO-8859-1 or UTF-8 and send the right headers.
smick, if you are using Microsoft Word, I'm sure there's an option to select the UTF-8 encoding for "Save as Web page" — look under "Web Options" or in Word's preferences. Try that first, and if the page still looks weird, try adding a line to the page header like
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
posted by nicwolff at 1:37 AM on February 5, 2010
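With a hundred pages to fix, it may help to check each file against its own meta tag after re-saving. The following Python sketch is an assumption-laden illustration, not part of the thread: the helper names are invented, and the regular expression is a simplification that only recognizes a charset value inside a meta tag.

```python
import re
from typing import Optional

# Simplified pattern: finds charset=... inside a <meta ...> tag.
META_CHARSET = re.compile(rb"<meta[^>]*charset=[\"']?([A-Za-z0-9_-]+)", re.IGNORECASE)


def declared_charset(html: bytes) -> Optional[str]:
    """Return the charset named in the page's meta tag, if there is one."""
    match = META_CHARSET.search(html)
    return match.group(1).decode("ascii").lower() if match else None


def matches_declaration(html: bytes) -> bool:
    """True if the page bytes decode cleanly in the charset the page declares."""
    charset = declared_charset(html)
    if charset is None:
        return False
    try:
        html.decode(charset)
    except (UnicodeDecodeError, LookupError):
        # LookupError covers charset names Python does not recognize.
        return False
    return True
```

This check is only meaningful for UTF-8 declarations. Every byte sequence decodes without error as ISO-8859-1, so a page declared that way always passes even when the bytes are really CP1252.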
Response by poster: Wow - lots of great feedback. Thanks! The problem happens in IE and Firefox, and I'm running Windows XP. I can't link you to the pages - it's a hundred page user guide behind a firewall. MS Word was, indeed, an early part of the workflow. I recall running a cleaner on all the files at one point that was supposed to remove a lot of the crappy Word markup, but there's still plenty in there. I'm using Dreamweaver CS3 to edit the files now.
The character I'm seeing is, as nicwolff links to above, the 'Unicode replacement character'.
I'm ready to try your suggestions and I'll report back what I find. Thanks again y'all!
posted by smick at 5:09 PM on February 5, 2010
This thread is closed to new comments.
Look at your HTML source and replace every single quote, double quote, and hyphen with a standard character.
posted by fremen at 7:37 PM on February 4, 2010
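If you do go the route fremen describes, the replacement can be scripted rather than done by hand. A minimal Python sketch, assuming the text has already been decoded correctly (from CP1252, say); the mapping and function name are illustrative and cover only the common Word punctuation:

```python
# Map typographic punctuation to plain ASCII stand-ins.
ASCII_EQUIVALENTS = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark (curly apostrophe)
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
    "\u2013": "-",   # en dash
    "\u2014": "-",   # em dash
})


def to_plain_punctuation(text: str) -> str:
    """Replace curly quotes and dashes with their ASCII equivalents."""
    return text.translate(ASCII_EQUIVALENTS)


print(to_plain_punctuation("\u201cit\u2019s\u201d \u2013 ok"))
```

This throws away typographic detail, which is the trade-off nicwolff points out: recoding to UTF-8 keeps the original characters instead.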