HTML Question mark bullets
February 4, 2010 7:34 PM

Why does my HTML page render with small diamond-shaped question-mark bullet characters replacing many spaces in the text? They're like a plague of rats in my page.
posted by smick to Computers & Internet (11 answers total) 4 users marked this as a favorite

Did you copy and paste from Word? Many times, Word will change standard characters like a single quote or hyphen into more typographically friendly versions. Unfortunately, those don't render well in HTML.

Look at your HTML source and replace every single quote, double quote, and hyphen with a standard character.
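
If there are too many to fix by hand, the replacement could be scripted. Here is a minimal Python sketch, assuming the text has already been decoded correctly into a string. The mapping table and the function name `asciify_punctuation` are illustrative only and not exhaustive. The no-break space is included as a guess, since the question mentions spaces being affected.

```python
# Map common "smart" punctuation to plain ASCII equivalents.
# This table is an illustrative assumption, not a complete list.
SMART_TO_ASCII = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark / apostrophe
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
    "\u2013": "-",   # en dash
    "\u2014": "-",   # em dash
    "\u00a0": " ",   # no-break space
})


def asciify_punctuation(text: str) -> str:
    """Return text with typographic punctuation replaced by ASCII."""
    return text.translate(SMART_TO_ASCII)


print(asciify_punctuation("It\u2019s a \u201ctest\u201d \u2013 really"))
```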
posted by fremen at 7:37 PM on February 4, 2010

Those probably aren't spaces, they're some exotic character that just happens to be invisible in your usual viewing method. Browser? Platform? Link to offending page?
posted by rokusan at 7:37 PM on February 4, 2010

My guess is that you used the abomination Microsoft Word at some point in your workflow. It'll look just perfect using standard Windows text fields and editors... until it comes time to render it with a browser.

I can't count how many localization translations I've had ruined because somebody ignored my express instructions not to use MS Word at any point in their translation process.
posted by Netzapper at 7:50 PM on February 4, 2010

This will happen sometimes when you have the wrong character encoding. In Firefox 3.0 you can adjust the character encoding with View > Character Encoding. Common character encodings are ISO-8859-1 and UTF-8. But you should really provide browser, platform, and URL if you want more help with this.
posted by grouse at 7:51 PM on February 4, 2010

That's the "Unicode replacement character", which is rendered when a page is sent with headers or meta tags that tell the browser it is encoded in UTF-8, but the actual page data includes bytes that are not legal in UTF-8 encoded data. Since ASCII data is all legal UTF-8, the culprit is usually the Windows CP1252 encoding, which uses values like 0x92 for the apostrophe or 0x96 for the dash.
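
To see the effect outside a browser, here is a small Python sketch. It assumes the page bytes came from CP1252 text. The sample string is made up, and `errors="replace"` only approximates what a browser does with invalid bytes.

```python
# Bytes as a CP1252 editor would store a curly apostrophe and an en dash.
raw = "It\u2019s \u2013 fine".encode("cp1252")
assert raw == b"It\x92s \x96 fine"

# Decoding those bytes as UTF-8: 0x92 and 0x96 are not valid on their own,
# so each one is replaced with U+FFFD, the replacement character.
as_utf8 = raw.decode("utf-8", errors="replace")
print(ascii(as_utf8))

# Decoding with the encoding the bytes were actually written in works.
as_cp1252 = raw.decode("cp1252")
print(ascii(as_cp1252))
```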

You can add a meta tag to declare the Windows encoding to the HEAD element of the page, like this:

<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">

but that may not work for users who aren't using Windows. You can recode your page into UTF-8 like this:

perl -MEncode  -ne 'print Encode::encode("utf-8", Encode::decode("cp1252", $_))' <> recoded.html

posted by nicwolff at 8:29 PM on February 4, 2010 [1 favorite]

Whoops, forgot to escape the "<". And, the Encode module exports those functions by default, so you can just do:

perl -MEncode -ne 'print encode("utf-8", decode("cp1252", $_))' < original.html > recoded.html
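
If Python is handier than Perl, roughly the same recoding can be sketched as below. This assumes the source file really is CP1252. The function name `recode` is made up for illustration, and the file names are the ones from the command above.

```python
import sys
from pathlib import Path


def recode(src: str, dst: str, source_encoding: str = "cp1252") -> None:
    """Read src as source_encoding and write the same text to dst as UTF-8."""
    # Work on raw bytes so that line endings pass through untouched.
    text = Path(src).read_bytes().decode(source_encoding)
    Path(dst).write_bytes(text.encode("utf-8"))


if __name__ == "__main__" and len(sys.argv) == 3:
    # Example: python recode.py original.html recoded.html
    recode(sys.argv[1], sys.argv[2])
```

Python's cp1252 codec raises an error on the few byte values that CP1252 leaves undefined (0x81, for example), so a failure here is a hint that the file is not really CP1252.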

posted by nicwolff at 8:34 PM on February 4, 2010 [1 favorite]

(If you are on a Windows computer without Perl installed, you can get it here.)
posted by nicwolff at 9:12 PM on February 4, 2010

Easiest way to get rid of that is just take the text and paste it into a flat file editor (notepad or any sensible code editor). Some editors (TextPad for instance) have an option for plain text only. Copy it back out and use the new text instead. Next time don't use Word.
posted by sophist at 11:13 PM on February 4, 2010

nicwolff is on the right track. This has very little to do with MS Word, and everything to do with a mismatch in character encodings. That is, the browser thinks you're giving it one character encoding (possibly because of a meta-tag or server configuration), whereas your source file is saved using a different encoding.

Well, yes and no. It's an encoding mismatch, but on top of that the Windows CP1252 encoding assumes the presence of a character set that is not universally installed except on Microsoft Windows, so merely sending the right header will just show many of your site's visitors "?" characters in place of the replacement characters.

The right thing to do is convert the text into either ISO-8859-1 (which is the default encoding for "text/html") or, better, UTF-8.

Easiest way to get rid of that is just take the text and paste it into a flat file editor (notepad or any sensible code editor). Some editors (TextPad for instance) have an option for plain text only. Copy it back out and use the new text instead. Next time don't use Word.

Not using Word is always good advice, but he doesn't have to strip out all his non-ASCII characters, some of which may be important — just save them as either ISO-8859-1 or UTF-8 and send the right headers.
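
One caveat, sketched in Python under the assumption that the text contains Word-style curly quotes or dashes: those characters have no code point in ISO-8859-1, so saving as ISO-8859-1 needs a fallback such as numeric character references, while UTF-8 can hold them directly. The sample string is made up.

```python
text = "It\u2019s \u2013 fine"  # curly apostrophe and en dash

# UTF-8 can represent every Unicode character, so this always succeeds.
utf8_bytes = text.encode("utf-8")

# ISO-8859-1 has no curly apostrophe or en dash, so a strict encode fails.
try:
    text.encode("iso-8859-1")
    fits_latin1 = True
except UnicodeEncodeError:
    fits_latin1 = False
print(fits_latin1)

# One possible fallback for ISO-8859-1 output: numeric character references,
# which an HTML parser turns back into the original characters.
latin1_bytes = text.encode("iso-8859-1", errors="xmlcharrefreplace")
print(latin1_bytes)
```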

smick, if you are using Microsoft Word, I'm sure there's an option to select the UTF-8 encoding for "Save as Web page" — look under "Web Options" or in Word's preferences. Try that first, and if the page still looks weird, try adding a line to the page header like

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

posted by nicwolff at 1:37 AM on February 5, 2010

Wow - lots of great feedback. Thanks! The problem happens in IE and Firefox, and I'm running Windows XP. I can't link you to the pages - it's a hundred page user guide behind a firewall. MS Word was, indeed, an early part of the workflow. I recall running a cleaner on all the files at one point that was supposed to remove a lot of the crappy Word markup, but there's still plenty in there. I'm using Dreamweaver CS3 to edit the files now.

The character I'm seeing is, as nicwolff links to above, the 'Unicode replacement character'.

I'm ready to try your suggestions and I'll report back what I find. Thanks again y'all!
posted by smick at 5:09 PM on February 5, 2010
