Why does my Vietnamese text only display correctly in the US?
May 31, 2006 3:16 AM

How does a web browser decide whether it can display a numeric character reference?

I have a web page that is available in English and Vietnamese. I use numeric character references for the Vietnamese characters that fall outside the ISO-8859-1 character set (which is the document character set). On my machine at home, this seems to work fine. But I hear from users in Vietnam that it's all messed up. I know they have fonts capable of displaying the characters.

Due to database issues beyond my control, I cannot store the Vietnamese text directly in UTF-8. If I change the output document encoding to UTF-8, the characters from character references get mangled even on my machine. What gives? I thought character references were independent of a document's encoding, referring directly to HTML's ISO-10646 roots?

Is there anything I can do, given that these characters must be stored as references?
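
For reference, here is a minimal Python sketch of what I mean by storing the text as references. The sample word and the use of Python are assumptions for illustration only, not what the site actually runs:

```python
import html

# Hypothetical sample word; U+1EC7 (decimal 7879) is outside ISO-8859-1.
word = "Vi\u1ec7t"

# Replace every non-ASCII character with a numeric character reference.
stored = word.encode("ascii", "xmlcharrefreplace")
print(stored)  # b'Vi&#7879;t'

# The stored bytes are pure ASCII, so they decode to the same text
# whether the page is labelled ISO-8859-1 or UTF-8.
as_latin1 = stored.decode("iso-8859-1")
as_utf8 = stored.decode("utf-8")

# Resolving the reference gives back the original code point.
restored = html.unescape(as_utf8)
```

If the references really are plain ASCII like this, relabelling the page as UTF-8 should not change them, so the mangling may come from other bytes on the page or from some step that rewrites the references.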
posted by Nothing to Computers & Internet (5 answers total)
I tried sending the pages with an encoding of VISCII and it seems to be working, at least on one test case. But why is that necessary?
posted by Nothing at 4:13 AM on May 31, 2006

Did you use a default font that they likely have, but which might not have Vietnamese characters? You could find out what a standard display font for Vietnamese characters is, use that as your first choice, and keep whatever you have now as a fallback font.
posted by beerbajay at 4:19 AM on May 31, 2006

It would help to have a link to either the page in question, or some sample characters/entities that aren't coming through properly. At least an excerpt of a page that isn't working right. If you have trouble pasting that excerpt usably into the MeFi textarea, send it to my email address in plaintext.

Some things that I check with these problems on my own sites (though I don't use entities):

- does the encoding the server announces in its Content-Type header match the encoding declared in the page's own head section? (validator.w3.org will tell you, as will lynx -head -dump http://yoursite.mil/yourpage.xml) You should probably be declaring an encoding in your header. But I don't think you or I know what that encoding should be yet.

- I think the failure upon explicitly setting UTF-8 suggests that you're not actually sending UTF-8.

- are these entities in an 8-bit VISCIIish encoding, or UTF, or what? Are there other encodings present on the page?

- what was your reference for encoding these entities?
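
The first check above can be sketched in Python. This is only a rough illustration: the function names are made up, the meta-tag regex is deliberately simplified, and you would paste in the header value and page source yourself.

```python
import codecs
import re
from email.message import Message


def header_charset(content_type):
    """Charset named in an HTTP Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def meta_charset(html_text):
    """Charset named in the first <meta ... charset=...> tag, or None."""
    m = re.search(r"<meta[^>]*charset\s*=\s*[\"']?([\w.:-]+)",
                  html_text, re.IGNORECASE)
    return m.group(1) if m else None


def canonical(name):
    """Map aliases such as 'latin1' and 'ISO-8859-1' to one spelling."""
    if name is None:
        return None
    try:
        return codecs.lookup(name).name
    except LookupError:
        # Python has no codec for this name; compare it case-insensitively.
        return name.lower()


def charsets_agree(content_type, html_text):
    """True when header and meta name the same charset (or both name none)."""
    return canonical(header_charset(content_type)) == canonical(meta_charset(html_text))


# Hypothetical page head for illustration.
page = ('<html><head><meta http-equiv="Content-Type" '
        'content="text/html; charset=ISO-8859-1"></head><body></body></html>')
print(charsets_agree("text/html; charset=utf-8", page))   # False
print(charsets_agree("text/html; charset=latin1", page))  # True
```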
posted by xueexueg at 7:05 AM on May 31, 2006

It might be that you're using Windows XP, and they're using some other operating system (e.g. Windows 98, Linux) that doesn't come with that font installed.

It might be that your server sends the wrong HTTP header. Internet Explorer ignores that header, but any other browser will show it incorrectly if the header's wrong.
posted by Sharcho at 12:37 PM on May 31, 2006

I thought character references were independent of a document's encoding, referring directly to HTML's ISO-10646 roots?

Hmmm. I think you've got that completely the wrong way around.

This is why UTF-8 was invented, precisely because that isn't true.

Setting the page encoding is necessary, pre-Unicode, because there aren't enough entities to go around.

So &#<some number>; means one thing in Russian, another in Farsi, and something else again in Thai. Because it means different things in different encodings, you tell the browser which one to use.

Unicode doesn't have that problem -- there are enough numbers that Vietnam can keep a set of them all to itself and not have to share.
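
A quick Python sketch of the "same number, different meaning" problem. The byte 0xE0 and the three encodings are arbitrary picks for illustration:

```python
import html

raw = bytes([0xE0])  # one byte as it would appear in the page source

# The same byte decodes to a different character under each 8-bit encoding.
decoded = {enc: raw.decode(enc) for enc in ("iso-8859-1", "koi8_r", "iso8859_11")}
for enc, ch in decoded.items():
    print(enc, hex(ord(ch)))

# For contrast, a numeric character reference resolved by Python's html module.
ref_char = html.unescape("&#1070;")
print(hex(ord(ref_char)))
```

Worth noting: what varies here is the raw byte. As far as I know, the HTML 4 spec defines numeric character references in terms of the document character set (ISO 10646), so a reference like &#1070; should name the same character whatever encoding the page declares.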
posted by AmbroseChapel at 9:41 PM on May 31, 2006
