How do you deal with Chinese characters that can't be represented in 16 bits?
April 4, 2008 8:48 AM

How are people dealing with Unicode code points that need more than 16 bits? Specifically, in languages like Java, C# and C++, which assume 16-bit characters (I believe), how are you supporting GB 18030? I would suspect that the various languages' methods like substring(), charAt(), operator[], etc. can't be safely used in China. If your wstring, say, contains a Chinese string, then .size() doesn't tell you how many characters are in it, right? On a related note, what interesting Chinese characters require more than 16 bits? I'm thinking about making a short presentation for my co-workers on this subject and I'd like to have some interesting examples. (Oh, and I'm going to run any examples by my Chinese colleagues first, so don't bother trying to make me say "penis" or something in front of my co-workers :-))
posted by bonecrusher to Computers & Internet (6 answers total)
The Unicode term you're looking for is "surrogate pair" or "surrogate code point".
posted by tachikaze at 9:05 AM on April 4, 2008
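
To make the term concrete, here is a minimal Java sketch of what a surrogate pair looks like in a 16-bit string. The class name SurrogateDemo and the choice of U+20BB7 (an ideograph from CJK Extension B) are illustrative assumptions, not something from this thread; only standard java.lang calls are used.

```java
public class SurrogateDemo {
    // U+20BB7 lies outside the 16-bit range (illustrative choice of character).
    static final int CODE_POINT = 0x20BB7;

    // Builds a one-character string from the code point.
    static String supplementary() {
        return new String(Character.toChars(CODE_POINT));
    }

    public static void main(String[] args) {
        String s = supplementary();
        // length() counts UTF-16 code units, not characters.
        System.out.println(s.length());                      // 2
        // codePointCount() counts whole code points.
        System.out.println(s.codePointCount(0, s.length())); // 1
        // The two code units are the high and low surrogate.
        System.out.printf("%04X %04X%n", (int) s.charAt(0), (int) s.charAt(1)); // D842 DFB7
        System.out.println(Character.isHighSurrogate(s.charAt(0))); // true
    }
}
```

So charAt() on such a string hands back half a character, while codePointAt(0) would return the full 0x20BB7.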

Response by poster: Right, but if there are surrogate pairs in your wstring, and you choose an unfortunate length for your substr, you're in trouble, aren't you?
posted by bonecrusher at 9:15 AM on April 4, 2008
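
A rough Java sketch of that failure, plus one way around it: count in code points and let String.offsetByCodePoints() translate to char indexes. The helper substringByCodePoints is a hypothetical name made up for this example, and it only protects surrogate pairs, not combining marks.

```java
public class CodePointSubstring {
    // Hypothetical helper: substring over code points [begin, end),
    // so a surrogate pair is never cut in half.
    static String substringByCodePoints(String s, int begin, int end) {
        int from = s.offsetByCodePoints(0, begin);
        int to = s.offsetByCodePoints(from, end - begin);
        return s.substring(from, to);
    }

    public static void main(String[] args) {
        // 'a', then U+20000 (stored as two chars), then 'b'.
        String s = "a" + new String(Character.toChars(0x20000)) + "b";

        // Naive cut by char index ends on a lone high surrogate.
        String naive = s.substring(0, 2);
        System.out.println(Character.isHighSurrogate(naive.charAt(1))); // true

        // Cutting by code point keeps the pair together: 'a' plus both surrogates.
        System.out.println(substringByCodePoints(s, 0, 2).length()); // 3
    }
}
```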

Yes, but you already have that problem, since Unicode includes combining marks. If you split a string after a base character and before its combining mark, you'll get unexpected results too.

The OpenStep string classes have some methods for "give me the range of code points that make up the complete character at this point in the string", which should work for surrogate pairs as well as combining marks. Dunno if Java has an equivalent.
posted by hattifattener at 9:33 AM on April 4, 2008
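
On the Java side, java.text.BreakIterator.getCharacterInstance() looks like the closest standard-library analogue. Below is a minimal sketch, assuming its character boundaries are what you want; rangeAt is a hypothetical helper written for this example, and it walks the boundaries from the start of the string for simplicity rather than speed.

```java
import java.text.BreakIterator;
import java.util.Locale;

public class CharacterRange {
    // Hypothetical helper: returns {start, end} (end exclusive) of the
    // complete character that contains the given char index.
    static int[] rangeAt(String s, int index) {
        BreakIterator it = BreakIterator.getCharacterInstance(Locale.ROOT);
        it.setText(s);
        int start = it.first();
        // Walk boundary to boundary until the segment covers index.
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            if (index < end) {
                return new int[] { start, end };
            }
        }
        throw new IndexOutOfBoundsException("index " + index);
    }

    public static void main(String[] args) {
        // 'e' + combining acute accent, then U+20000 (a surrogate pair), then 'x'.
        String s = "e\u0301" + new String(Character.toChars(0x20000)) + "x";
        int[] r = rangeAt(s, 1); // index of the combining mark
        System.out.println(r[0] + ".." + r[1]);
    }
}
```

Asking for the range at the combining mark's index should give back the span that also covers its base letter, and the same goes for either half of the surrogate pair.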

(And for "complete character" I probably should write "grapheme").
posted by hattifattener at 9:53 AM on April 4, 2008

(no, wait, "grapheme" is a slightly higher level concept than I'm looking for. Ehhh, text is hard.)
posted by hattifattener at 9:56 AM on April 4, 2008

In most languages you can avoid these problems by using regular expressions to do string manipulation instead of methods that involve substring() and charAt(). As long as the language's regex engine supports Unicode, it will usually have a way to select individual graphemes regardless of the byte representation.

This article has a good overview of using regexes with Unicode and the various issues to watch out for.
posted by burnmp3s at 9:58 AM on April 4, 2008
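
As a sketch of the regex approach in Java: java.util.regex matches by code point, so the pattern \P{M}\p{M}* (one non-mark code point followed by any combining marks) picks out a base character together with its marks and never splits a surrogate pair. The class and method names here are made up for illustration, and the pattern is only an approximation of a grapheme; newer Java versions (9 and later, as far as I know) also accept \X for a full grapheme cluster.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GraphemeRegex {
    // One non-mark code point followed by zero or more combining marks.
    private static final Pattern BASE_PLUS_MARKS = Pattern.compile("\\P{M}\\p{M}*");

    // Hypothetical helper: splits a string into base-plus-marks pieces.
    static List<String> split(String s) {
        List<String> out = new ArrayList<>();
        Matcher m = BASE_PLUS_MARKS.matcher(s);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // 'e' + combining acute accent, then U+20000, then 'x': five chars in all.
        String s = "e\u0301" + new String(Character.toChars(0x20000)) + "x";
        System.out.println(s.length());      // 5
        System.out.println(split(s).size()); // 3
    }
}
```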
