<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
     xmlns:admin="http://webns.net/mvcb/"
     xmlns:content="http://purl.org/rss/1.0/modules/content/"
     xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
	<channel> 

      <title>Comments on: How do you deal with Chinese characters that can't be represented in 16 bits?</title>
      <link>http://ask.metafilter.com/87888/How-do-you-deal-with-Chinese-characters-that-cant-be-represented-in-16-bits/</link>
      <description>Comments on Ask MetaFilter post How do you deal with Chinese characters that can't be represented in 16 bits?</description>
	  	  <pubDate>Fri, 04 Apr 2008 09:05:41 -0800</pubDate>
      <lastBuildDate>Fri, 04 Apr 2008 09:05:41 -0800</lastBuildDate>
      <language>en-us</language>
	  <docs>http://blogs.law.harvard.edu/tech/rss</docs>
	  <ttl>60</ttl>

<item>
  	<title>Question: How do you deal with Chinese characters that can&apos;t be represented in 16 bits?</title>
  	<link>http://ask.metafilter.com/87888/How-do-you-deal-with-Chinese-characters-that-cant-be-represented-in-16-bits</link>	
  	<description>How are people dealing with Unicode code points that need more than 16 bits?  Specifically, in languages like Java, C# and C++, which assume 16-bit characters (I believe), how are you supporting &lt;a href=&quot;http://en.wikipedia.org/wiki/GB_18030&quot;&gt;GB 18030&lt;/a&gt;?  I suspect that methods like substring(), charAt(), operator[], etc. can&apos;t safely be used on Chinese text.  If your wstring, say, contains a Chinese string, then .size() doesn&apos;t tell you how many characters are in it, right?

On a related note, what interesting Chinese characters require more than 16 bits?  I&apos;m thinking about making a short presentation for my co-workers on this subject and I&apos;d like to have some interesting examples.

(Oh, and I&apos;m going to run any examples by my Chinese colleagues first, so don&apos;t bother trying to make me say &quot;penis&quot; or something in front of my co-workers :-))</description>
  	<guid isPermaLink="false">post:ask.metafilter.com,2008:site.87888</guid>
  	<pubDate>Fri, 04 Apr 2008 08:48:52 -0800</pubDate>
  	<dc:creator>bonecrusher</dc:creator>
	
	<category>unicode</category>
	
	<category>chinese</category>
	
	<category>c</category>
	
	<category>java</category>
	
	<category>i18n</category>
	
	<category>l10n</category>
	
</item>
<item>
  	<title>By: tachikaze</title>
  	<link>http://ask.metafilter.com/87888/How-do-you-deal-with-Chinese-characters-that-cant-be-represented-in-16-bits#1294775</link>	
  	<description>The Unicode term you&apos;re looking for is &amp;quot;surrogate pair&amp;quot; or &amp;quot;surrogate code point&amp;quot;.</description>
  	<guid isPermaLink="false">comment:ask.metafilter.com,2008:site.87888-1294775</guid>
  	<pubDate>Fri, 04 Apr 2008 09:05:41 -0800</pubDate>
  	<dc:creator>tachikaze</dc:creator>
</item>
<item>
  	<title>By: bonecrusher</title>
  	<link>http://ask.metafilter.com/87888/How-do-you-deal-with-Chinese-characters-that-cant-be-represented-in-16-bits#1294788</link>	
  	<description>Right, but if there are surrogate pairs in your wstring, and you choose an unfortunate length for your substr, you&apos;re in trouble, aren&apos;t you?</description>
  	<guid isPermaLink="false">comment:ask.metafilter.com,2008:site.87888-1294788</guid>
  	<pubDate>Fri, 04 Apr 2008 09:15:23 -0800</pubDate>
  	<dc:creator>bonecrusher</dc:creator>
</item>
<item>
  	<title>By: hattifattener</title>
  	<link>http://ask.metafilter.com/87888/How-do-you-deal-with-Chinese-characters-that-cant-be-represented-in-16-bits#1294818</link>	
  	<description>Yes, but you already have that problem, since Unicode includes combining marks. If you split a string between a base character and the combining mark that follows it, you&apos;ll get unexpected results too.&lt;br&gt;
&lt;br&gt;
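For Java specifically, here is a minimal sketch of both halves of the problem. The class and method names are made up for illustration; only codePointCount(), offsetByCodePoints() and substring() are real String methods. Working in code points keeps a surrogate pair intact, but it still lets you cut between a base character and its combining mark.&lt;br&gt;

```java
final class CodePointDemo {
    // Number of Unicode code points; a surrogate pair counts once.
    static int codePointLength(String s) {
        return s.codePointCount(0, s.length());
    }

    // Substring addressed by code point indices, so a surrogate pair
    // is never split. Combining marks are still separate code points.
    static String codePointSubstring(String s, int begin, int end) {
        int from = s.offsetByCodePoints(0, begin);
        int to = s.offsetByCodePoints(from, end - begin);
        return s.substring(from, to);
    }
}
```

With "a\uD840\uDC00b" (U+20000 between two ASCII letters), length() is 4 while codePointLength() is 3. With "e\u0301", codePointSubstring(s, 0, 1) returns just "e" and drops the accent, which is the combining-mark problem again.&lt;br&gt;
&lt;br&gt;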
The OpenStep string classes have some methods for &amp;quot;give me the range of code points that make up the complete character at this point in the string&amp;quot;, which should work for surrogate pairs as well as combining marks. Dunno if Java has an equivalent.</description>
  	<guid isPermaLink="false">comment:ask.metafilter.com,2008:site.87888-1294818</guid>
  	<pubDate>Fri, 04 Apr 2008 09:33:45 -0800</pubDate>
  	<dc:creator>hattifattener</dc:creator>
</item>
<item>
  	<title>By: hattifattener</title>
  	<link>http://ask.metafilter.com/87888/How-do-you-deal-with-Chinese-characters-that-cant-be-represented-in-16-bits#1294874</link>	
  	<description>(And for &amp;quot;complete character&amp;quot; I probably should write &amp;quot;&lt;a href=&quot;http://www.unicode.org/faq/char_combmark.html#2&quot;&gt;grapheme&lt;/a&gt;&amp;quot;).</description>
  	<guid isPermaLink="false">comment:ask.metafilter.com,2008:site.87888-1294874</guid>
  	<pubDate>Fri, 04 Apr 2008 09:53:40 -0800</pubDate>
  	<dc:creator>hattifattener</dc:creator>
</item>
<item>
  	<title>By: hattifattener</title>
  	<link>http://ask.metafilter.com/87888/How-do-you-deal-with-Chinese-characters-that-cant-be-represented-in-16-bits#1294881</link>	
  	<description>(No, wait, &amp;quot;grapheme&amp;quot; is a slightly higher-level concept than I&apos;m looking for. Ehhh, text is hard.)</description>
  	<guid isPermaLink="false">comment:ask.metafilter.com,2008:site.87888-1294881</guid>
  	<pubDate>Fri, 04 Apr 2008 09:56:57 -0800</pubDate>
  	<dc:creator>hattifattener</dc:creator>
</item>
<item>
  	<title>By: burnmp3s</title>
  	<link>http://ask.metafilter.com/87888/How-do-you-deal-with-Chinese-characters-that-cant-be-represented-in-16-bits#1294883</link>	
  	<description>In most languages you can avoid these problems by using regular expressions to do string manipulation instead of methods like substring() and charAt().  As long as the language&apos;s regex engine supports Unicode, it will usually have a way to select individual graphemes regardless of the byte representation.&lt;br&gt;
&lt;br&gt;
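As a rough Java sketch of that idea, assuming the pattern \P{M}\p{M}* (one non-mark code point followed by any combining marks) is a good enough approximation of a grapheme for your text. The class and method names are hypothetical, and the pattern does not cover every case that full grapheme cluster rules do.&lt;br&gt;

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CombiningSequences {
    // One non-mark code point followed by any number of combining marks.
    // java.util.regex matches by code point, so a surrogate pair is one unit.
    private static final Pattern UNIT = Pattern.compile("\\P{M}\\p{M}*");

    // Counts matched units rather than 16-bit chars.
    static int count(String s) {
        Matcher m = UNIT.matcher(s);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    // Returns the first unit, or an empty string if there is none.
    static String first(String s) {
        Matcher m = UNIT.matcher(s);
        return m.find() ? m.group() : "";
    }
}
```

For "e\u0301\uD840\uDC00a" this counts 3 units even though length() reports 5, and first() returns the "e" together with its accent.&lt;br&gt;
&lt;br&gt;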
&lt;a href=&quot;http://www.regular-expressions.info/unicode.html&quot;&gt;This article&lt;/a&gt; has a good overview of using regexes with Unicode and the various issues to watch out for.</description>
  	<guid isPermaLink="false">comment:ask.metafilter.com,2008:site.87888-1294883</guid>
  	<pubDate>Fri, 04 Apr 2008 09:58:29 -0800</pubDate>
  	<dc:creator>burnmp3s</dc:creator>
</item>

    </channel>
</rss>
