Japanese PHP/MySQL development tips?
August 28, 2006 6:27 AM

I'm about to create a PHP/MySQL CMS and web site for a client, entirely in Japanese. What do I need to know?

The site is essentially a Japanese version of an existing ecommerce site with the shopping removed (it's a product catalog without any ecommerce). Due to circumstances beyond my control, I can't reuse the code the ecommerce site uses.

I do not speak Japanese, but the client will be providing a doc with English and Japanese for everything on the site, and will be entering the product info themselves using the CMS.

I usually roll my own CMS for sites of this size, but I'm considering some templating systems or maybe one of those systems that enforces MVC that the kids are so crazy about nowadays (Symfony looks interesting). I don't know that that makes a difference with my question, but maybe there's something I'm not considering.

I've been doing some reading, and I'm pretty overwhelmed by all the character set discussion. I hadn't really expected there to be more than half a dozen options. Performance is not really a concern since this site is going to be small and low traffic; the primary concerns are ease of development and that it works consistently.

So, long preamble done, my questions:
1. From what I've found so far, it sounds like UTF-8 is the character set I should go for. Is this correct? Should I look into other encodings?

2. MySQL. According to the docs, if I'm using MySQL 4.1 or greater, I can simply set a field to UTF-8 encoding like so: ALTER TABLE myTable MODIFY myColumn VARCHAR(255) CHARACTER SET utf8;. Anything else I need to do on the MySQL end?

3. PHP & HTML. I'm less clear how to get the data from a form field into UTF-8 and send it off to MySQL. On a whim, I did a test, and noticed by default IE and Firefox already do a different encoding (the data from FF ending up in the database looking like this -- & #12506; & #12540; (w/o spaces), and IE's looked like this -- ラ). Presumably I need to set the headers? Is there something I need to put in the FORM tag (does it need to be multipart?). When dealing with the submitted data can I safely just grab the $_REQUEST value and send it off to the database, or is there some transformation I need to do? Similarly, is there anything I need to do with data I have retrieved from the database before displaying it?

Thanks in advance for any advice.
posted by malphigian to Computers & Internet (7 answers total)
1) Yes, whenever possible, use UTF-8.
2) While it's best to set the fields to UTF-8, it's not particularly important because UTF-8 can be temporarily stored in any 8-bit, ASCII-compatible character set (e.g. MySQL's default, latin1) without data loss. As long as you treat it like UTF-8 on the presentation end, how you store it just needs to be ASCII-compatible.
3) Setting your HTML charset in a HEAD META tag to UTF-8 will ensure it's sent from browser to server as UTF-8. You don't need anything special in form elements. After it gets to the server, PHP treats everything as ASCII, which, as I said above, is a safe way to handle (though not display or transform) UTF-8.
4) If you're doing any manipulation on Japanese text in UTF-8, you may find an article I wrote on converting UTF-8 to arrays of Unicode code points and back useful.
posted by scottreynen at 6:46 AM on August 28, 2006
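
To make point 4 above concrete, here is a minimal sketch of converting UTF-8 to an array of code points and back. It is not the code from the linked article; the function names utf8_to_code_points and code_points_to_utf8 are made up for illustration, and the decoder assumes PCRE was built with UTF-8 support (the usual case).

```php
<?php
// Hypothetical helper: decode a UTF-8 string into an array of Unicode code points.
function utf8_to_code_points(string $utf8): array
{
    // The /u flag makes PCRE split on whole UTF-8 characters, not bytes.
    // preg_split() returns false when the input is not valid UTF-8.
    $chars = preg_split('//u', $utf8, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    $points = [];
    foreach ($chars as $char) {
        $len = strlen($char);
        $lead = ord($char[0]);
        // Keep only the payload bits of the lead byte (7, 5, 4 or 3 bits).
        if ($len === 1) {
            $point = $lead;
        } elseif ($len === 2) {
            $point = $lead & 0x1F;
        } elseif ($len === 3) {
            $point = $lead & 0x0F;
        } else {
            $point = $lead & 0x07;
        }
        // Each continuation byte contributes its low 6 bits.
        for ($i = 1; $i < $len; $i++) {
            $point = ($point << 6) | (ord($char[$i]) & 0x3F);
        }
        $points[] = $point;
    }
    return $points;
}

// Hypothetical helper: encode an array of code points back into UTF-8.
// Assumes every entry is a valid code point (0 to 0x10FFFF).
function code_points_to_utf8(array $points): string
{
    $out = '';
    foreach ($points as $p) {
        if ($p < 0x80) {
            $out .= chr($p);
        } elseif ($p < 0x800) {
            $out .= chr(0xC0 | ($p >> 6))
                  . chr(0x80 | ($p & 0x3F));
        } elseif ($p < 0x10000) {
            $out .= chr(0xE0 | ($p >> 12))
                  . chr(0x80 | (($p >> 6) & 0x3F))
                  . chr(0x80 | ($p & 0x3F));
        } else {
            $out .= chr(0xF0 | ($p >> 18))
                  . chr(0x80 | (($p >> 12) & 0x3F))
                  . chr(0x80 | (($p >> 6) & 0x3F))
                  . chr(0x80 | ($p & 0x3F));
        }
    }
    return $out;
}
```

Incidentally, the numbers 12506 and 12540 from the Firefox test in the question are code points of exactly this kind: code_points_to_utf8([12506, 12540]) yields the katakana ペー.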

As to (3), the browser should submit the form with the same encoding that was used for the page. Are you sending your pages with UTF-8 encoding?
posted by sbutler at 6:46 AM on August 28, 2006
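
A rough sketch of what "sending your pages with UTF-8 encoding" might look like in PHP: declare the charset in both the HTTP header and the META tag, and escape submitted text as UTF-8 when echoing it back. The function render_form_page and the field name product_name are hypothetical, chosen only for this example.

```php
<?php
// Hypothetical helper: build a minimal page, declared as UTF-8, containing a form.
function render_form_page(string $submitted): string
{
    // Escape for HTML, telling PHP that the bytes are UTF-8.
    $safe = htmlspecialchars($submitted, ENT_QUOTES, 'UTF-8');
    return <<<HTML
<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>UTF-8 form test</title>
</head>
<body>
<form method="post" action="" accept-charset="UTF-8">
<input type="text" name="product_name" value="{$safe}">
<input type="submit" value="Save">
</form>
</body>
</html>
HTML;
}

// Only send headers and output when running under a web server.
if (PHP_SAPI !== 'cli') {
    // The HTTP header takes precedence over the META tag, so send both.
    if (!headers_sent()) {
        header('Content-Type: text/html; charset=utf-8');
    }
    // "product_name" is an assumed field name; a plain urlencoded POST is enough.
    $name = (isset($_POST['product_name']) && is_string($_POST['product_name']))
        ? $_POST['product_name']
        : '';
    echo render_form_page($name);
}
```

With the page declared this way, the browser should submit the form as UTF-8 and no multipart encoding is needed for text. Numeric references like the ones in the Firefox test are typically what a browser sends when the page's declared encoding cannot represent the typed characters.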

It's best to have PHP just treat UTF-8 as ASCII, and not try to use any built-in Unicode support (which it doesn't really have). You might want to extend this policy to MySQL as well. Read this article and keep tabs on which encoding is which yourself.
posted by cillit bang at 7:05 AM on August 28, 2006
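
If you do treat UTF-8 as plain bytes in PHP, the main things to watch are that strlen() counts bytes rather than characters and that substr() can cut a character in half. Here is a small sketch of two byte-level helpers under that approach; both function names are hypothetical, and the validity check assumes PCRE was built with UTF-8 support.

```php
<?php
// Hypothetical helper: true if $s is well-formed UTF-8.
// With the /u flag, preg_match() returns false on malformed input.
function is_valid_utf8(string $s): bool
{
    return preg_match('//u', $s) === 1;
}

// Hypothetical helper: truncate to at most $maxBytes bytes without
// splitting a multi-byte character. Assumes $s is valid UTF-8.
function utf8_truncate_bytes(string $s, int $maxBytes): string
{
    if (strlen($s) <= $maxBytes) {
        return $s;
    }
    $cut = max(0, $maxBytes);
    // Continuation bytes have the bit pattern 10xxxxxx. Back up until
    // the cut position lands on the first byte of a character.
    while ($cut > 0 && (ord($s[$cut]) & 0xC0) === 0x80) {
        $cut--;
    }
    return substr($s, 0, $cut);
}
```

This matters mostly for column sizes and display limits: each kana or kanji typically takes three bytes in UTF-8, so a fixed byte limit holds roughly a third as many Japanese characters as ASCII ones.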

Just joining the chorus:

If your page is rendered in UTF-8, info goes in and comes out just fine.

If you don't set the charset correctly in the MySQL table, the information will still go in and come out just fine, but if you look at the data in the table directly (as opposed to via the browser), it will be garbled. From the end-user point of view, everything will be working fine, but from the point of view of someone poking around the MySQL tables, it'll be unintelligible.
posted by Bugbread at 7:10 AM on August 28, 2006
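
To illustrate the garbling described above, here is a small sketch that shows what UTF-8 bytes look like to a tool that assumes a single-byte charset. It uses ISO-8859-1 as a stand-in for whatever charset the table actually has (an assumption for the example), and view_as_latin1 is a hypothetical name. No database is involved; it only simulates the misreading.

```php
<?php
// Hypothetical helper: render raw bytes the way a Latin-1 viewer would,
// returning the result as UTF-8 so it can be printed on a UTF-8 terminal.
// ISO-8859-1 maps byte N directly to code point U+00NN.
function view_as_latin1(string $bytes): string
{
    $out = '';
    $len = strlen($bytes);
    for ($i = 0; $i < $len; $i++) {
        $b = ord($bytes[$i]);
        if ($b < 0x80) {
            // ASCII bytes are the same in both encodings.
            $out .= chr($b);
        } else {
            // Code points U+0080 to U+00FF take two bytes in UTF-8.
            $out .= chr(0xC0 | ($b >> 6)) . chr(0x80 | ($b & 0x3F));
        }
    }
    return $out;
}

// Raw UTF-8 bytes for the katakana "ペー".
$stored = "\xE3\x83\x9A\xE3\x83\xBC";

// The stored bytes themselves are untouched, which is why the site
// still works when the browser reads them back as UTF-8.
$garbled = view_as_latin1($stored);
```

The six stored bytes come back as six separate accented or control characters instead of two kana, which is the "unintelligible" view from inside the table. The underlying bytes have not changed.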

I agree with what has been said before.

Do you plan on implementing search? Japanese does not separate words with spaces (moreover, there's a "Japanese space" that occupies a different code point than the ASCII space), and conventional search algorithms that work in word-chunks will not work.

The easy solution is to find the search string anywhere in the target string. The fancy solution is to hook into something like namazu, which detects word boundaries for you.
posted by adamrice at 10:46 AM on August 28, 2006
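
A minimal sketch of the "easy solution" above: match the search string anywhere in the target, after folding the Japanese (ideographic) space U+3000 into an ASCII space. The function names are hypothetical. Byte-level strpos() is safe here as long as both strings are valid UTF-8, because a complete UTF-8 character can never match starting in the middle of another one.

```php
<?php
// Hypothetical helper: replace the ideographic space (U+3000, bytes E3 80 80)
// with an ASCII space and trim the ends.
function normalize_spaces(string $utf8): string
{
    return trim(str_replace("\xE3\x80\x80", ' ', $utf8));
}

// Hypothetical helper: true if $needle occurs anywhere in $haystack.
// Assumes both strings are valid UTF-8.
function contains_term(string $haystack, string $needle): bool
{
    $needle = normalize_spaces($needle);
    if ($needle === '') {
        // An empty query matches nothing rather than everything.
        return false;
    }
    return strpos(normalize_spaces($haystack), $needle) !== false;
}
```

In MySQL the same idea is a LIKE '%term%' match. It cannot rank results or respect word boundaries, which is where something like namazu comes in.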

Adamrice: Your link for namazu is borked. Perhaps you meant www.namazu.org?
posted by Bugbread at 12:34 PM on August 28, 2006

urp, yeah, thanks for catching that.
posted by adamrice at 1:18 PM on August 28, 2006
