How to get PHP to stop 'mis-rendering' some entities like ™ and single smart quotes.
November 28, 2005 6:22 PM

WebDevFilter: ™, ‘, and ’ walk into my database, but only ô, ë, and í come back out.

I'm pulling a passage of text for a web site out of a MySQL database using PHP. Pretty generic stuff.

This particular passage has the ™ character sprinkled throughout it. In the phpMyAdmin tool, I can see the ™ with no problems. In the form I built to edit this text, the ™ shows up just fine as well. However, when the passage is pulled from the database and echoed, every ™ turns into a ô. Also, 'smart' single quotes are getting converted to ë and í, respectively. There may be other conversions going on, but I haven't noticed them yet.

FWIW, the only thing the form does is run the string through htmlspecialchars() and addslashes() before inserting back to the db.

What am I doing wrong?
posted by Wild_Eep to Computers & Internet (12 answers total)
Specify a charset in the Content-Type header of every page. Any encoding will do, but iso-8859-1 is a good place to start.
posted by cillit bang at 6:25 PM on November 28, 2005

Just any encoding will not do, as Wild_Eep has found out.

If it works in phpMyAdmin then go to the main page and look where it says "Language". Take that value as the encoding, but remove the first part. For example, on an older setup mine says "en-iso-88591". The encoding would then be "iso-88591". On a newer setup it says "en-utf-8" and the encoding would be "utf-8".
posted by sbutler at 6:35 PM on November 28, 2005

(sorry, make that "iso-8859-1")
posted by sbutler at 6:36 PM on November 28, 2005

Also, make sure that the data was entered into the MySQL db in an encoding-safe manner. As I learned earlier this year, running INSERT statements via the console causes some sort of problem in an otherwise properly-configured table, whereby the data looks UTF-8 encoded when viewed within the console but certainly isn't once you've queried that data via PHP.
posted by Danelope at 6:56 PM on November 28, 2005

Danelope, while that's good advice, it's irrelevant in this case. The fact that it's displaying okay in phpMyAdmin indicates that it went into MySQL okay and PHP is retrieving it okay. Assuming the browser encoding isn't being changed, the only culprit left is the encoding declared for the HTML page.
posted by scottreynen at 7:10 PM on November 28, 2005
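
One way to see why a mismatch would produce exactly these characters: a minimal PHP sketch, assuming the stored bytes are Windows-1252 and the browser falls back to Mac Roman because the page declares no charset. Both are assumptions that the thread does not confirm. The two lookup tables are small hand-written excerpts of those code pages, and decode_with() is a hypothetical helper.

```php
<?php
// Hand-written excerpts of two single-byte code pages, for illustration only.
// Just the three bytes relevant to this thread are listed.
$windows1252 = [
    "\x91" => "\u{2018}", // left single quotation mark
    "\x92" => "\u{2019}", // right single quotation mark
    "\x99" => "\u{2122}", // trade mark sign
];
$macRoman = [
    "\x91" => "\u{00EB}", // e with diaeresis
    "\x92" => "\u{00ED}", // i with acute
    "\x99" => "\u{00F4}", // o with circumflex
];

// Hypothetical helper: decode a byte string with one of the tables above and
// return UTF-8. Bytes missing from the table pass through unchanged.
function decode_with(array $table, string $bytes): string
{
    return strtr($bytes, $table);
}

// The same stored bytes, read under two different encodings.
$stored = "Acme\x99 \x91quoted\x92";

echo decode_with($windows1252, $stored), "\n";
echo decode_with($macRoman, $stored), "\n";
```

Run as written, the first line prints the intended ™ and curly quotes, and the second prints ô, ë and í, the same substitutions reported in the question.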

Best answer: 1. Don't use addslashes()! Use mysql_real_escape_string().

2. Send a utf-8 header from php before you send any of the page's content: header("Content-type: text/html; charset=utf-8");

3. As soon as you connect to mysql, do a mysql_query("SET NAMES 'utf8'"); to set the connection's encoding to utf-8, which is often necessary in php/mysql apps.

4. You want this meta tag in the <head> section to be absolutely safe:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

5. Good luck :-)
posted by evariste at 7:30 PM on November 28, 2005
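
The steps above can be sketched roughly as follows. This is a minimal sketch, not evariste's code: it assumes the mysqli extension instead of the mysql_* functions named in the thread (those were removed in PHP 7), and open_utf8_connection(), build_update_sql(), charset_meta_tag(), the connection parameters, and the passages table with its body column are all hypothetical names.

```php
<?php
// Step 2: declare the encoding in the HTTP header before any output is sent.
header('Content-Type: text/html; charset=utf-8');

// Step 3 (hypothetical helper): connect, then tell MySQL that this
// connection sends and receives UTF-8.
function open_utf8_connection(string $host, string $user, string $pass, string $db): mysqli
{
    $link = new mysqli($host, $user, $pass, $db);
    $link->query("SET NAMES 'utf8'");
    return $link;
}

// Step 1 (hypothetical helper): escape with the database driver's function
// rather than addslashes(). Table and column names are made up.
function build_update_sql(mysqli $link, string $text, int $id): string
{
    $safe = $link->real_escape_string($text);
    return "UPDATE passages SET body = '" . $safe . "' WHERE id = " . $id;
}

// Step 4 (hypothetical helper): the matching meta tag for the <head> section.
function charset_meta_tag(): string
{
    return '<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />';
}
```

The sketch keeps the SET NAMES query to match the answer. With mysqli, the PHP manual recommends $link->set_charset('utf8') instead, because it also tells the escaping function which charset is in use.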

"Just any encoding will not do, as Wild_Eep has found out."

Browsers [usually] submit form data in the same encoding as the page the form was on. As long as the page with the form used for entering the text and the page used for displaying it report the same encoding, you should be OK.

(But in the general sense, you're right)
posted by cillit bang at 7:56 PM on November 28, 2005

cillit_bang: In this case, the important thing is that the form uses the same encoding as the phpMyAdmin app.
posted by mbrubeck at 8:41 PM on November 28, 2005

To elaborate a bit, the reason why I said it's often necessary to mysql_query("SET NAMES 'utf8'"); in unicode php/mysql apps is that it's already set on the server level, in the my.cnf file. But if it isn't, and you don't have root access to the server, you need to do it yourself every time you connect to mysql by running that query.
posted by evariste at 9:23 PM on November 28, 2005

Correction: "is that it's already set" should read "is that it's sometimes already set".
posted by evariste at 9:24 PM on November 28, 2005

Response by poster: Wow. It's all working just fine now!

I followed evariste's advice:

I dropped addslashes() in favor of mysql_real_escape_string() (I'm not precisely clear on why it's better, but I can't argue with the results.)

I added the header line at the top of my php file.

I added the mysql_query line to my standard include file which connects to the db for me.

I put the meta tag at the top. (For the sake of full disclosure, I was using a different one before. It's what my client's output from Quark 6.5 spit out.)

Thanks to all posters, you saved me some significant frustration!
posted by Wild_Eep at 9:36 PM on November 28, 2005

Wild_Eep, the mysql_real_escape_string() thing had nothing to do with your problem; it was just a best practice. addslashes() escapes strings for PHP's use, but mysql_real_escape_string() escapes them for MySQL's use. For example, MySQL likes you to escape \r and \n, which addslashes() doesn't escape. In future, MySQL might require other things to be escaped. In general, it's safer to use the database driver's escape function and not PHP's. Right now it doesn't affect much but it might in the future.

And as I mentioned above, you can set the connection's charset to utf-8 in my.cnf if you have root access to the mysql server, to avoid the (quite tiny) performance penalty of setting it on each pageload. You want default-character-set=utf8 in the [client] section of my.cnf.
posted by evariste at 9:49 PM on November 28, 2005
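
The point about \r and \n can be checked without a database. A minimal sketch, assuming only PHP's built-in addslashes(); it does not call mysql_real_escape_string(), which needs a live connection.

```php
<?php
// addslashes() escapes the single quote, double quote, backslash and NUL byte.
$quoted = addslashes("O'Reilly");

// Carriage return and line feed are left untouched by addslashes().
$multiline = "line one\r\nline two";
$unchanged = addslashes($multiline);

var_dump($quoted === "O\\'Reilly");   // the quote gained a backslash
var_dump($unchanged === $multiline);  // the line break was not escaped
```

Both comparisons come out true: the quote is escaped, while the \r\n pair passes through exactly as it went in.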
