Mis-encoded Mayhem!
October 9, 2009 9:25 AM
Need help converting a mySQL database with Latin characters to UTF-8
I have an install of SMF that has its database set to UTF-8, and works for *most* posts. I recently migrated the tables out of one database, and into another, to separate it out from a Joomla install (They had both been in the same DB due to the limitations of the bridging component, but a newer bridge (JFusion) allows me to separate them, so I did for safety and security).
It appears that during the copy from one DB to the other, some of the characters in the posts became corrupt, or simply aren't showing up properly. After working with some developers, we've come to the conclusion that a handful of characters in the body of messages are actually NOT in UTF-8 format, but most likely in ISO-8859-1. You can see a sample post here:
http://www.knightrealms.com/forums/index.php?topic=50.0
Originally, the body of the post wasn't showing at all, but we were able to get it to show with an error (the [?] ) instead. I can confirm other posts, and other fields (such as signatures) are having the same issue, and when I look in the DB at them, they have characters such as curved quotes, Spanish characters, etc.
Is there a way to "clean" my DB to convert the wayward characters into UTF-8? I did a search on Metafilter and found this, but the links are all long gone. I did some googling and found this, but my PHP and mySQL-fu is not good enough to actually implement it - I can set up SMF and Joomla, but I'm not a developer / coder yet.
I tired running SMF's "Convert to UTF-8" function, and while this fixes the existing issues, it then introduces additional issues in the posts that were once fine. The issue appears to be record-specific.
Any help that can point me to a step-by-step process to clean / convert the problem characters would be appreciated!
I have an install of SMF that has its database set to UTF-8, and works for *most* posts. I recently migrated the tables out of one database, and into another, to separate it out from a Joomla install (They had both been in the same DB due to the limitations of the bridging component, but a newer bridge (JFusion) allows me to separate them, so I did for safety and security).
It appears that during the copy from one DB to the other, some of the characters in the posts became corrupt, or simply aren't showing up properly. After working with some developers, we've come to the conclusion that a handful of characters in the body of messages are actually NOT in UTF-8 format, but most likely in ISO-8859-1. You can see a sample post here:
http://www.knightrealms.com/forums/index.php?topic=50.0
Originally, the body of the post wasn't showing at all, but we were able to get it to show with an error (the [?] ) instead. I can confirm other posts, and other fields (such as signatures) are having the same issue, and when I look in the DB at them, they have characters such as curved quotes, Spanish characters, etc.
Is there a way to "clean" my DB to convert the wayward characters into UTF-8? I did a search on Metafilter and found this, but the links are all long gone. I did some googling and found this, but my PHP and mySQL-fu is not good enough to actually implement it - I can set up SMF and Joomla, but I'm not a developer / coder yet.
I tired running SMF's "Convert to UTF-8" function, and while this fixes the existing issues, it then introduces additional issues in the posts that were once fine. The issue appears to be record-specific.
Any help that can point me to a step-by-step process to clean / convert the problem characters would be appreciated!
On the page you posted, the [?] is 0x93, which is not a printable latin-1 or iso8859-1 character, so you might be barking up the wrong tree here. Is that the raw data or have you already futzed with the encoding before it ended up on the site? If the latter, you'll need to reverse whatever you did, because that isn't a problem of the wrong encoding on the page so much as it just being garbage.
I'm trying to figure what character makes sense there, and I can only think of u201c (“), which iso8859-1 can't represent. This makes me skeptical this is an issue of mixup between latin1/utf8.
posted by cj_ at 12:45 PM on October 9, 2009
I'm trying to figure what character makes sense there, and I can only think of u201c (“), which iso8859-1 can't represent. This makes me skeptical this is an issue of mixup between latin1/utf8.
posted by cj_ at 12:45 PM on October 9, 2009
I take that back, this is cp1252 encoding, not latin-1 or iso8859-1.
Try just changing the meta http-equiv on the html page to use this encoding. It's set to utf-8 right now. This would be a lot easier than converting your database.
posted by cj_ at 1:02 PM on October 9, 2009
Try just changing the meta http-equiv on the html page to use this encoding. It's set to utf-8 right now. This would be a lot easier than converting your database.
posted by cj_ at 1:02 PM on October 9, 2009
PS, your html should look exactly like this:
<meta http-equiv="content-type" content="text/html; charset=windows-1252" />
This is what it looks like with the proper page encoding, compared to the original. Of course, if you are using utf-8 elsewhere, you still have a problem. Is this the case?
posted by cj_ at 1:08 PM on October 9, 2009
<meta http-equiv="content-type" content="text/html; charset=windows-1252" />
This is what it looks like with the proper page encoding, compared to the original. Of course, if you are using utf-8 elsewhere, you still have a problem. Is this the case?
posted by cj_ at 1:08 PM on October 9, 2009
I'm trying to GET to UTF-8 - I don't want to be in any other set. I'm trying to change the data to match what I want - not change what I want to match the data.
I'm trying the export thing now... let's see if that works...
posted by GJSchaller at 1:26 PM on October 9, 2009
I'm trying the export thing now... let's see if that works...
posted by GJSchaller at 1:26 PM on October 9, 2009
Yeah I understand, it's just that I'd be surprised if the GUI tool knows how to convert cp1252. If it doesn't work out for you and you're comfortable with some fiddling, you can run this against a mysqldump file and re-import:
$ python convert.py < original_dump.sql > new_dump.sql
Good luck.
posted by cj_ at 3:37 PM on October 9, 2009
#!/usr/bin/env python
import sys
for line in sys.stdin:
sys.stdout.write(line.decode('cp1252').encode('utf-8'))
sys.stdout.flush()
$ python convert.py < original_dump.sql > new_dump.sql
Good luck.
posted by cj_ at 3:37 PM on October 9, 2009
It turned out in the end that a line was left out of the JFusion code in the last round of updates - the symptoms looked like an encoding error, but once someone else reported the same issue, it was more clear it was the PHP, and not just my DB. All is set. Thank you!
posted by GJSchaller at 1:22 PM on October 10, 2009
posted by GJSchaller at 1:22 PM on October 10, 2009
« Older Rock Reggae Punk Funk Alt Goth Metal Thrash, who's... | I did the Lady Di 'do in the '80s. Newer »
This thread is closed to new comments.
What I've done in the past, is to use the power of the free MySQl Administrator software while importing/exporting the database. It has an option to choose the alphabet things have to be exported in. And a bit of fiddling with that always gave me a new database without those coding errors.
posted by ijsbrand at 10:24 AM on October 9, 2009