Please help me with this encoding issue!
April 2, 2011 11:18 AM
Fun with Python, Unicode, Excel, and the IPA.
Dear MeFites,
I just started learning Python a few months ago for a class (my first-ever attempt at any kind of programming) and was really having a hard time with it until I got inspired by a project of my own. Now I'm so close to being successful and really want to achieve this! But I've run into a character-encoding problem and am finally admitting I'm way out of my depth and need help to solve it. If things I say sound ignorant or naive or confused it's because I am, and would love to be enlightened.
So I have this Python script that is reading UTF-16-encoded plain text files that include IPA (International Phonetic Alphabet) characters, and it picks out some of the lines in each file and writes them into a new plain text file which I think is also UTF-16-encoded (admittedly I am not 100% sure of this and don't know how to check). I chose UTF-16 encoding because I did some reading and learned that Excel should be able to read that without me doing anything special. Ultimately I need to take the new file and put it in Excel and have the IPA characters show up. But what I'm getting is instead the stuff with all the slashes and x's. Here's an example line from the new file:
'H2,5,[\'\\xc9\\x99n\\xc9\\x91p\\xc9\\x99l\', "\\xc2\\xa0\\xc2\\xa0\\xc2\\xa0\\xc2\\xa0(\'n", \'appel)\\xc2\\xa0\\xc2\\xa0\']\n'
Here's the line that came from in the original file:
H2 5 ənɑpəl ('n appel)
The part of my Python script that wrote the line from the original file to the new file (where "data" is the thing that opens the file I'm writing into and this is all inside of a few loops that do the part about picking out the lines):
data.write(str(line.split()[0]))
data.write(",")
data.write(str(line.split()[1]))
data.write(",")
data.write(str(line.split()[2:]))
data.write("\n")
So basically on each line that I write, I want the ID number (here, H2), separated by a comma from the item number (here, 5), separated by a comma from the contents of the rest of the line, and then a new line. I recognize that the stuff with the slashes and x's is, well, the Unicode-type-stuff, but I can't figure out where the square brackets and commas and "s are coming from and I can't figure out how to make Excel read it nicely! Do I need to be putting things inside of u's to tell Python it's Unicode? Do I need to tell Python to make it UTF-16 somehow? Or is this maybe just about how I import it into Excel? I looked for UTF-16 in the Excel import wizard but it didn't have it.
I'll keep an eye on the thread for a bit in case I've failed to include any relevant details.
posted by ootandaboot to computers & internet (13 answers total)
import codecs
data = codecs.open("blahblah", "w", "utf16")
data.write("Hello")
I would also recommend Notepad++ for checking/converting between the encodings.
posted by azlondon at 11:36 AM on April 2, 2011
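For anyone following along, here is a rough sketch of how the pieces from the question might fit together. The square brackets, commas, and quotes in the output come from calling str() on a list (line.split()[2:] is a list, and str() of a list shows each element in its escaped form), so the sketch joins the words back into one string instead. The names format_line and convert are made up for illustration. The source encoding is an assumption too: the bytes \xc9\x99 in the sample are the UTF-8 encoding of ə, so the sketch defaults to reading UTF-8, and you would pass a different encoding if your files really are UTF-16.

```python
import io


def format_line(line):
    """Build 'ID,item,rest of line' from one whitespace-separated line."""
    parts = line.split()  # on decoded text this also splits on non-breaking spaces
    ident, item = parts[0], parts[1]
    # Join the remaining words into one string; str() on the list itself
    # is what produces the brackets, quotes, and escape sequences.
    rest = u" ".join(parts[2:])
    return ident + u"," + item + u"," + rest


def convert(src_path, dst_path, src_encoding="utf-8"):
    """Read src_path as text and write the reformatted lines as UTF-16.

    src_encoding is an assumption; change it to match the real input files.
    """
    with io.open(src_path, "r", encoding=src_encoding) as src:
        with io.open(dst_path, "w", encoding="utf-16") as dst:
            for line in src:
                if line.strip():  # skip blank lines
                    dst.write(format_line(line) + u"\n")
```

Opening the files with an explicit encoding means Python hands you decoded text rather than raw bytes, so the IPA characters are written out as themselves. The loop here copies every non-blank line; the logic that picks out particular lines would go where the `if` is.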