Please help me with this encoding issue!
April 2, 2011 11:18 AM

Please help me with this encoding issue! Fun with Python, Unicode, Excel, and the IPA.

Dear MeFites,

I just started learning Python a few months ago for a class (my first-ever attempt at any kind of programming) and was really having a hard time with it until I got inspired by a project of my own. Now I'm so close to being successful and really want to achieve this! But I've run into a character-encoding problem and am finally admitting I'm way out of my depth and need help to solve it. If things I say sound ignorant or naive or confused it's because I am, and would love to be enlightened.

So I have this Python script that is reading UTF-16-encoded plain text files that include IPA (International Phonetic Alphabet) characters, and it picks out some of the lines in each file and writes them into a new plain text file, which I think is also UTF-16-encoded (admittedly I am not 100% sure of this and don't know how to check). I chose UTF-16 encoding because I did some reading and learned that Excel should be able to read that without me doing anything special. Ultimately I need to take the new file and put it in Excel and have the IPA characters show up. But what I'm getting instead is stuff with all the backslashes and x's. Here's an example line from the new file:

'H2,5,[\'\\xc9\\x99n\\xc9\\x91p\\xc9\\x99l\', "\\xc2\\xa0\\xc2\\xa0\\xc2\\xa0\\xc2\\xa0(\'n", \'appel)\\xc2\\xa0\\xc2\\xa0\']\n'

Here's the line that came from in the original file:
H2 5 ənɑpəl     ('n appel)  

The part of my Python script that wrote the line from the original file to the new file (where "data" is the thing that opens the file I'm writing into and this is all inside of a few loops that do the part about picking out the lines):

data.write(str(line.split()[0]))
data.write(",")
data.write(str(line.split()[1]))
data.write(",")
data.write(str(line.split()[2:]))
data.write("\n")

So basically on each line that I write, I want the ID number (here, H2), separated by a comma from the item number (here, 5), separated by a comma from the contents of the rest of the line, and then a new line. I recognize that the stuff with the backslashes and x's is, well, the Unicode-type-stuff, but I can't figure out where the square brackets and commas and quotation marks are coming from, and I can't figure out how to make Excel read it nicely! Do I need to be putting things inside of u's to tell Python it's Unicode? Do I need to tell Python to make it UTF-16 somehow? Or is this maybe just about how I import it into Excel? I looked for UTF-16 in the Excel import wizard but it didn't have it.

I'll keep an eye on the thread for a bit in case I've failed to include any relevant details.
posted by ootandaboot to Computers & Internet (13 answers total) 2 users marked this as a favorite
 
Best answer: Not a Python expert by a long shot, but I believe you should open the file for writing as UTF-16:
import codecs
data = codecs.open("blahblah", "w", "utf-16")
data.write("Hello")
I would also recommend Notepad++ for checking/converting between the encodings.
posted by azlondon at 11:36 AM on April 2, 2011


I can't help you with a direct answer to your question, but for several weeks I've been learning SQL and dealing with data in IPA. My research suggested Excel docs encoded in UTF-8 should work fine...but no. Every single thing I tried (and believe me, I tried) would not (re)produce the IPA input on a re-open of the file in Excel. fwiw, I got all other programs/apps to comply. I ultimately used TextWrangler and avoided Excel...a temporary solution. :(
posted by iamkimiam at 11:36 AM on April 2, 2011


Best answer: Your original data is not being read in as Unicode. If it were, you would get an error when you use str() on data that contains Unicode-only characters like the ones in the IPA. Read the Python Unicode HOWTO.

Instead of using open(filename) you need to import codecs and then use codecs.open(filename, encoding="utf-16") and do the same when you write.
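
For example, a minimal sketch of what that could look like (the filenames are placeholders, and this assumes the input really is UTF-16):

import codecs

# open both files with an explicit encoding, so Python hands you
# unicode objects instead of raw bytes
infile = codecs.open("original.txt", "r", encoding="utf-16")
outfile = codecs.open("picked_lines.txt", "w", encoding="utf-16")

for line in infile:
    # line is a unicode object here; write unicode, don't wrap it in str()
    outfile.write(line)

infile.close()
outfile.close()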
posted by grouse at 11:37 AM on April 2, 2011 [1 favorite]


Best answer: Part of your problem is that line.split()[2:] gives a list. When you call str on a list, you get "[...,...,...]". What do you actually want done with elements two or higher? Do you want them to be separated by commas? Or concatenated?
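
For example, with a plain-ASCII stand-in for the IPA line (just to show where that punctuation comes from, and two possible alternatives):
>>> parts = "H2 5 xyz ('n appel)".split()[2:]
>>> print str(parts)          # str() of a list: brackets, commas, quotes
['xyz', "('n", 'appel)']
>>> print u",".join(parts)    # elements separated by commas instead
xyz,('n,appel)
>>> print u" ".join(parts)    # or glued back together with spaces
xyz ('n appel)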
posted by novalis_dt at 11:39 AM on April 2, 2011


And yes I realize you're working with UTF-16, but I just wanted to point out my experience, with the common problematic element being Excel.
posted by iamkimiam at 11:40 AM on April 2, 2011


Also, your pasted script is not a good way of doing things. It will eliminate any whitespace in the last field; is that really what you want to do? And "data" isn't a great name for an output file variable.

Instead, I'd do something like this:
import csv

writer = csv.writer(outfile)

for line in lines:
    # split a maximum of twice
    row = line.split(u" ", 2)
    assert len(row) == 3
    writer.writerow(row)
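
One caveat, hedged because I haven't tested it against your files: my understanding is that in Python 2 the csv module deals in byte strings rather than unicode, so if the fields contain IPA characters you may need to encode each one before handing the row to writerow, roughly like this:

# encode unicode fields (e.g. to UTF-8) before Python 2's csv sees them
writer.writerow([field.encode("utf-8") for field in row])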

posted by grouse at 11:47 AM on April 2, 2011


Response by poster: Thank you all for your suggestions so far! I imported codecs and used codecs.open for both reading in the original file and opening the new file (which I promise to change to something more informative than "data"). But now I'm getting this error:

Traceback (most recent call last):
  File "<stdin>", line 3, in <module>
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/codecs.py", line 679, in readlines
    return self.reader.readlines(sizehint)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/codecs.py", line 588, in readlines
    data = self.read()
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/codecs.py", line 477, in read
    newchars, decodedbytes = self.decode(data, self.errors)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_16.py", line 112, in decode
    raise UnicodeError,"UTF-16 stream does not start with BOM"
UnicodeError: UTF-16 stream does not start with BOM

I'm reading about BOMs now but am not sure I quite understand what it is I'm supposed to put where. For one thing, I definitely don't know if my UTF-16 encoding is big-endian or little-endian.

Azlondon, I can't use Notepad++ because I'm on a Mac (10.5). Should I look for a Mac equivalent? Is it something I would be editing my script in? (currently just using IDLE).

Grouse, that looks like a great solution to the problem novalis_dt pointed out, so thanks to both of you. I'll try it soon, but for the moment I just changed the third element to [2] so it wouldn't be a list and I could still see if the IPA stuff is working.

posted by ootandaboot at 12:06 PM on April 2, 2011


Best answer: The x86 platform is little-endian. You can check which byte order your platform uses like this:
$ python
Python 2.6.5 (r265:79063, Jun 12 2010, 17:07:01) 
[GCC 4.3.4 20090804 (release) 1] on cygwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import codecs
>>> codecs.BOM == codecs.BOM_UTF16_LE
True
>>> codecs.BOM == codecs.BOM_UTF16_BE
False
>>> codecs.BOM
'\xff\xfe'
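
As an aside, in case the input file turns out to have no BOM at all: my understanding is that Python's "utf-16" codec expects a BOM when reading (hence that error) and writes one for you when writing, while "utf-16-le" and "utf-16-be" assume a fixed byte order and no BOM. So as an alternative to adding a BOM by hand, something like this might work (filename is a placeholder):

import codecs

# "utf-16-le" decodes little-endian UTF-16 without needing a BOM;
# use "utf-16-be" instead if the file turns out to be big-endian
infile = codecs.open("original.txt", "r", encoding="utf-16-le")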

posted by grouse at 12:17 PM on April 2, 2011


Response by poster: Thanks Grouse, that helped. I think I'm getting so close! I downloaded TextWrangler and used that to add the right BOM. But now I'm getting a new error instead:

UnicodeEncodeError: 'ascii' codec can't encode character u'\u0259' in position 0: ordinal not in range(128)

For what it's worth, U+0259 (which happens to be schwa) is the first IPA symbol it would try to write, so I'm guessing that's why it would be getting stuck on that particular one. But if I opened both the original file (to read from) and the new file (to write to) using codecs.open and giving it utf-16 as the encoding, why would I be getting an error about ascii?
posted by ootandaboot at 12:57 PM on April 2, 2011


Best answer: That's the error you get when you don't open the output file with the correct encoding:
>>> z = codecs.open("test.txt", "w", encoding="utf-16")
>>> z.write(u"\u0259")
>>> z.close()
>>> z = codecs.open("test.txt", "w")
>>> z.write(u"\u0259")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u0259' in position 0: ordinal not in range(128)
If you can't figure this out, try to narrow down the code to a short and complete test case, and then test it. Doing that alone can often help you find the problem.
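
For example, a self-contained round trip like this (filename is made up) tests the encoding machinery and nothing else:

import codecs

# write a single IPA character out...
out = codecs.open("roundtrip.txt", "w", encoding="utf-16")
out.write(u"\u0259\n")
out.close()

# ...then read it back and look at exactly what came through
inp = codecs.open("roundtrip.txt", "r", encoding="utf-16")
print repr(inp.read())    # expect u'\u0259\n'
inp.close()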
posted by grouse at 1:01 PM on April 2, 2011


Response by poster: Hm...when I open the output file, I can do something like this just fine:
output.write(u"\u0259")

But then, still with the same output file open, it gives me that ascii error when I put in the bit of my script that involves several loops. I wonder if it might have something to do with the use of f.readlines() to read in the original file? I do open the file using codecs.open...

I have to leave my computer for a while now but will hope to figure this out later. So grateful to everyone for their help!
posted by ootandaboot at 1:14 PM on April 2, 2011


In general, the easiest way to avoid Unicode errors in Python is to be sure that at all times, you understand whether you are dealing with unicode objects or plain str objects. When you read from a file (and you're not using codecs), you get plain string objects. You then call their decode method to convert them into unicode. When you want to write to a file, and you have a unicode object, call its encode method.

It's a bit more work than using codecs (maybe). But it's always 100% clear what is going on.
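
A bare-bones sketch of that approach (filenames made up, and assuming the input is UTF-16 with a BOM):

# read raw bytes, then decode them into unicode explicitly
raw = open("original.txt", "rb").read()
text = raw.decode("utf-16")

# ... pick out the lines you want here ...

# encode explicitly when writing bytes back out
out = open("picked_lines.txt", "wb")
out.write(text.encode("utf-16"))
out.close()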
posted by novalis_dt at 1:49 PM on April 2, 2011


Where you've used str(), you need unicode() instead.
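
For instance (a quick interpreter illustration of the difference):
>>> schwa = u"\u0259"
>>> unicode(schwa)    # stays a unicode object, no encoding happens
u'\u0259'
>>> str(schwa)        # tries to encode to ASCII and blows up
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u0259' in position 0: ordinal not in range(128)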
posted by cogat at 3:13 PM on April 2, 2011

