Help me get this Unicode converted to ANSI
December 31, 2010 8:52 AM
Normally, to convert a Unicode file to ANSI, I just bust open Notepad++ and use its handy-dandy Convert option, but now I've got some MASSIVE text files that Notepad++ can't open!
I basically need to convert these Unicode files to ANSI, and I can't open them with Notepad++. They are approaching 1GB in size, and I'm not really sure what the next best option is. I've also tried a WiToAnsi.vbs script I found with the Google (from MS's site), but that script throws an error after getting just a few MBs into the file. My knowledge of vim is limited - I'm going to maybe try OpenOffice next and see if there's a conversion option there. Any experience or helpful suggestions welcome - thanks MeFi!
Best answer: I'm not sure if this is relevant, but I dealt with something similar years ago.
I had a file in some weird format. It turned out that each character was represented in two bytes: the first byte was zero, and the second byte was the same as it would be in ASCII. So I wrote a very simple Perl script to strip out all the empty bytes.
If you'd like a copy of the script I'd be happy to send it over.
posted by DrumsIntheDeep at 9:52 AM on December 31, 2010
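For reference, a minimal sketch of the kind of strip-the-zero-bytes script described above, assuming UTF-16LE input that contains only ASCII characters (the file names are hypothetical, and as Rhomboid cautions later in the thread, anything outside ASCII will be silently mangled, so a real UTF-16-aware converter is safer):

perl -pe 'tr/\x00//d; s/^\xFF\xFE// if $. == 1' infile > outfile

The tr/\x00//d deletes the NUL bytes, and the substitution drops the UTF-16LE byte-order mark from the first line; run it from Cygwin or another Unix-like shell to avoid CRLF translation surprises.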
On the command line, if you TYPE a Unicode text file, the output will be ANSI
(note any extended characters will be lost)
e.g.:
TYPE UnicodeFile.txt > ANSIFile.txt
posted by Lanark at 9:52 AM on December 31, 2010
If you have access to scripting languages - like Bash, Ruby, or Python - you could use the scripts here.
I'm not a Unicode expert, but if your file only contains ANSI characters and you want to convert it to UTF-8 without a Byte Order Mark, I'm pretty sure it doesn't need conversion.
posted by I_pity_the_fool at 11:35 AM on December 31, 2010
Note that "ANSI" is just an historical term; Windows-1252 (which is probably what you want) was never standardized by ANSI.
posted by Monday, stony Monday at 11:37 AM on December 31, 2010
I had a file in some weird format. It turned out that each character was represented in two bytes: the first byte was zero, and the second byte was the same as it would be in ASCII. So I wrote a very simple Perl script to strip out all the empty bytes.
How does that differ from UTF-16? As I say, I don't know a great deal about unicode, but that sounds awfully like it.
posted by I_pity_the_fool at 11:38 AM on December 31, 2010
Best answer: There's a handy command-line utility on many Linux systems called iconv. I didn't see a Windows installer, but if you're doing a lot of this it may be worth investigating.
posted by sammyo at 11:40 AM on December 31, 2010
Best answer: You can use iconv by installing Cygwin.
posted by Monday, stony Monday at 11:46 AM on December 31, 2010
Nthing iconv. It sounds like you're on Windows, in which case the easiest way to get it working is to install Cygwin: enter iconv in the search box at the install-packages stage, and select the package that contains iconv (it's not in the default packages). After the install it should be a one-liner (example on wiki) to convert any file. Note that to access Windows paths, you have to cd to something like /cygdrive/c/Users/me/desktop, where c is the DOS drive letter.
posted by benzenedream at 11:52 AM on December 31, 2010
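For the record, a sketch of such a one-liner (assuming a UTF-16 source, a CP1252 target, and hypothetical file names):

iconv -f UTF-16 -t CP1252 /cygdrive/c/Users/me/desktop/bigfile.txt > /cygdrive/c/Users/me/desktop/bigfile-ansi.txt

If iconv aborts on a character that CP1252 can't represent, GNU iconv accepts -t CP1252//TRANSLIT to substitute an approximation instead of stopping.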
Best answer: Your terminology is very imprecise. "Unicode" and "ANSI" are not specific encodings. "Unicode" most likely means UTF-16, but could also mean UTF-8 or a number of other encodings that are all part of the Unicode standard. "ANSI" is also ambiguous: it probably means code page 1252, assuming you're using an English version of Windows with the code page set to 1252 (Western European), but it could be any other code page as well. If those assumptions are correct, the command would be iconv -f UTF-16 -t CP1252 infile outfile.
posted by Rhomboid at 12:31 PM on December 31, 2010 [1 favorite]
Best answer: Iconv is what you want, if you are willing to install it.
Also: Neither "Unicode" nor "ANSI" specify character encodings. Unicode is a character set standard for which there are many encodings (the most common being UTF-16 and UTF-8). ANSI is a standards body that has standardized a bazillion things, including the historical ASCII character encoding (which goes by many names since every standards body in the world needed to get in on the action: US-ASCII, ANSI X3.4, ECMA-6, ITU-T T.50 IA5 or IRV, and ISO-646 are the same or nearly the same), but in the Windows world "ANSI" means Windows Code Page 1252, which is a non-standard superset of ISO 8859 Latin-1, which is a standardized superset of US-ASCII (aka ANSI X3.4 etc).
(On preview after checking my references— what Rhomboid said.)
posted by hattifattener at 12:36 PM on December 31, 2010 [3 favorites]
Response by poster: Thanks for the input, everyone! You all got me looking in the right direction (can't believe I didn't think to try to manipulate the file in Cygwin, though I had never heard of iconv), but the easiest solution was (embarrassingly) to use OpenOffice to open the file, then save it as a .txt file. I had tried this earlier, but mistakenly used the "Encoded Text" Save option instead of just the "Text" Save option. It takes forever and makes my ancient computer freeze for roughly an hour on each file, but it works! I'll note that it didn't work perfectly - these were delimited text files, and one of the delimiters was changed in the process, but I was able to work around that problem.
When I looked at the files in WinHex, I noticed that the character I wanted to "keep" was often followed (or preceded - can't remember offhand) by a 00 byte. I was trying to think of how I could possibly get rid of every other byte in the file. I have very limited scripting skills, but I should obviously work on improving them to get myself out of jams like this. I'll probably keep these files around to practice some of the solutions you all have recommended, to see if I can fix this in a more "clean" fashion.
Again, thanks for all your help!
posted by antonymous at 12:52 PM on December 31, 2010
Response by poster: Thanks for the additional info on character sets and encodings - I had no idea how noobish my question sounded (though we're all noobs at something, I suppose). Character encodings and standards are hard! Let's go shopping!
posted by antonymous at 12:59 PM on December 31, 2010
It takes forever and makes my ancient computer freeze for roughly an hour on each file, but it works!
*legions of command line nerds gnash teeth and die a little inside*
posted by benzenedream at 1:07 PM on December 31, 2010 [1 favorite]
Yup, if every other byte is 00, then you almost certainly have UTF-16. Dropping the 00s (and possibly also removing the byte-order-mark at the beginning of the file) will produce valid Latin-1 (and therefore valid CP1252), although if there are any characters which aren't representable in Latin-1 you'll get garbage in their place instead of a warning.
(And I agree, character encodings are really surprisingly hairy, don't feel bad.)
posted by hattifattener at 1:07 PM on December 31, 2010
Please, for the love of $deity, don't ever go trying to write a script to remove every other byte just because they seem to be zero. That way lies madness, and it's the absolute worst way to approach an encoding issue. For one thing, what if there are code points higher than U+00FF in the file? You will find that if your text has anything like 'smart quotes' or em-dashes it will come out as total garbage, not to mention non-Latin alphabets. When you see a file where every other byte looks like zero, use a library or processing program that speaks UTF-16 to convert it to whatever format you want.
posted by Rhomboid at 1:16 PM on December 31, 2010 [2 favorites]
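To make the hazard concrete, here's a quick demonstration (a hypothetical check using Perl's core Encode module) of the UTF-16LE bytes for a plain 'A' versus a curly quote, U+201C:

perl -MEncode -e 'printf "%v02X\n", encode("UTF-16LE", "A\x{201C}")'

This prints 41.00.1C.20: the 'A' carries a 00 byte, but the curly quote is 1C 20 with no 00 at all, so a strip-the-zeros script would leave it as two bytes of garbage.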
Response by poster: Rhomboid - that's the exact thought that caused me to post this question. I started down that road and quickly realized that I was bringing an axe to a scalpel fight.
posted by antonymous at 1:25 PM on December 31, 2010
Well then, for future reference, here is how you'd convert the file the right way using Perl:
perl -MEncode=from_to -pe 'from_to($_, "utf-16", "cp1252")' <infile >outfile
posted by Rhomboid at 1:42 PM on December 31, 2010 [1 favorite]
Actually, no, that's not the right way. That command will work if you slurp the whole file into memory (by adding the -0777 option), but if you want to do it line by line you need:
perl -pe 'BEGIN { binmode STDIN, ":encoding(utf16)"; binmode STDOUT, ":encoding(cp1252)" }' <infile >outfile
posted by Rhomboid at 2:08 PM on December 31, 2010 [1 favorite]