Hey Typing Nerds!
August 19, 2011 11:47 AM

I've been using computers for fifteen years at this point, and I still don't understand the difference between UTF-8 and ASCII. For someone who works primarily in Linux and likes to read, write and manipulate text, but doesn't care much about non-standard symbols, what's the functional difference between UTF-8 and ASCII, and what do I need to know about these two systems in order to choose what to save my text files in and how to prevent my documents from getting screwed up?
posted by Apropos of Something to Computers & Internet (9 answers total) 11 users marked this as a favorite
This Joel on Software article is all you'll need to get your brain going.

Seriously. It's a really high-level intro to someone who doesn't yet grok the fundamental fact that a character isn't a byte.

Having said that, virtually all GNU/Linux tools are Unicode-compatible; supporting many languages means they have to be. Save your files however you like; the tools will take care of the byte order mark and all that for you and, unless you're writing code, you'll likely never know the difference.
posted by introp at 11:51 AM on August 19, 2011 [7 favorites]

The article introp linked seems to cover everything.

Super-short tl;dr version: For plain English text with no fancy characters, UTF-8 and ASCII are identical. Once a Unicode character shows up, the UTF scheme encodes it in a way that makes for good space efficiency but that breaks the "every character is 8 bits" scheme used by ASCII. For this reason it's important that your text manipulation software, whatever it may be, recognizes UTF-8 or text will get mangled when you edit it.

To facilitate this there's the optional "byte-order mark" at the beginning of a UTF-8 file. In HTML you don't want to use this (because the character-set is defined in the header tags anyways) but for other text files you probably do...
posted by neckro23 at 12:14 PM on August 19, 2011

We say ASCII, but what are we really talking about? 8 bits, 1 byte, 256 characters. Now, what is assigned to the first 128 slots (characters 0 - 127) is pretty well established: some control characters, numbers, the English Latin alphabet, and some basic punctuation.

But what goes in the last 128 slots is a toss-up. ASCII, ANSI, MacRoman... these are just the names of some of the schemes for assigning the last 128 slots. With DOS you had border and frame glyphs; with MacRoman, various accented characters used for European languages.

Let's take a minute to think about Unicode. It's a ton of slots, tens of thousands, that have been assigned to every "character"/"glyph" used by languages we want to type on the computer. Take Japanese for instance. The basic "character" set that people need to know to read a newspaper is 2,000. You can't fit that in ASCII, assigning 1 byte for each character. You could do it assigning 2 bytes for each character if you wanted, but computers have for decades been taught to think of strings as a sequence of 1 byte characters.

Suppose you're counting to 3 in Japanese... ichi, ni, san. And since we're doing our own 2 byte character set, let's give ichi = \x0001, ni = \x0002, san = \x0003. So counting looks like an array of:

{ \x0001, \x0002, \x0003, \x0000 }

We'll still use the \0 byte as an end marker. And this is all fine and good as long as you're always using string routines that expect 2 byte characters. But what happens when you accidentally pass your 2 byte character string to a 1 byte routine? The 1 byte routine sees this (on big endian machines):

{ \x00, \x01, \x00, \x02, \x00, \x03, \x00, \x00 }

That routine is going to think your string is empty because it sees the end marker as the first character!
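
A minimal Python sketch of that failure, assuming the made-up 2-byte code above (ichi = 1, ni = 2, san = 3). The helper c_strlen is hypothetical; it only mimics how a 1-byte C string routine scans for the end marker.

```python
import struct

def c_strlen(buf: bytes) -> int:
    """Mimic a 1-byte string routine: count bytes up to the first zero byte."""
    end = buf.find(b"\x00")
    return len(buf) if end == -1 else end

# ichi, ni, san plus a 2-byte end marker, stored as big-endian 16-bit units
wide = struct.pack(">4H", 1, 2, 3, 0)

print(wide.hex(" "))    # 00 01 00 02 00 03 00 00
print(c_strlen(wide))   # 0: the routine thinks the string is empty
```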

What we really want is to store Japanese, or all of Unicode, safely in 1 byte characters. That's exactly what UTF8 does: it safely uses multiple bytes to encode a single Unicode character. And how does it do this? By using only those disputed upper 128 slots!

Another digression: what's a multibyte encoding? You use them every day; you probably just don't realize it. Take email: most email systems will refuse to handle messages that contain any non-printable character. So how do we send binary attachments? We base64 encode them! It takes each group of 3 bytes in your binary and maps it to a sequence of 4 characters from a set of 64 printable ones (A-Z, a-z, 0-9, plus some punctuation).

URLs have a limited set of characters that can appear in them. Space being the most famous one to be escaped. When you want to escape a character, you encode a multibyte sequence of "%XX" where XX is the hex of the character number. So space becomes %20.

Write any HTML? Less than and greater than have special meaning, so they need to be escaped. And we do it with a multibyte sequence: &lt; and &gt;
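
The three everyday escapes above can be tried with Python's standard library. This is only an illustration; the sample inputs are made up.

```python
import base64
import html
from urllib.parse import quote

attachment = bytes([0x00, 0x01, 0x02])   # three non-printable bytes

b64 = base64.b64encode(attachment).decode("ascii")  # 3 bytes become 4 printable characters
url = quote("a b")                                  # the space is percent-encoded
markup = html.escape("<b>")                         # angle brackets become entities

print(b64)     # AAEC
print(url)     # a%20b
print(markup)  # &lt;b&gt;
```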

UTF8 and ASCII are the same for the first 128 slots. UTF8 uses the upper 128 slots to encode a single Unicode character as a multibyte sequence. So look at the fancy "résumé". It contains two characters that aren't in the basic English Latin alphabet. UTF8 says that we make an "é" by using the sequence "\xC3\xA9". So when we encode "résumé" its raw bytes are "r\xC3\xA9sum\xC3\xA9". If we were doing MacRoman this would be "r\x8Esum\x8E" because MacRoman has a different definition of what goes in the higher slots.
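
The byte values for "résumé" can be checked with Python's built-in codecs. A small sketch; mac_roman and latin-1 are the codec names Python uses for MacRoman and ISO-8859-1.

```python
word = "résumé"

utf8 = word.encode("utf-8")
mac = word.encode("mac_roman")
latin1 = word.encode("latin-1")

print(utf8)    # b'r\xc3\xa9sum\xc3\xa9'  (8 bytes for 6 characters)
print(mac)     # b'r\x8esum\x8e'          (6 bytes, é is 0x8E)
print(latin1)  # b'r\xe9sum\xe9'          (6 bytes, é is 0xE9)
```

Same six characters, three different byte strings, and only the unaccented letters agree.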

Make any sense? Programmers like UTF8 because it has some other nice features. Suppose I sent you: "r\xC3\xA9sum\xC3". If you knew the details of UTF8 you would know that I left a byte off the end and that the sequence is corrupted (the first byte of a multibyte sequence tells you how many more bytes follow). Also, if I sent you "\xA9sum\xC3\xA9" you would know that you're missing bytes from the start (every non-initial byte of a multibyte sequence has a way of indicating that it is not the initial byte).
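
A rough sketch of how those two checks work, using the bit patterns UTF-8 reserves for first bytes and follow-on bytes. The function names are hypothetical, and the length check ignores finer validity rules (overlong forms and so on).

```python
def is_continuation(byte: int) -> bool:
    """Non-initial bytes of a multibyte sequence always look like 10xxxxxx."""
    return (byte & 0xC0) == 0x80

def sequence_length(lead: int) -> int:
    """Read the total sequence length from the first byte's high bits."""
    if lead < 0x80:
        return 1            # 0xxxxxxx: plain ASCII
    if lead >> 5 == 0b110:
        return 2            # 110xxxxx
    if lead >> 4 == 0b1110:
        return 3            # 1110xxxx
    if lead >> 3 == 0b11110:
        return 4            # 11110xxx
    raise ValueError("not a valid first byte")

def is_valid_utf8(data: bytes) -> bool:
    """Let Python's own decoder decide whether the bytes are well-formed."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

print(sequence_length(0xC3))                 # 2
print(is_continuation(0xA9))                 # True
print(is_valid_utf8(b"r\xc3\xa9sum\xc3"))    # False: last byte promises one more
print(is_valid_utf8(b"\xa9sum\xc3\xa9"))     # False: starts with a follow-on byte
```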

Not all encoding schemes have this feature, Shift_JIS being the most famous example (an early scheme for encoding Japanese that's still in use). Take the URL percent encoding above: if I send you "20", you'd have no way of knowing whether I meant a literal "20" or the tail end of "%20".
posted by sbutler at 12:32 PM on August 19, 2011 [3 favorites]

You may also be interested in the different ways of representing the end of a line of text.
posted by exogenous at 12:35 PM on August 19, 2011

Best answer: ASCII is effectively a subset of UTF-8.

UTF-8 is an encoding for the Unicode character set, which is much larger than the ASCII character set: it has codes assigned for basically every writing system ever devised, including things like Chinese that have thousands of different characters.

But the UTF-8 encoding was specifically designed so that the representation of any Unicode character that's also in ASCII is the same as its usual ASCII representation.

So, any ASCII text file is also a UTF-8 text file. The reverse is not the case: a UTF-8 file full of Chinese characters, or even one that's completely in English but uses occasional accent marks (maybe it's about Pokémon), would not be ASCII.

(There's one thing that confuses the issue: a lot of older apps use some sort of "extended ASCII" and don't draw a firm distinction between this and standard ASCII. These extensions are 8-bit codes: they store each character in one byte, and use every bit of it. ASCII is a 7-bit code that simply doesn't use the high bit in the byte, and UTF-8 takes advantage of this by using that unused bit to indicate whether the remaining 7 bits are an ASCII character or part of a multi-byte Unicode encoding. So if you type the word "Pokémon" into an older text editor, and choose to "save as ASCII", you might wind up with something that isn't valid UTF-8. Newer apps will typically complain that the file can't be saved as ASCII and ask you what you want to do about it.)
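
A quick way to see this in Python, as a sketch only. Here latin-1 stands in for whatever "extended ASCII" an older app might use, and the helper names are made up.

```python
def is_pure_ascii(data: bytes) -> bool:
    """True when no byte has its high bit set."""
    return all(b < 0x80 for b in data)

def decodes_as_utf8(data: bytes) -> bool:
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

plain = "Pokemon".encode("ascii")      # 7-bit ASCII
legacy = "Pokémon".encode("latin-1")   # one byte per character, é is 0xE9
modern = "Pokémon".encode("utf-8")     # é becomes two bytes

print(is_pure_ascii(plain), decodes_as_utf8(plain))    # True True
print(is_pure_ascii(legacy), decodes_as_utf8(legacy))  # False False
print(is_pure_ascii(modern), decodes_as_utf8(modern))  # False True
```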
posted by baf at 12:47 PM on August 19, 2011 [1 favorite]

Also note that, although UTF-8 covers the entire Unicode character set, it's really designed for text that's mostly ASCII, and can be inefficient for other things. To use Chinese as an example again, the most common Chinese characters take up 3 bytes in UTF-8, but only 2 in UTF-16, another Unicode encoding.
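
The size difference is easy to measure. A small sketch with made-up sample strings; "utf-16-le" is used so that Python does not prepend a byte order mark to the count.

```python
def encoded_sizes(text: str) -> tuple[int, int]:
    """Return (UTF-8 byte count, UTF-16 byte count) for the given text."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))

print(encoded_sizes("中文"))    # (6, 4): 3 bytes per character vs 2
print(encoded_sizes("hello"))  # (5, 10): 1 byte per character vs 2
```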
posted by baf at 1:05 PM on August 19, 2011

Best answer: how to prevent my documents from getting screwed up?

Any time a program examines data, it must know what encoding that data is in. Every time. Without exception. Documents get screwed up when this does not happen, either because the encoding was specified incorrectly, or because it wasn't specified at all and the program had to guess or assume or fall back on a default.

For example, a web server should specify the encoding of the document that it's sending to the browser, or failing that the document should internally specify its encoding (and hope that the browser is smart enough to be able to sniff the beginning part of the document to look for this crucial tidbit before starting to actually parse anything.) You get problems when this doesn't happen. "Text files", whatever that means, are particularly problematic because there is no metadata supplied with the file telling its encoding; it could be anything. The only way to prevent corruption is to know ahead of time what encoding was used to save the file before opening it. There are some shortcuts here like byte order marks, but they only apply for unicode encodings, and many problems arise from conversion between non-unicode and unicode encodings.

And just so it's clear, this is not "ASCII vs UTF8". There are more encodings than you can count -- run iconv -l | less for a decent list. For English text, the most common encodings you'll encounter are ASCII, ISO-8859-1, CP1252, and UTF-8. In all four of these encodings, standard things like (unaccented) letters and numbers are all the same byte values. And this is why it's so easy to get lazy and not worry about encoding, because a file that consists entirely of standard English letters, punctuation, and newlines will look identical in all four of those encodings, so it's tempting to just ignore the distinction. It's only when you start to encounter things like letters with accents/tildes/carons or currency symbols or fancy double quotes that these encodings diverge and suddenly you realize that there is a difference. Is the letter ñ one byte (0xf1 - ISO-8859-1, CP1252) or is it two bytes (0xC3 0xB1 - UTF-8 or 0xf1 0x00 - UTF-16LE or 0x00 0xf1 - UTF-16BE)? The only way to prevent data loss is to know what encoding every program expects/supports, and make sure they're always speaking the same thing.
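
The ñ example can be reproduced with Python's codecs, along with what happens when a program assumes the wrong encoding. A minimal sketch; the variable names are just for illustration.

```python
# Encode the same character with each encoding mentioned above
encodings = ["latin-1", "cp1252", "utf-8", "utf-16-le", "utf-16-be"]
as_bytes = {name: "ñ".encode(name) for name in encodings}

for name, raw in as_bytes.items():
    print(f"{name:10} {raw.hex(' ')}")

# Plain English text is byte-for-byte identical in the four common encodings
same = {"abc 123".encode(e) for e in ["ascii", "latin-1", "cp1252", "utf-8"]}
print(len(same))   # 1

# Reading UTF-8 bytes as CP1252 silently produces garbage instead of an error
garbled = "ñ".encode("utf-8").decode("cp1252")
print(garbled)     # Ã±
```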
posted by Rhomboid at 3:14 PM on August 19, 2011

Even more simply: every character in a computer is represented as a number. We know that because computers only know numbers. So when you save a file, what really goes into the computer is this (skipping the binary step):

002 012 023 127 012 054 063 079 083 047

So as a computer reads a file, it just sees that string of different numbers. In order to convert those numbers into letters, question marks or actual numbers, it needs to know what rules were used to convert the original letters into numbers. ASCII, EBCDIC, UTF, etc, are those different types of schemes. When I type an "a" or a period or a ®, the computer sees my keystroke, knows what I'm trying to say, and records the correct number. As long as the computer on the other end has some way of knowing what scheme my computer used to translate the characters into numbers, what went in comes back out again.
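
To make "same text, different numbers" concrete, here is a small sketch comparing ASCII with one EBCDIC variant. Using cp037 as the EBCDIC codec is an assumption for illustration; there are many EBCDIC code pages.

```python
text = "Hi"

as_ascii = text.encode("ascii")
as_ebcdic = text.encode("cp037")   # one EBCDIC code page that ships with Python

print(list(as_ascii))    # [72, 105]
print(list(as_ebcdic))   # different numbers for the same two letters

# Decoding with the scheme that was used to encode gets the text back
print(as_ebcdic.decode("cp037"))   # Hi
```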

So, the most important thing is to pick one and use it consistently and remember which choice you made.
posted by gjc at 5:09 PM on August 19, 2011

Nth-ing the Joel On Software article that introp posted at the very top. I remember reading that a few years ago and finally understanding the importance of those character encoding clauses in HTML docs.

gjc, I think you need to read that, to get an understanding of why this is actually a complicated question. There's nothing wrong with your answer, but I think it was aiming too low.
posted by intermod at 8:02 PM on August 19, 2011
