Character Encoding Explanations for the Non-Developer
November 9, 2010 1:40 PM

How do I explain character encoding issues well to very non-technical non-developers?

We are working on a system which, as one of its functions, incorporates a large amount of older content (we're talking static HTML from 1998 all the way to the present day) into a newer (UTF8-based) database/web application, for archiving and searching purposes. A variety of different languages are represented. The problem is that the older content seems to have been written in a variety of older character encodings (not to mention different customized HTML formats, but that's a different issue...) and getting it into UTF8 has been spotty. In some cases it's been run through the wringer enough times that it's pre-garbled, and our encoding-fixing scripts don't clean it well, and we have to ask the clients for new copy. Of course, even running the scripts is often a pretty manual process as we test to see which encoding we are starting with so that we can finally get some good UTF8 output if at all possible.

For other, newer content that the system supports, the clients and people they have been working with keep entering stuff that they've first written in MS Word or whatever and then pasted in (argh), which seems to do all kinds of lovely things with the encoding. That's another issue, and we've written some warnings on the input forms to let them know it'd be better if they didn't do this, and we are trying to figure out the best way to filter it, but it's all a big pain.

I'm the only developer on the project and my encoding knowledge, while growing, is not amazingly deep; and it also seems to just be one of those things that is tough to solve (or I just need to be more educated on this I guess). But the real point of this question is, how do I explain to them that just because it used to work fine in 1999 when we were using this static HTML and now it doesn't look right doesn't mean that we are moving backwards? How do I help them understand that we're actually progressing nicely by incorporating everything in one system that should (eventually) be able to handle all the different languages they use with ease? They are so impatient and don't seem to get our explanations, partially because they are made nervous by technology, and partially because we are explaining badly.

So: anyone have any good web resources for explaining encoding to non-developer types? I've only been able to find developer-targeted docs as of yet. Also, any particular ways of describing it or looking at it would be welcome.

Thanks folks!
posted by dubitable to Technology (22 answers total) 6 users marked this as a favorite
The FAQ for the Universal Encoding Detector has a pretty good layman's-level explanation of character encoding. BTW, it's almost magical how well it works; you should be using it if you aren't already!
posted by zsazsa at 1:57 PM on November 9, 2010

Response by poster: It's Python, that's why I haven't touched it. I've been doing all this in PHP, which was an early requirement (blech), and other stuff in Ruby when I can (nice language, but encoding support has been spotty). I'll check it out, thanks zsazsa!
posted by dubitable at 2:02 PM on November 9, 2010

Best answer: That FAQ is an okay start since it keeps it short, but here's where you'll lose the non-techie audience: "So you can think of the character encoding as a kind of decryption key for the text." It's not that this is a difficult concept, but by using this you've just explained something they don't understand by referencing something else they don't understand.

I work as the liaison between developers and non-techies. What the clients really want is a clearly-stated, confident answer, so that's what you must give them - not a dissertation. Here's my general approach for common tough topics: I come up with a standard short explanation and long explanation for the issue. The short answer must be 3 sentences or fewer. The long one should probably be under 10 sentences. The reasons you have to trim them down so much are a) attention span issues (if the audience stops listening/reading, they won't understand) and b) forcing yourself to leave out lots of details that you think are really critical, but aren't actually important for the audience.

So spend 10 minutes dreaming up those standard explanations. Maybe email them to the next exasperated client. Run through the explanation a few times in your head, make it shorter and remove jargon, and then start using it aloud. You'll know what works by the relief or concern you hear coming from the other person.

I'm not a character encoding expert, but here's my go at a short explanation: "Computers have to store each letter as a series of ones and zeros. Different operating systems/platforms used incompatible ways of storing each letter, and our database is using the most up-to-date and universal method of storing characters. So when we receive your content, especially content that is more than 5 years old, our database doesn't necessarily understand it right away, and we have to try to translate the characters into the new universal character set."

The long explanation is basically that, plus a brief explanation of how you're using scripts ("computer programs" to the non-techies) to translate/fix the encoding.

Also: if you know much of the content will have encoding issues, why on earth are you running your scripts on the original content? Why aren't you saving the originals and using a copy? That's a problem right there, though I'm probably overlooking something about why that's necessary.
posted by Tehhund at 2:16 PM on November 9, 2010 [1 favorite]

This article - "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)" from Joel on Software is the best article I have seen yet on character encoding and why it exists and is written in very plain language. Taking this approach and distilling down for non-developers has worked very well in my company where we have done similar training.

I know you mentioned non-developer articles but wanted to be sure you had seen this one.
posted by Disco Moo at 2:19 PM on November 9, 2010 [1 favorite]

Don't. Don't try to explain it to them. They don't need to understand encoding issues. That's what they have you for. Life's too short for them to understand character encoding issues.

You can make it easier on yourself: Give them a template UTF8 word doc(x) and require that their content follow that template. This template is just a doc saved as UTF8 (save as->tools->web options->encoding). You might add some fields such as author/date and make it all workflow/archival-looking. That way, you're starting with UTF8 and any surprises will be seen in the doc they submit. If they submit docs without noticing the problems, you'll still have to troubleshoot, but... have them fix the problems, in their doc, upstream of your process.
posted by at at 2:24 PM on November 9, 2010

Best answer: Computers think of everything in terms of numbers. Text, images, video, music—it's all just a long sequence of numbers to a computer.

Recall the simple substitution ciphers you used as a kid: A=1, B=2, C=3, etc. This is exactly how computers represent text.

However, people have used a bunch of different ciphers over the years. In one cipher, A=1. In another, A=65. In a third, A=7080.

Not only that, but different ciphers include different sets of characters. Some only include the letters, numbers, and punctuation marks most commonly used in English. Others include additional characters that are useful for foreign languages. Some even include exotic stuff like mathematical symbols and Greek letters. Each cipher is useful in different situations.

So, in order to correctly interpret a piece of text (which, remember, is just a sequence of numbers to a computer), the computer has to know which of these many ciphers was used to encode it. And that's sometimes ambiguous.

Now, there's a certain amount of overlap between the different ciphers. For example, two different ciphers might use the same numbers to represent the letters A through Z—but use completely different numbers to represent the various punctuation marks.
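To make the overlap concrete, here is a small Python sketch (Python is an editorial choice, not something the thread's project uses). It treats a few real encodings as the "ciphers" above: ASCII, IBM's EBCDIC (`cp037`), Windows-1252 (Western European) and Windows-1251 (Cyrillic). The variable names are just for illustration.

```python
# The same letter under two different "ciphers" (encodings).
a_in_ascii = "A".encode("ascii")[0]    # 65
a_in_ebcdic = "A".encode("cp037")[0]   # 193 (IBM EBCDIC)

# Overlap: byte 65 means "A" in both Windows-1252 (Western European)
# and Windows-1251 (Cyrillic)...
same_letter = b"A".decode("cp1252") == b"A".decode("cp1251")  # True

# ...but byte 233 means something different in each one.
western = bytes([233]).decode("cp1252")    # 'é'
cyrillic = bytes([233]).decode("cp1251")   # 'й'
```

So a page of plain English survives a wrong guess about the encoding, while accented or non-Latin text is exactly where the gibberish shows up.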

It's a bit like two people who speak related but distinct languages—Spanish and Portuguese, say—trying to communicate. They'll understand some of the words, but others will come out as gibberish.

Old content that was created during the frontier days of the web—the late 90s and early 2000s—tends to be particularly garbled. Back then, people were just beginning to recognize the problem, and it was often the case that one piece of software was speaking Spanish when another was expecting Portuguese (figuratively, of course). Some documents have been garbled more than once, in different ways, as they were edited in different programs.
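The "garbled more than once" case can be reproduced in a few lines. This is a minimal Python sketch that assumes one specific mix-up (UTF-8 text misread as Windows-1252, twice); real documents may have gone through other combinations, and `undo_one_round` is a made-up helper name.

```python
original = "café"

# Software A saves the text as UTF-8; software B wrongly reads it as Windows-1252.
garbled_once = original.encode("utf-8").decode("cp1252")        # 'cafÃ©'

# The garbled text gets saved and misread the same way a second time.
garbled_twice = garbled_once.encode("utf-8").decode("cp1252")   # 'cafÃƒÂ©'

def undo_one_round(text):
    """Reverse one UTF-8-read-as-Windows-1252 mix-up."""
    return text.encode("cp1252").decode("utf-8")

repaired = undo_one_round(undo_one_round(garbled_twice))        # 'café'
```

The repair only works when you know the exact chain of mistakes and nothing was lost along the way. If some program replaced an unknown character with "?" at any step, that character is gone, which is when you end up asking the client for new copy.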

Yes, it's a mess.

So, a while ago, some computer scientists got together and invented a cipher called Unicode. Unicode solves the problem—too many different ciphers!—by providing one cipher which can represent any character, in any language, ever.

The Latin alphabet, Greek symbols, Japanese hiragana, Cyrillic, Egyptian hieroglyphics—even musical notation, alchemical symbols, and smiley faces. Unicode does everything. It eliminates the need for all these different ciphers—and all the confusion that comes with them.
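As a rough illustration of the "one cipher for everything" idea, the Python sketch below looks up a handful of characters from different writing systems using the standard `unicodedata` module. The sample characters and the `describe` helper are chosen here for illustration only.

```python
import unicodedata

# One character each from several of the writing systems mentioned above.
samples = ["A", "Ω", "あ", "Я", "♪", "☺"]

def describe(ch):
    """Return the character's Unicode code point, official name, and UTF-8 bytes."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)} = {ch.encode('utf-8').hex()}"

descriptions = [describe(ch) for ch in samples]
# descriptions[2] is 'U+3042 HIRAGANA LETTER A = e38182'

# All of them can sit in one string and survive a round trip through UTF-8.
mixed = "".join(samples)
round_trip_ok = mixed.encode("utf-8").decode("utf-8") == mixed  # True
```

Every character gets exactly one number no matter which language it comes from, which is what lets a single database hold all of the client's languages side by side.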

But to reap those benefits, it's first necessary to translate old documents, which were encoded with all those different ciphers, into Unicode.

Yeah, this is probably more in-depth than they really need. Sometimes, when you know you're doing the best thing for the client, and they're just not getting it, you have to fudge the answer and give them the spirit of the thing, if not the literal truth. Something that will satisfy them enough to let you implement the damn thing already. Perhaps you can pick the elements above that will be most relevant/understandable to your client, and distill that down to a three-sentence pitch.
posted by ixohoxi at 2:25 PM on November 9, 2010 [1 favorite]

Best answer: "The new system requires all content to be migrated to a format called Unicode. All of our existing content is in different formats, none of them Unicode. Converting the existing content to the new format is non-trivial, because there are so many different formats in use. We have two problems: first, how are we going to migrate the existing content to the required format, and second, how are we going to ensure all future content uses the new format?"

In short, don't explain encoding to them -- explain the impact on their business/processes.
posted by davejay at 2:30 PM on November 9, 2010 [2 favorites]

Response by poster: Also: if you know much of the content will have encoding issues, why on earth are you running your scripts on the original content? Why aren't you saving the originals and using a copy? That's a problem right there, though I'm probably overlooking something about why that's necessary.

Tehhund: sorry I'm not being clear; we have plenty of copies that we use to process. Of course we have backups of the originals of all of this, which is basically random copies of old web sites (many of which have been copied and processed by other developers at various points...). Also, some stuff is in a database already (the newer stuff we've built for them) whereas some is in static HTML which we have to parse and process the output of. I hope that makes it a bit more clear. I'm being in part vague because I want to keep the project as anonymous as possible, and it's rather distinctive once I start getting into it.

Disco Moo: yep, that one comes up in most any character encoding search, and it's a great article, it's helped me certainly!

at: for various reasons your suggestions are not practicable. One of the main issues is just explaining to them that the work has to be done in the first place. Because this costs them money, at least a little bit of explaining has to be done. And as far as the Word doc idea is concerned, it's not that they submit a doc; it's that some folks use Word (or who knows what) to format their "posts" before they submit them to our web app, which is the core of what we are doing for them, whereas others just directly enter stuff in the web app (as we would like them to). But thanks regardless, I definitely see where you're coming from, and in another situation this would be good advice to follow.

ixohoxi, Tehhund, davejay, I like what you've given me here, good stuff. Keep it coming folks, thanks!
posted by dubitable at 2:38 PM on November 9, 2010

Best answer: Oh, and if they ask why you can't just use the old formatting: "You want searching, and searching through all the content requires the content to share a common format. You want multi-language support in one system, and supporting multiple languages within one system requires the content to share a common format. If you want the search feature, and support for all languages, on this system, we cannot avoid migrating to a common format -- and the standard format to achieve this is Unicode."
posted by davejay at 2:42 PM on November 9, 2010

Best answer: Try this:

Computers use binary numbers to represent everything, including letters. Back in the old days, in the USA, the guys building computers needed to come up with a standard way of representing letters, and they invented ASCII, so an "a" is "01100001". This system could only represent 128 possible characters--uppercase and lowercase letters, numbers, some punctuation, and some control characters. It didn't even try to represent accents. So ASCII is an encoding system.

Obviously ASCII is a problem if you want to represent French or German or Spanish, where they do need those accents. And it doesn't do you any good for Russian. And forget about Japanese or Chinese, where you need to be able to represent thousands of characters.
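For anyone who wants to see the limit first-hand, here is a short Python sketch. ASCII defines 128 codes (0 to 127), and anything outside that set simply cannot be encoded. The helper name `fits_in_ascii` is made up for this example.

```python
# ASCII assigns each character a 7-bit number; here is the lowercase "a".
a_number = ord("a")                 # 97
a_bits = format(a_number, "08b")    # '01100001'

def fits_in_ascii(text):
    """Return True if every character in the text exists in ASCII."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False

plain = fits_in_ascii("resume")      # True
accented = fits_in_ascii("résumé")   # False
russian = fits_in_ascii("привет")    # False
```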

So the programmers in those countries came up with their own ways of representing characters. And sometimes different software companies came up with different ways to represent the same set of characters. Generally their encoding systems built on top of ASCII, but allowing for more characters. It was an electronic Babel. Aside from the fact that you needed to know how a page was encoded in order to decode it properly, it made it a real problem to mix different writing systems in one document.

Smart people recognized that this was a problem, and there were a few projects to create the One True Encoding System, Unicode being one of them, and the one that won. But in the early days of the Web, we were still stuck with the Babel of different encodings, because Unicode was still pretty new and not many computers could interpret it. So there are all these documents out there in Big-5 and Shift-JIS and ISO-Latin-1, etc.

Things have improved enormously, and now computers do support Unicode. Those old encoding systems are like appendixes that we need to get rid of before they burst. With the old documents, it's not always even clear what encoding system a document uses. So we need to clear that up and harmonize everything.
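When the encoding of an old document is unknown, one common approach is trial decoding: try the strict candidates in order and keep the first that works. The Python sketch below assumes a hypothetical candidate list (`CANDIDATES`) and a made-up function name. It is a heuristic, not a detector: a successful decode only proves the bytes are legal in that encoding, not that the guess is right.

```python
# Hypothetical candidate list; order matters, strictest encodings first.
CANDIDATES = ["utf-8", "shift_jis", "big5", "cp1252"]

def guess_decode(raw, candidates=CANDIDATES):
    """Return (text, encoding) for the first candidate that decodes cleanly."""
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # ISO-Latin-1 maps every byte to something, so it never fails;
    # treat this result as "unknown, needs a human to look at it".
    return raw.decode("latin-1"), "latin-1"
```

For example, `guess_decode("日本語".encode("shift_jis"))` comes back tagged as `shift_jis`, because those bytes are not valid UTF-8. Short or mostly-English pages often decode cleanly under several candidates, so results still need spot checks by someone who reads the language.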
posted by adamrice at 2:55 PM on November 9, 2010 [1 favorite]

I'm sorry, but you'll never succeed in explaining this problem to your users. And they'll definitely never take responsibility for it, no matter how much you wish and plead.

I've internationalized several projects, and I have only once received a translation document that wasn't garbled to fuck and back by being run through MS Word. And that document was prepared by a professional technical translation and localization service, where the translators all understood character encoding. Every other localization, whether submitted from France or from China, was some fucked up Microsoft native character set.

You won't win this battle. People will disregard your explanations and your instructions. Unless the people you're working with are geeks, you cannot expect that they will listen to even the most patient and clear explanation of character encoding. They will continue to use MS Word, even when it's strictly forbidden at any point in the work flow. Asking them to use Textpad (or some other UTF-aware editor) will be greeted with roughly the warmth of a popsicle. For 99.9% of the computer-using population of the planet, even the ones who are computer literate, text is text is text, and you use Word to edit it, dumbass.

Your efforts are far better spent in devising technical schemes to detect and convert encoding (you know about the *nix 'iconv' program, right?). Alternatively, you can devote manpower to handling your encoding issues.
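For reference, the job iconv does (roughly `iconv -f SHIFT_JIS -t UTF-8 old.html > new.html`, with hypothetical file names) can be sketched in a few lines of Python. This assumes you already know the source encoding; the function name and paths are illustrative.

```python
from pathlib import Path

def convert_to_utf8(src, dst, source_encoding):
    """Re-encode one text file as UTF-8, leaving the original untouched.

    Decoding is strict, so a wrong source_encoding raises UnicodeDecodeError
    instead of silently writing garbage.
    """
    raw = Path(src).read_bytes()
    text = raw.decode(source_encoding)
    Path(dst).write_bytes(text.encode("utf-8"))
    return dst
```

Working on bytes rather than opening the file in text mode keeps line endings exactly as they were, so the only thing that changes between the two files is the encoding.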
posted by Netzapper at 4:26 PM on November 9, 2010

Everyone's talking about Word rather as if its workings were a secret. I find that 99% of the time, if you've got "garbled" text or "special word characters" or "dumb quotes" or whatever people say who haven't read the Joel article, it's because you've got Windows-1252 encoding. It's not an encoding any of us likes, but what the hell, it's an encoding.
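Here is what that looks like at the byte level, as a minimal Python sketch. The sample bytes are invented for illustration, but the byte values are the standard Windows-1252 ones for curly quotes, the curly apostrophe and the en dash.

```python
# Bytes as a Windows program might hand them over: curly quotes around a
# word, an en dash, and a curly apostrophe, all in Windows-1252.
pasted = b"\x93Hello\x94 \x96 it\x92s me"

as_cp1252 = pasted.decode("cp1252")   # '“Hello” – it’s me'

try:
    pasted.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False                # these bytes are not valid UTF-8

# Once decoded with the right encoding, re-encoding as UTF-8 is lossless.
as_utf8_bytes = as_cp1252.encode("utf-8")
```

Nothing is actually "garbled" in the input. The text is fine as long as whatever receives it knows it is Windows-1252 and converts it before storing it as UTF-8.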
posted by AmbroseChapel at 5:14 PM on November 9, 2010

Response by poster: Netzapper, your comment reads as borderline hostile. Perhaps you didn't intend this, but at the least there is no need to call me "dumbass."

I recognize that in the end, people will do what people will do. However, it is important that we at least attempt to explain this to them for our own "ass-protection" reasons if nothing else. I am, in fact, aware of the iconv utility, and our scripts use that as well as other means to deal with the encoding issues we've come up against. However, this project has a limited budget, I have a limited amount of time to work on it, and assuming that our users are not going to be able to understand this at all is, actually, more of a waste of our time than giving them something to remind them we cannot deal with all of these problems for them all the time. And if we're going to bother doing such a thing we might as well not assume they are incapable of understanding it; that would be pretty cynical. And in fact, in other areas of the application, the instructions we've provided ("Please don't cut and paste directly from Word or another application, but write your answer directly into the form.") have actually helped.

So, while I understand the angle you're coming from, this is not the sort of response I'm looking for. Thank you anyways!

Otherwise, thanks folks, great stuff! I think I'll probably pick and choose from things here and there. davejay, I especially liked your part about the searching; I think that will really resonate.
posted by dubitable at 5:16 PM on November 9, 2010

Unfortunately, I'm going to have to mostly agree with Netzapper, although for some reason underexplaining while being a little frantic seems to have good results. Do not use the word "Unicode" if you can possibly help it.

"Word is just weird" works with some people, I think partially because a lot of people have run into Word being weird in one way or another. I was overly amused to read a post from someone on an intranet saying "Please do not copy/paste from Word as [epersonae] has promised me it will unleash gremlins that will consume our site."

Talking about how some computers have problems with "special" characters has worked; for some reason people get the curly quote issue better than most other things. "See that thing there? Yeah, that's what breaks stuff."

On the tech side, I've had pretty decent experiences with "paste from Word" buttons in WYSIWYG editors; they at least have the benefit of being easier to train with than getting people to open Notepad.
posted by epersonae at 5:16 PM on November 9, 2010

Response by poster: AmbroseChapel: It's not an encoding any of us likes, but what the hell, it's an encoding.

Yeah, you're totally right, and we are going to try dropping in a filter to deal with this. I'm concerned it won't be enough though, considering all the languages we have people working with; it's not just English and European languages but also Asian languages, Russian, etc. etc.

Also, I want to de-emphasize the Word thing which some folks are focusing on a bit: the bigger issue is actually the metric ton of ancient web pages we are only going to be able to do a crap job of working through unless these folks can pay us more.
posted by dubitable at 5:19 PM on November 9, 2010

Best answer: I managed the modernization into UTF-8 this year of a Web application that was backed by a database containing 5 GB of undifferentiated UTF-8, Shift-JIS, and CP1252 pasted in from Microsoft Word – so I feel your pain.

The users' expectation is driven by what's on screen, and from their point of view it's just some bullets and Wingdings that looked fine when they clicked "Submit" so it's your fault — and very embarrassing — when they get rendered as multiple ASCII characters on the Web site.

Note that you probably have not only CP1252 → Unicode issues, but also issues with illegal Wingdings characters. (No, encoding isn't your only problem.) So my explanation focused on why we "can't just filter out the weird Word characters and replace them", and the answer is that once you have clean UTF-8 you can reliably do so — but you can't do it with mixed UTF-8 and CP1252, so first you have to clean up the data.
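A small Python sketch of why the cleanup has to come first. It assumes only two encodings are mixed (UTF-8 and CP1252) and uses a common heuristic, strict UTF-8 first with CP1252 as the fallback. The `to_text` helper and sample rows are hypothetical, not the approach actually used on that project.

```python
def to_text(raw):
    """Decode one stored value: strict UTF-8 first, Windows-1252 as the fallback."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # errors="replace" covers the few byte values Windows-1252 leaves undefined.
        return raw.decode("cp1252", errors="replace")

# The same curly-quoted phrase, stored two different ways in one table.
rows = ["\u201cHi\u201d".encode("utf-8"), "\u201cHi\u201d".encode("cp1252")]

# A byte-level filter written for one encoding misses the other one.
naive = [row.replace(b"\x93", b'"').replace(b"\x94", b'"') for row in rows]
# naive[0] still contains the UTF-8 curly quotes; only naive[1] was cleaned.

# After normalising to text, one filter works on every row.
clean = [to_text(row).replace("\u201c", '"').replace("\u201d", '"') for row in rows]
# clean == ['"Hi"', '"Hi"']
```

A byte-level filter can also do damage, since the byte it is hunting for may be the middle of a multi-byte UTF-8 character in another row. The fallback is not foolproof either: a short CP1252 value can occasionally happen to be valid UTF-8, so spot checks are still worthwhile.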

clients and people they have been working with keep entering stuff that they've first written in MS Word or whatever and then pasted in

To make browsers send you UTF-8 even when another encoding is pasted into a form, add accept-charset="UTF-8" to the form element. You should also send the form page as UTF-8 — in Apache you'd add AddDefaultCharset utf-8 to the server configuration.

Then you can filter out the Wingdings (I'll memail you a regex), preferably at rendering time (it's best to store the user's unaltered input in the database for a number of technical, business, and legal reasons). This should fix your problem with pasted-in text that was formatted in Word.
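The regex mentioned above isn't shown in the thread, so the following Python sketch is only a guess at the general idea. It assumes the symbol-font characters arrive as Unicode Private Use Area code points (category "Co"), which is how pasted Word bullets commonly show up. `SYMBOL_MAP` and `clean_for_display` are hypothetical names, and the one-entry mapping is purely illustrative.

```python
import unicodedata

# Hypothetical mapping, for illustration only: Word's standard bullet is
# commonly seen as U+F0B7 after pasting. A real table would need many entries.
SYMBOL_MAP = {"\uf0b7": "\u2022"}  # private-use bullet -> real bullet

def clean_for_display(text):
    """Map known symbol-font characters and drop any other private-use ones."""
    out = []
    for ch in text:
        if ch in SYMBOL_MAP:
            out.append(SYMBOL_MAP[ch])
        elif unicodedata.category(ch) == "Co":  # "Co" = private use
            continue
        else:
            out.append(ch)
    return "".join(out)
```

Calling this when the page is rendered, rather than before saving, matches the advice above to keep the user's unaltered input in the database.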
posted by nicwolff at 5:58 PM on November 9, 2010

Best answer: the bigger issue is actually the metric ton of ancient web pages we are only going to be able to do a crap job of working through unless these folks can pay us more

I've found that patiently explaining the basic technical issue is the only way to make people happy. And it shouldn't be that difficult to do so without a lot of new jargon:
"When the site was separately maintained static HTML pages, each page could use its own encoding for its text. Now that all the pages are coming from one database, and the pages are built from templates and share much of their text, they all have to use the same encoding, and therefore all the static HTML must be re-encoded."
And then get into the costs of re-encoding.
posted by nicwolff at 6:12 PM on November 9, 2010

Response by poster: nicwolff, these are some great tips, I really appreciate it. Thanks, I'd love it if you sent me that regex.
posted by dubitable at 6:16 PM on November 9, 2010

There are three or four different reasons why this is going to be hard for non-techie people to understand.

I always just say, "There's a lot of invisible stuff that you don't see, and the invisible stuff from 1999 doesn't translate well today. Just copy and paste everything into Notepad, then copy and paste it from Notepad into the window here."
posted by ErikaB at 7:27 PM on November 9, 2010

nicwolff, I wasn't calling you a dumbass. I was mirroring the sort of response I get from users on a regular basis. Perhaps, I should have put it in quotes. But, for most people, suggesting that they not use Word for editing text is like suggesting they tighten a bolt with a hammer.
posted by Netzapper at 9:13 AM on November 10, 2010

You've got the wrong guy but apology accepted on dubitable's behalf :) I totally agree w you Netzapper, having been through exactly this project before: users will paste in crap from Word no matter what you do, and it's our job to accommodate that. But it takes some doing, because encoding is only ⅔ the fight, you also have to map Wingdings to matching Unicode glyphs.
posted by nicwolff at 11:12 AM on November 10, 2010

Response by poster: Haha...that was a funny exchange you two.

Thanks for the apology and clarification Netzapper; I realize now what you intended.
posted by dubitable at 12:08 PM on November 10, 2010
