How to replace non-ascii characters with HTML entities?
February 27, 2013 8:32 AM

I’m currently working on an international project and need to format documents written in German, Polish and Croatian for use on the web. What is the quickest way to convert all the non-ASCII characters into the relevant character entity? Some sort of web form would be my ideal solution, but I’d settle for a freeware program.
posted by the latin mouse to Computers & Internet (8 answers total)
 
Not sure if this works, but if you're using Word, does exporting as an HTML document convert them for you?
posted by bitdamaged at 8:39 AM on February 27, 2013


Googling "convert to character entity" turns up a lot of simple web tools that will do this, but the best solution is simply to declare the web page as UTF-8 encoded and keep the characters unconverted. 間も無く is a lot easier to read than &#x9593;&#x3082;&#x7121;&#x304f;
posted by adamrice at 8:43 AM on February 27, 2013 [1 favorite]


Seconding adamrice; please try to keep everything in UTF-8, from creation and editing to publishing. It adds a few quirks and points to watch, but solves more problems than it creates. I used to do multilingual publishing in pre-Unicode days, and it sucked.

HTML Tidy is the tool for cleaning up HTML and converting it between character encodings.
posted by scruss at 8:59 AM on February 27, 2013


Thirding the recommendation to use UTF-8 just in case you're tempted to do anything else. At worst you might need to carefully convert some oddly-encoded content to UTF-8, but once you've done that Unicode really does solve this problem. Make sure your whole workflow is UTF-8 friendly - text editors, web page headers/meta tags, etc.
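
For the "carefully convert some oddly-encoded content" step, a minimal Python sketch might look like the following. It assumes you already know the source encoding; cp1250 (a Windows code page commonly used for Central European text) is only a guess for illustration, and the function name is made up here.

```python
def reencode_to_utf8(raw: bytes, source_encoding: str = "cp1250") -> bytes:
    """Decode bytes from a known legacy encoding and re-encode them as UTF-8.

    source_encoding is an assumption: a wrong guess for a single-byte
    encoding usually does not raise an error, it just yields wrong letters,
    so check the result by eye.
    """
    text = raw.decode(source_encoding)  # strict: undefined bytes raise an error
    return text.encode("utf-8")


if __name__ == "__main__":
    legacy = "Łódź".encode("cp1250")  # stand-in for an old file's contents
    print(reencode_to_utf8(legacy).decode("utf-8"))
```

Have somebody who reads the language check a sample of the output before converting everything in bulk.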

Depending on the nature of the documents, you might be best off keeping them as plain-text-like as possible then processing them into the final HTML, e.g. via Markdown. Sometimes this can make getting people to make edits etc. easier & less error-prone, especially if you're dealing with languages you're unfamiliar with (and therefore less likely to spot any errors yourself).
posted by malevolent at 9:16 AM on February 27, 2013


I believe this is what you're seeking - Unicode Code Converter. Enter text into the "Mixed input" and hit convert. Hexadecimal NCRs or Decimal NCRs is what you're looking for.
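
If you'd rather script the same conversion than paste into a web form, here is a rough Python sketch. The function names are made up for illustration; the decimal version relies on Python's built-in "xmlcharrefreplace" error handler, and the hexadecimal version uses a regular expression over everything outside the ASCII range.

```python
import re


def to_decimal_ncrs(text: str) -> str:
    """Replace each non-ASCII character with a decimal NCR such as &#252;."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


def to_hex_ncrs(text: str) -> str:
    """Replace each non-ASCII character with a hexadecimal NCR such as &#xfc;."""
    return re.sub(
        r"[^\x00-\x7f]",
        lambda match: f"&#x{ord(match.group()):x};",
        text,
    )


if __name__ == "__main__":
    print(to_decimal_ncrs("Grüße"))
    print(to_hex_ncrs("Grüße"))
```

Note that this only touches non-ASCII characters; ampersands and angle brackets in the text are left alone.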

Or you can just do Unicode like others suggested.
posted by pyro979 at 11:13 AM on February 27, 2013


Response by poster: I'm not sure how UTF-8 fixes the problem. Could somebody please explain this to me as if I was an idiot?
posted by the latin mouse at 2:05 PM on February 27, 2013


There are lots of ways to encode characters. UTF-8 is one, and it may not be perfect (I don't think it covers Klingon or Sindarin), but it's pretty near universal.

When you set up a web page and declare that the characters on it are encoded in UTF-8, that means the characters will be transmitted and displayed in their original form, without being re-encoded as something else (like HTML character entities). That's the deal with Unicode (of which UTF-8 is the most popular encoding): if your software accepts Unicode, then it's promising that it won't break your text.

Basically, if you can paste or write your text in a plain text editor and have it look OK there—and pretty much any modern text editor will handle Unicode—then as long as you declare it as UTF-8 and save it as UTF-8, you're good (don't do something cute like save it as UTF-16).
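
As a sketch of "declare it as UTF-8 and save it as UTF-8", assuming you generate the page from a script: the template, function name and file name below are all made up for illustration. The two things that matter are the `<meta charset="utf-8">` declaration and the explicit `encoding="utf-8"` when writing the file.

```python
from pathlib import Path

# Hypothetical minimal page; the meta tag declares the encoding to the browser.
PAGE_TEMPLATE = """<!DOCTYPE html>
<html lang="{lang}">
<head>
<meta charset="utf-8">
<title>{title}</title>
</head>
<body>
<p>{body}</p>
</body>
</html>
"""


def write_utf8_page(path, title, body, lang="de"):
    """Write the page to disk as UTF-8, leaving non-ASCII characters as they are.

    Assumes title and body are already safe to drop into HTML
    (no stray ampersands or angle brackets).
    """
    html_text = PAGE_TEMPLATE.format(lang=lang, title=title, body=body)
    Path(path).write_text(html_text, encoding="utf-8")
    return html_text


if __name__ == "__main__":
    write_utf8_page("page.html", "Grüße", "Zażółć gęślą jaźń", lang="pl")
```

If your web server also sends a charset in its Content-Type header, make sure it says UTF-8 too, so it doesn't contradict the meta tag.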

Alternatively, if you're using a CMS for your project, and it was written in, oh, the past decade, it's probably using UTF-8 for everything already and you don't need to do anything special.

Sometimes you may find that documents that were written in MS Word will have a few hinky characters when you copy the text to something else. That's one thing to watch out for.
posted by adamrice at 4:37 PM on February 27, 2013


To further clarify the entities thing - if you use Unicode, you'll only need to entity-encode ampersand, less-than and greater-than (plus quotes for anything going into attributes), rather than everything non-ASCII.
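
A small Python sketch of that, using the standard library's `html.escape` (the two wrapper names are just for illustration):

```python
import html


def escape_body(text: str) -> str:
    """Escape &, < and > for text placed between tags; quotes are left alone."""
    return html.escape(text, quote=False)


def escape_attr(text: str) -> str:
    """Escape &, <, > and both quote characters for text placed in an attribute."""
    return html.escape(text, quote=True)


if __name__ == "__main__":
    print(escape_body("Müller & Söhne <GmbH>"))
    print(escape_attr('Żółw "szybki"'))
```

The umlauts and Polish letters pass through untouched; only the markup-significant characters change.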
posted by malevolent at 1:18 AM on February 28, 2013

