Eliminating odd characters from web site?
May 15, 2007 6:56 AM

I have a client (I make web sites) who emails me stuff to put on their site. Invariably it's Word or RTF files or just straight emails. Often, when I cut and paste their content, quotation marks and other punctuation turn into whacked-out characters online. It drives me batty to go through these (dozens of pages long) documents every day to replace their 's with my own 's (which don't get all whacked out online). What can I tell them to do on their end, or what can I do on mine, to eliminate this problem? Them --> Windows. Me --> Mac. Site --> Linux.
posted by Manhasset to Computers & Internet (14 answers total) 6 users marked this as a favorite
 
I always take any content given to me and just do a find and replace in Text Edit or BBEdit to get rid of the funky stuff.

trusts no one....
posted by gomichild at 7:06 AM on May 15, 2007


In Windows, you can save as .RTF with the ASCII charset, which should convert the funky stuff to normal stuff.
posted by markesh at 7:15 AM on May 15, 2007 [1 favorite]


I usually paste to a text editor (EditPad Pro is my personal favorite) first and then copy and paste from there. EditPad has been really good about not doing funky things to even the funkiest of Windows formatting, better than Notepad or some other text editors.
posted by Lyn Never at 7:17 AM on May 15, 2007


If you have Word, a find/replace for each offending character can be placed into a macro. OpenOffice might have the same capability.

It would help if we knew what your workflow is like. You get Word, RTF, and emailed updates. What do you use to post them on the site? Basic HTML? CSS+HTML? Dreamweaver? A PHP CMS?

A collection of macros to replace italics, paragraphs, indented portions, and other formatting with HTML/CSS coding will probably give you the most control if you're making HTML files from scratch.

On preview: markesh is on the right track.
posted by cowbellemoo at 7:21 AM on May 15, 2007


Have them 'Save as Web Page' before they send it to you.

Then, attack it with these two tools:

Demoronizer, correct moronic and gratuitously incompatible HTML generated by Microsoft applications

and...

Tidy, HTML beautifier and validator.

Running MS crap through these filters usually gives me good results.
posted by unixrat at 7:26 AM on May 15, 2007 [1 favorite]


Any solution that requires the client to follow a specific procedure is destined to fail at some point.

I'd recommend BBEdit and its Automator actions. There are lots of character conversion actions.
posted by mkultra at 7:39 AM on May 15, 2007


BBEdit's "Convert to ASCII" command (in the Text menu) will catch most, if not all, of these cases.
posted by jjg at 7:41 AM on May 15, 2007


You need to make sure your entire workflow uses UTF-8, then you won't get 'whacked out characters'. So check the Word files are UTF-8, use a UTF-8-compatible text editor, specify UTF-8 as the charset of the page (in both the HTML and HTTP headers), and everything should be fine.
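
To see why a charset mismatch produces the junk, here is a minimal Python sketch. It assumes the text was saved as UTF-8 and then read as Windows-1252 somewhere along the way, which is only one of several possible mismatches; the variable names are made up for illustration.

```python
# Sketch: what happens when UTF-8 bytes are read with the wrong charset.
original = "It\u2019s here"            # contains a curly apostrophe (U+2019)
utf8_bytes = original.encode("utf-8")  # the apostrophe becomes three bytes

# Wrong: treat the UTF-8 bytes as Windows-1252 (assumed here as the mismatch).
garbled = utf8_bytes.decode("cp1252")
print(ascii(garbled))                  # the apostrophe is now three junk characters

# Right: decode with the same charset that was used to encode.
restored = utf8_bytes.decode("utf-8")
print(restored == original)

# The value a page would declare so browsers decode it as UTF-8.
CONTENT_TYPE = "text/html; charset=utf-8"
```

If every step (editor, file, HTTP header, meta tag) agrees on UTF-8, the second path is the one you get.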

It used to be easier to use ISO-8859-1 for pages and encode anything 'awkward' as HTML entities, but nowadays it makes sense to fully embrace Unicode.
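
For the older entity approach, a rough Python sketch might look like this. The function name to_entities is hypothetical, and it only converts non-ASCII characters; it does not escape &, < or >.

```python
def to_entities(text: str) -> str:
    """Return text with every non-ASCII character as a numeric HTML entity."""
    # "xmlcharrefreplace" swaps each unencodable character for &#NNNN;
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")

sample = "\u201cHello\u201d it\u2019s fine"
print(to_entities(sample))  # &#8220;Hello&#8221; it&#8217;s fine
```

The result is pure ASCII, so it survives whatever charset the page ends up being served with.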
posted by malevolent at 8:25 AM on May 15, 2007


Funny, I just had to do this last week. This is a simple PHP version that worked for me.

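function clean_ms($texz) {
    // Strips backslashes twice, as in the original. This only makes sense if
    // the input arrives escaped (for example by magic quotes); otherwise drop it.
    $texz = stripslashes(stripslashes($texz));

    // UTF-8 byte sequences (octal escapes) mapped to plain ASCII.
    // Built as one array instead of calling reset() on undefined variables.
    $map = array(
        "\342\200\176" => "'",   // not a valid UTF-8 sequence; kept from the original
        "\342\200\177" => "'",   // not a valid UTF-8 sequence; kept from the original
        "\342\200\230" => "'",   // U+2018 left single quote
        "\342\200\231" => "'",   // U+2019 right single quote
        "\342\200\232" => ',',   // U+201A low single quote
        "\342\200\233" => "'",   // U+201B reversed single quote
        "\342\200\234" => '"',   // U+201C left double quote
        "\342\200\235" => '"',   // U+201D right double quote
        "\342\200\041" => '-',   // not a valid UTF-8 sequence; kept from the original
        "\342\200\174" => '-',   // not a valid UTF-8 sequence; kept from the original
        "\342\200\220" => '-',   // U+2010 hyphen
        "\342\200\223" => '-',   // U+2013 en dash
        "\342\200\224" => '--',  // U+2014 em dash
        "\342\200\225" => '--',  // U+2015 horizontal bar
        "\342\200\042" => '--',  // not a valid UTF-8 sequence; kept from the original
        "\342\200\246" => '...', // U+2026 ellipsis
    );

    return str_replace(array_keys($map), array_values($map), $texz);
}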
posted by lubujackson at 8:27 AM on May 15, 2007


Best answer: If you're on a Mac, the easiest solution is the Webfrog widget.

Although listed at Apple.com, Kelibo is no longer there so I put Webfrog on my server. Enjoy.
posted by nessahead at 9:12 AM on May 15, 2007


I use the 'Transliterate to ASCII' command in TextMate.
posted by chrismear at 10:16 AM on May 15, 2007


It's also easy to do with a simple Perl script on the server, if you have SSH access. I've done it myself; you could probably achieve it in fewer than 15 lines of code. Gotta love regular expressions!

Sorry I don't have the code handy right now. You have the PHP example, anyway.
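
In the same spirit, a regular-expression version could be sketched roughly like this. It is in Python rather than Perl, and it is not the script mentioned above; the REPLACEMENTS table and the clean name are made up for illustration, and the table would need extending for whatever characters actually show up.

```python
import re

# Hypothetical mapping of "smart" punctuation to plain ASCII.
REPLACEMENTS = {
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2013": "-", "\u2014": "--",  # en dash, em dash
    "\u2026": "...",                # ellipsis
}

# One character class that matches any key in the table.
PATTERN = re.compile("[" + "".join(REPLACEMENTS) + "]")

def clean(text: str) -> str:
    """Replace each matched character with its ASCII stand-in."""
    return PATTERN.sub(lambda m: REPLACEMENTS[m.group(0)], text)

print(clean("\u201cDon\u2019t\u201d \u2013 wait\u2026"))  # "Don't" - wait...
```

This assumes the text has already been decoded to Unicode correctly; if it was read with the wrong charset, the patterns won't match.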
posted by amtho at 10:59 AM on May 15, 2007


The cheap and easy way I do this at work (well, not as easy as macros and Perl, but it works) is to copy the whole document, paste it into Windows Notepad, then copy and paste into a new Word doc. Depending on how messed up the text is (and I do mean messed up in the worst sense), I sometimes "clean" the text in Notepad before transferring it back to MS Word.

I use OpenOffice.Org at home. Slowly trying to get everyone here to convert.
posted by Lizc at 2:38 PM on May 15, 2007


Response by poster: Awesome. Thanks all, especially nessahead!
posted by Manhasset at 9:26 AM on May 17, 2007

