Best practices for editing RTL languages for the Web?
October 30, 2012 9:29 AM

What tools do you use to edit Persian for the Web, preferably, but not necessarily, on a Mac?

I need to be able to take a Word doc in Persian and accurately convert it to UTF-8 HTML. (I'd prefer to do this without retaining all the bloated MS Word markup, but am willing to compromise on that.)

I also need to be able to edit that HTML later, and it's this second task that's driving me bonkers. I'm using BBEdit, and when I try to select a piece of text in the HTML, my cursor behaves in a completely (to me) unpredictable way. I click here, the cursor shows up over there. I try to use the shift key with the arrow keys to select text, and the selection shrinks or grows in directions and increments that baffle me. (WYSIWYG editors are, unsurprisingly, far worse.) As a corollary, the order of the characters as I see it in the markup is often not the same as when it's seen in the browser.

Complicating circumstances: These texts often include snippets of English (ltr), and I do not speak or read Persian (possibly not really relevant, but I thought I should mention it so you could tailor your response to my level of ignorance).

posted by bricoleur to Computers & Internet (5 answers total) 2 users marked this as a favorite
In general, a text control that is displaying both ltr and rtl alphabets together is going to behave pretty strangely when you're moving the cursor around and selecting stuff. You may just be seeing the standard behavior.
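To make that a bit more concrete: the file stores characters in typing ("logical") order, and the editor or browser works out the on-screen order from each character's bidirectional class. Here is a minimal Python sketch, assuming you just want to inspect those classes; `bidi_classes` is a name I made up, and the sample string is only an illustration.

```python
import unicodedata

def bidi_classes(text):
    """Return the Unicode bidirectional class of each character, in stored (logical) order."""
    return [unicodedata.bidirectional(ch) for ch in text]

# Two Arabic-script letters, a space, then Latin "ok". Escapes are used so the
# source line itself is not visually reordered by the editor.
sample = "\u0633\u0644 ok"
print(bidi_classes(sample))  # AL = Arabic letter, WS = whitespace, L = left-to-right
```

Wherever the class switches between AL/R and L, the cursor is likely to jump, because the stored order and the displayed order stop matching at that boundary.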

Try experimenting with a web-based editor like CKEditor to see if it works any differently - here's its demo page - or just paste it into the text box here on MeFi.

An incredibly useful free tool for dealing with Unicode is the BabelPad text editor (for Windows, unfortunately). It's letting me insert a "start of left-to-right override character" that forces rtl alphabets to be handled as though they're ltr so that the interface works the way you're used to, but of course that makes everything backwards to a native speaker of the language. However, even with the options set to what I would expect to show non-visible characters I am unable to see the override character, so that might cause problems if you can't remove it once you've added it.
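If invisible directional characters do end up in a file, a small script can at least report and remove them. This is a rough Python sketch rather than a BabelPad feature: the function names are hypothetical, and the set below is my assumption of which control characters are worth checking for.

```python
import unicodedata

# Unicode bidirectional control characters, invisible in most editors.
BIDI_CONTROLS = {
    "\u200e", "\u200f", "\u061c",                      # LRM, RLM, ALM
    "\u202a", "\u202b", "\u202c", "\u202d", "\u202e",  # LRE, RLE, PDF, LRO, RLO
    "\u2066", "\u2067", "\u2068", "\u2069",            # LRI, RLI, FSI, PDI
}

def find_bidi_controls(text):
    """List (index, code point, Unicode name) for every bidi control in text."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(text)
        if ch in BIDI_CONTROLS
    ]

def strip_bidi_controls(text):
    """Return text with all bidi control characters removed."""
    return "".join(ch for ch in text if ch not in BIDI_CONTROLS)
```

Running `find_bidi_controls` over a file's contents before publishing should show whether an override was left behind, and `strip_bidi_controls` removes it.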

Worst case, you could convert all of the Persian characters to HTML entities (BabelPad can do this automatically). I believe that would make everything work the way you expect in the source code while still displaying properly on the web, provided you're including dir="rtl" on the HTML tags or setting the direction in CSS. It would, however, effectively make the source uneditable for a native speaker without converting back.
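If BabelPad isn't available, the same conversion can be sketched in a few lines of Python. This assumes the HTML is already decoded to a string; `to_numeric_entities` and `from_numeric_entities` are names I made up for illustration.

```python
import html

def to_numeric_entities(text):
    """Replace every non-ASCII character with a decimal reference such as &#1587;."""
    # ASCII markup (tags, attributes, existing entities) passes through unchanged.
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")

def from_numeric_entities(text):
    """Turn character references back into the characters they stand for."""
    # Note: this also decodes named entities like &amp;, so run it on text
    # content rather than on whole documents if that matters to you.
    return html.unescape(text)
```

The first function leaves a pure-ASCII source that any editor handles left to right; the second is the "converting back" step for anyone who needs to read the Persian.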
posted by XMLicious at 10:38 AM on October 30, 2012

As you've learned, it appears BBEdit does not fully support RTL languages. From "Does BBEdit support Japanese/Chinese/Korean language editing?":
BBEdit supports opening and editing files written in most left-to-right writing systems, including non-Roman languages such as Japanese, Chinese, and Korean.

However, BBEdit does not support editing content in right-to-left languages such as Hebrew and Arabic. (You may also encounter inconsistencies when working with languages that routinely employ conjunct characters, such as Devanagari.)
I have edited non-English text character sets, including some RTL, but not enough to be battle tested. I did some searching and found these pages: Some Tips for a RTL Language on a Mac and Persian and Farsi. There are a number of interesting facts about Persian on the second page that will help explain some of the oddnesses you are seeing, for example:
The Arabic script has two features which make it unique in terms of encoding. One is that it is written from right to left (or RTL). The other feature is that the shapes of individual letters change forms depending on whether the letter is alone, at the beginning of a word, the middle of a word or at the end.

In order to process Arabic correctly, a software must be able to display text from right to left and make sure the letter forms are displayed correctly depending on their positions within a word. Unfortunately, there is incomplete implementation of creating correct letter forms in many software packages.
It's tricky when the languages are interspersed. I think you'll end up doing a lot of things like <span dir="rtl" lang="en">English Word</span> to override what I would guess would be an html tag like: <html dir="ltr" lang="fa">

And really, if you're not native or at least familiar with the languages involved, be sure to do a sample of this and show it to a native reader of both languages for a sanity check.
posted by artlung at 11:36 AM on October 30, 2012

I meant <html dir="rtl" lang="fa"> and <span dir="ltr" lang="en">English Word</span>

(right, left, what's the difference, ultimately?) :-P
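Adding those spans by hand gets tedious, so here is a rough Python sketch of automating it. It assumes the input is plain text (not HTML that already contains tags) and that the embedded English is plain ASCII letters; `wrap_ltr_runs` and the regular expression are my own illustration, not a standard tool.

```python
import re
from html import escape

# A run of ASCII letters that may continue across spaces, digits, and simple
# punctuation, but must start with a letter and end with a letter or digit.
LATIN_RUN = re.compile(r"[A-Za-z][A-Za-z0-9 .,'-]*[A-Za-z0-9]|[A-Za-z]")

def wrap_ltr_runs(text, lang="en"):
    """Escape plain text for HTML and wrap each Latin run in an ltr span."""
    parts = []
    pos = 0
    for match in LATIN_RUN.finditer(text):
        # Text between Latin runs is escaped and left in the page's direction.
        parts.append(escape(text[pos:match.start()]))
        parts.append(f'<span dir="ltr" lang="{lang}">{escape(match.group(0))}</span>')
        pos = match.end()
    parts.append(escape(text[pos:]))
    return "".join(parts)
```

The result is meant to sit inside an element that already carries dir="rtl" and lang="fa". It still deserves the native-reader check, since numbers and punctuation next to the English may need manual adjustment.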
posted by artlung at 12:52 PM on October 30, 2012

Thanks for the BabelPad link, XMLicious, that will come in handy. And thanks for the confirmation, artlung, that BBEdit actually doesn't even try to support RTL. I kind of figured that, but it's good to know it's not just user error.

For anyone in the future who might actually have a use for this question, here's a link to a white paper that details most, if not all, of the problems specific to presenting Persian on the Web. 2004, but still mostly applicable, I think.
posted by bricoleur at 10:56 AM on October 31, 2012 [1 favorite]

I forgot to mention - this guy seems to have gotten BabelPad running under Wine on Ubuntu Linux so maybe the same is possible on a Mac.
posted by XMLicious at 11:38 PM on October 31, 2012
