How do I automatically alphabetize a list containing accented characters?
July 29, 2010 9:59 AM

How do I automatically alphabetize a list of words that contains macrons (accented characters)?

I'm writing a Latin-English glossary for a Latin text. The Latin words sometimes contain characters with macrons (a horizontal line over the vowel indicating that it is long: ā). Right now, I have a list of the words and definitions in the order they occur in the text; I would like them to be in alphabetical order. I would prefer not to alphabetize several hundred words by hand, obviously. I know there are easy utilities on the web for alphabetizing normal lists in English, but none that I could find knew what to do with macrons; they stuck them after the non-accented characters, when what they should do is treat the character with a macron as though it were the non-accented version (ā should be treated as just a). In other words, dīcō should precede doctus, not follow it, just as if it said dico.

My glossary list is just a plain text file with one entry per line (the macrons are Unicode characters). I'm using Mac OS X 10.4, although I can access a PC at work, if that helps.

Thank you!
posted by lysimache to Computers & Internet (9 answers total)
 
Is there a reason why Excel won't work? I just tried it, and it worked fine for me. If you don't have Excel, use the OpenOffice equivalent.
posted by Madamina at 10:12 AM on July 29, 2010


Best answer: I just tried sorting a list of words with accented (including ā) and unaccented characters, and it did the intuitively correct thing. This was on OS X 10.6 in BBEdit, but I'm pretty sure this will work on 10.4. If you don't want to spring for BBEdit, TextWrangler is free and uses the same text-editing engine. Though I'm pretty sure the character sorting happens at the OS level, so it shouldn't matter what app you're in.

So I don't think you need to do anything fancy.
posted by adamrice at 10:12 AM on July 29, 2010


I just tried GNU sort on the last two lines of your post, and it seems to work the way you want. Do you have a Linux live CD lying around?
posted by Dr Dracator at 10:12 AM on July 29, 2010


Best answer: I use the text editor BBEdit, and it sorts the way you want by default. Or at least it seems to work right based on a quick test.

Bare Bones also makes a free editor called TextWrangler; I'm pretty sure it has the same sort function as BBEdit.
posted by bcwinters at 10:15 AM on July 29, 2010


Best answer: Sort Lines on textop.us does this.
posted by burnmp3s at 10:16 AM on July 29, 2010


Response by poster: Thanks, everyone! AskMe is the best. :)

The web-based tool burnmp3s found sorts the macronned vowels properly (although I'm not seeing how to make it ignore capitalization?), and my gf already had TextWrangler on her computer, which also worked. Yay! Thank you again!
posted by lysimache at 10:34 AM on July 29, 2010


OS X's terminal-based sort command should be able to do this, if the locale is set correctly. You're basically doing a dictionary (case-insensitive) sort, which isn't that hard to do, but you need to be aware of what your computer thinks is the right collation sequence.
(memories of writing a generalized multilingual sorting tool - particularly bad ones, when it came to Welsh sorting ...)
posted by scruss at 10:59 AM on July 29, 2010


The web-based tool burnmp3s found sorts the macronned vowels properly (although I'm not seeing how to make it ignore capitalization?)

That's weird, I hadn't noticed that. If you don't need to preserve the existing capitalization you could run them through Capitalize first letters or Lowercase on the same site to make the capitalization consistent.
posted by burnmp3s at 11:06 AM on July 29, 2010


Yep, the sort command obeys the locale setting when it comes to sort order:

$ echo -e "doctus\ndīcō" | LC_COLLATE=en_US.UTF-8 sort
dīcō
doctus

$ echo -e "doctus\ndīcō" | LC_COLLATE=C sort
doctus
dīcō


If it's not doing what you want, change your locale setting. The 'C' (aka POSIX) locale is the old default from before the days of i18n/l10n, so if no locale is set, that's what you get, and it doesn't know anything about UTF-8 sequences.
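
If changing the locale isn't an option, or the installed locale data doesn't collate the way you expect, one locale-independent fallback is to sort on a "folded" copy of each line with the macrons and capitalization stripped out. Here is a minimal Python sketch of that idea, not what sort or any of the tools mentioned above does internally. The function names fold_key and sort_glossary are made up for illustration, and it assumes each glossary entry is one line that starts with the Latin headword.

```python
import unicodedata


def fold_key(line):
    """Return a sort key with diacritics stripped and case ignored."""
    # NFD splits a precomposed letter such as "ā" into "a" plus
    # U+0304 COMBINING MACRON.
    decomposed = unicodedata.normalize("NFD", line)
    # Drop the combining marks so only the base letters remain.
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # casefold() makes the comparison case-insensitive.
    return base.casefold()


def sort_glossary(lines):
    """Sort lines as if macrons and capitalization were absent."""
    # The original line is used as a tie-breaker so that entries such as
    # "dico" and "dīcō" always come out in the same relative order.
    return sorted(lines, key=lambda line: (fold_key(line), line))


if __name__ == "__main__":
    # Hypothetical sample entries; a real glossary would be read from a file.
    words = ["doctus", "dīcō", "Dea", "amō"]
    for word in sort_glossary(words):
        print(word)
```

With the sample list this prints amō, Dea, dīcō, doctus, one per line. Note that folding strips every combining mark, not just macrons, which is fine for Latin but would be wrong for languages where accented letters sort separately.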
posted by Rhomboid at 2:14 PM on July 29, 2010

