How do I automatically alphabetize a list containing accented characters?
July 29, 2010 9:59 AM
How do I automatically alphabetize a list of words that contains macrons (accented characters)?
I'm writing a Latin-English glossary for a Latin text. The Latin words sometimes contain characters with macrons (a horizontal line over the vowel indicating that it is long: ā). Right now, I have a list of the words and definitions in the order they occur in the text; I would like them to be in alphabetical order. I would prefer not to alphabetize several hundred words by hand, obviously. I know there are easy utilities on the web for alphabetizing normal lists in English, but none that I could find knew what to do with macrons; they stuck them after the non-accented characters, when what they should do is treat the character with a macron as though it were the non-accented version (ā should be treated as just a). In other words, dīcō should precede doctus, not follow it, just as if it said dico.
My glossary list is just a text file (the macrons are Unicode characters), separated by line breaks. I'm using Mac OS 10.4, although I can access a PC at work, if that helps.
Thank you!
Best answer: I just tried sorting a list of accented (including ā) and unaccented characters, and it did the intuitively correct thing. This was on OS X 10.6 in BBEdit, but I'm pretty sure this will work on 10.4. If you don't want to spring for BBEdit, TextWrangler is free and uses the same text-editing engine. That said, I'm pretty sure the character sorting happens at the OS level, so it shouldn't matter what app you're in.
So I don't think you need to do anything fancy.
posted by adamrice at 10:12 AM on July 29, 2010
I just tried GNU sort on the last two lines of your post, and it seems to work the way you want. Do you have a Linux live CD lying around?
posted by Dr Dracator at 10:12 AM on July 29, 2010
Best answer: I use the text editor BBEdit, and it sorts the way you want by default. Or at least it seems to work right based on a quick test.
BareBones also makes a free editor called TextWrangler; I'm pretty sure it has the same sort function as BBEdit.
posted by bcwinters at 10:15 AM on July 29, 2010
Response by poster: Thanks, everyone! AskMe is the best. :)
The web-based tool burnmp3s found sorts the macronned vowels properly (although I'm not seeing how to make it ignore capitalization?), and my gf already had TextWrangler on her computer, which also worked. Yay! Thank you again!
posted by lysimache at 10:34 AM on July 29, 2010
OS X's terminal-based sort should be able to do this, if the locale is set correctly. You're basically doing a dictionary (case-insensitive) sort, which isn't that hard to do, but you need to be aware of what your computer thinks is the right collation sequence.
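One rough way to see which collation setting sort will pick up is to follow the POSIX precedence by hand: LC_ALL wins if it is set and non-empty, then LC_COLLATE, then LANG. The sketch below assumes a POSIX shell; the function name effective_collate is made up for illustration, and "C" stands in for the implementation default when nothing is set (that default is implementation-defined, though it is typically C/POSIX).

```shell
#!/bin/sh
# Sketch: report which locale name governs collation for child processes.
# POSIX precedence: LC_ALL, then LC_COLLATE, then LANG.
# "C" is an assumed stand-in for the implementation default when none is set.
# effective_collate is a hypothetical helper name, not a standard command.
effective_collate() {
  printf '%s\n' "${LC_ALL:-${LC_COLLATE:-${LANG:-C}}}"
}

# Print the setting for the current shell environment.
effective_collate
```

If this prints C or POSIX, sort is not using a language-aware collation; on most systems, `locale -a` lists the locale names that are installed and could be used instead.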
(memories of writing a generalized multilingual sorting tool - particularly bad ones, when it came to Welsh sorting ...)
posted by scruss at 10:59 AM on July 29, 2010
The web-based tool burnmp3s found sorts the macronned vowels properly (although I'm not seeing how to make it ignore capitalization?)
That's weird, I hadn't noticed that. If you don't need to preserve the existing capitalization you could run them through Capitalize first letters or Lowercase on the same site to make the capitalization consistent.
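If the existing capitalization does need to be preserved, a command-line alternative is the -f option of sort, which folds lowercase to uppercase for comparison only and leaves the lines themselves untouched. This is a small sketch, not a description of what the website does; the function name sort_ignoring_case and the sample words are made up for illustration, and the collation of non-ASCII characters still depends on the locale in effect.

```shell
#!/bin/sh
# Sketch: sort lines while ignoring ASCII capitalization, keeping the
# original spelling of each line. Uses whatever locale is in effect.
# sort_ignoring_case is a hypothetical helper name.
sort_ignoring_case() {
  sort -f "$@"
}

# Demo with made-up words: prints canis, Deus, doctus (one per line).
printf 'doctus\nDeus\ncanis\n' | sort_ignoring_case
```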
posted by burnmp3s at 11:06 AM on July 29, 2010
Yep, the sort command obeys the locale setting when it comes to sort order:
$ echo -e "doctus\ndīcō" | LC_COLLATE=en_US.UTF-8 sort
dīcō
doctus
$ echo -e "doctus\ndīcō" | LC_COLLATE=C sort
doctus
dīcō
If it's not doing what you want, change your locale setting. The 'C' (aka POSIX) locale setting is the old default from before the times of i18n/l10n, so if no locale is set, that's what you get, and it compares raw byte values without knowing anything about UTF-8 sequences.
posted by Rhomboid at 2:14 PM on July 29, 2010
This thread is closed to new comments.
posted by Madamina at 10:12 AM on July 29, 2010