Corpus Computational Linguistics
July 1, 2006 11:28 PM

Corpus/computational linguistics broad question: I'm a religious studies scholar, but I have a tremendous amount of interest in one of the six Middle Iranian languages. There is a substantial amount of its complete corpus already entered in Unicode, and I want some advice on how to make it work for me. Books, programs, websites, etc. are all fair game.

Collocations are key, especially in the analysis of certain religious concepts, but it's been a long time since Linguistics 101, and I need all the help I can get. I've also entered in a fair amount of information on my own (wordlists, mainly), and I'm looking for the best way to keep all this stuff straight.
posted by AArtaud to Writing & Language (13 answers total) 2 users marked this as a favorite
 
Response by poster: ah, well, I realize it is vague. I think at some point I'll have to look into a SQL database, since one of the problems with this language is that some words are simply unknown, and being able to quickly find all attested forms would be a big help. But I'm also looking for a good guide to computational linguistics. I was just looking on Amazon at the recent version of the Oxford Guide to Computational Linguistics. I'm also thinking of purchasing Concordance, the software program. Any suggestions?
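Something like this SQLite sketch is the kind of "attested forms" table I have in mind. The table layout and the citations are just made-up illustrations, not a real schema anyone has published:

```python
import sqlite3

# Hypothetical schema: every attested form is stored with its lemma
# (when known) and the passage it occurs in, so all attestations of
# a word can be pulled up with one query.
con = sqlite3.connect(":memory:")  # use a file path to persist
con.execute("""
    CREATE TABLE attestation (
        surface_form TEXT NOT NULL,
        lemma        TEXT,            -- NULL when the word is unknown
        citation     TEXT NOT NULL    -- e.g. text name + line number
    )
""")
con.executemany(
    "INSERT INTO attestation VALUES (?, ?, ?)",
    [("balysa", "balysa-", "Z 1.1"),
     ("balysu", "balysa-", "Z 2.4"),
     ("hvatana", None, "Or. 9614")],
)
# All attested forms of one lemma, with where they occur:
rows = con.execute(
    "SELECT surface_form, citation FROM attestation "
    "WHERE lemma = ? ORDER BY surface_form", ("balysa-",)
).fetchall()
print(rows)  # [('balysa', 'Z 1.1'), ('balysu', 'Z 2.4')]
```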
posted by AArtaud at 1:10 AM on July 2, 2006


Concordance is pretty cool. So is WordSmith, which has a free trial version.
posted by lunchbox at 5:57 AM on July 2, 2006


IAACL (I am a computational linguist)

There are two books that are the bibles of computational linguistics. The first, Foundations of Statistical Natural Language Processing, is probably more of what you are looking for and has lots of formulas. Speech and Language Processing is also a great resource and is especially good at explaining CL concepts. Both talk in depth about collocations.

There are many packages of tools for finding collocations, building word lists, and looking for common n-grams, and I've written a lot of code personally for these specific problems. Please contact me if you are interested. Also, if you don't know much about programming, Excel is your friend. Just remember to save as a non-proprietary tab-delimited file.
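To give you a sense of how little code a frequency list and an n-gram list take, here is a minimal Python sketch. The whitespace tokenizer and the sample sentence are just placeholders; a real script for your language would need proper tokenization:

```python
from collections import Counter

# Sketch: word-frequency and bigram lists from a plain-text corpus.
# Assumes one UTF-8 text with whitespace-separated tokens.
def freq_and_bigrams(text):
    tokens = text.split()
    unigrams = Counter(tokens)                 # word -> count
    bigrams = Counter(zip(tokens, tokens[1:])) # adjacent pair -> count
    return unigrams, bigrams

sample = "the word the word dharma the"
uni, bi = freq_and_bigrams(sample)
print(uni.most_common(2))  # [('the', 3), ('word', 2)]
print(bi.most_common(1))   # [(('the', 'word'), 2)]
```

Dumping `uni.items()` out as tab-separated lines gives you exactly the kind of file Excel (or anything else) can open.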
posted by Alison at 6:35 AM on July 2, 2006 [1 favorite]


Depending on the size of the corpus, you might look at DEVONthink Professional, a freeform database that provides a concordance and tools to identify which documents are most closely related. It's available for Mac OS X, and you can download a free trial from Devon Technologies' website.
posted by brianogilvie at 7:30 AM on July 2, 2006


You have to check out Morphix, the computational linguistics Linux distribution.
posted by scalefree at 8:28 AM on July 2, 2006


Whoops, got that wrong. Make that Morphix-NLP. It's been a while since I looked at it, sorry.
posted by scalefree at 8:30 AM on July 2, 2006


I'm not entirely sure what you're asking. But XML can be a godsend for putting text into an e-friendly form, and Perl can be a godsend for transforming text (I just used Perl to convert a 70 MB XML file to PML [Palm Markup Language]; it was relatively painless). Marking up the semantics of the text with XML, rather than the presentation, is a good idea regardless of what you plan to do with the text.

The TEI has an excellent, short introduction to XML, the reasons why you might want to use it, and a bunch of DTDs for various types of text.
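As a tiny illustration of why semantic markup pays off, here is a Python sketch that pulls marked terms out of an XML fragment. The element names and attributes are invented for the example, not a real TEI schema:

```python
import xml.etree.ElementTree as ET

# Minimal sketch of semantics-over-presentation markup. Because the
# markup says *what* each span is, extraction is one query.
doc = """<text>
  <l n="1">A line with a <term type="religious">dharma</term> term.</l>
  <l n="2">Another <term type="religious">buddha</term> reference.</l>
</text>"""

root = ET.fromstring(doc)
# Every marked religious term, regardless of how it's displayed:
terms = [t.text for t in root.iter("term") if t.get("type") == "religious"]
print(terms)  # ['dharma', 'buddha']
```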
posted by teece at 10:46 AM on July 2, 2006


Oops, forgot the TEI link.
posted by teece at 10:47 AM on July 2, 2006


I should add one more thing, sorry. You say your text is already entered "in Unicode." Strictly speaking, that's impossible: Unicode itself is an abstraction and doesn't specify how text ends up on the computer as bits and bytes. Your text is stored in a particular encoding of the Unicode code points.

This, I'm sure, sounds like insane, arcane, programmer jibber jabber if you are not familiar with what Unicode is.

But if you are going to do any "rolling your own" type of analysis of non-ASCII text with a programming language, you are going to have to learn the distinction between Unicode and a particular Unicode encoding. As someone who recently started scripting some stuff with Unicode, I know this from experience. I thought UTF-8 = Unicode = end of story. Wrong. It's not that simple. But it's not that hard, either. (The TEI link above even points to a simple intro to Unicode.)
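The distinction is easy to see in a few lines of Python. One string of code points, several possible byte encodings (the sample word is just a transliteration-style illustration):

```python
# A str is a sequence of Unicode code points; bytes only exist
# after you pick an encoding with .encode().
s = "\u0101h\u0101ra"  # "āhāra": long-a marks, as in Indic/Iranian transliteration

utf8 = s.encode("utf-8")
utf16 = s.encode("utf-16-le")
print(len(s))      # 5 code points
print(len(utf8))   # 7 bytes: each ā takes 2 bytes in UTF-8
print(len(utf16))  # 10 bytes: every code point here takes 2 bytes

# Round-tripping only works if you decode with the same encoding:
assert utf8.decode("utf-8") == s
```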

Sorry if all of this is way off base.
posted by teece at 10:58 AM on July 2, 2006


Response by poster: Thanks for the great answers, everyone. I suppose I should have thought on the question more carefully. What it boils down to is this: what can these tools provide to help me with my research in religious studies? The language itself is Khotanese, and it's deader than a doornail. There is no one dictionary to refer to, but since I want to be working with this language for the next couple of decades, I thought getting started in a rigorous statistical way might be useful. Any means to that end is what I was aiming for.
posted by AArtaud at 9:40 PM on July 2, 2006


I'll defer to others on the books, but you'll find all the tools you need to learn about the concepts already built and installed into a LiveCD Linux distribution, Morphix-NLP. Just download the ISO, burn it to a CD, and boot it; there's no need to even install it on your hard drive. It's got a ton of quality tools on it, ready to use.
posted by scalefree at 5:31 PM on July 3, 2006 [1 favorite]


I work with low-resource languages, and just having digital resources is a blessing. With text files I can give you lists of words and their frequencies, lists of collocations, n-gram lists, etc. I know of one person doing automated allomorph discovery and another doing morpheme discovery. Raw data is very valuable in my field, so I'm sure we could work something out if you are willing to share what you've got.
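A collocation list usually comes from a score like pointwise mutual information: how much more often two words co-occur than their individual frequencies predict. A toy sketch (the corpus is invented, and real scripts add smoothing and minimum-count cutoffs, since raw PMI overrates rare pairs):

```python
import math
from collections import Counter

# Toy PMI for an adjacent bigram: log2( P(w1,w2) / (P(w1) * P(w2)) ).
def pmi(tokens, w1, w2):
    n = len(tokens)
    uni = Counter(tokens)
    bi = Counter(zip(tokens, tokens[1:]))
    p_w1 = uni[w1] / n
    p_w2 = uni[w2] / n
    p_bi = bi[(w1, w2)] / (n - 1)
    return math.log2(p_bi / (p_w1 * p_w2))

corpus = "holy water holy water holy water plain fire plain fire plain water".split()
# The recurring pair "holy water" outscores the one-off "plain water":
print(pmi(corpus, "holy", "water") > pmi(corpus, "plain", "water"))  # True
```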

At the very least, please consider putting it in the OLAC archive.

posted by Alison at 7:55 PM on July 4, 2006


I've used WordCruncher to concordance water-related terms in the Book of Isaiah (it handles non-Roman writing systems adroitly) and to search for collocations of flatter in the Book of Mormon (strangely, it collocates very strongly with terms describing political rebellion).

On the Indo-Europeanist side, the TITUS WordCruncher Server has a variety of texts in Iranian languages, which might have more or less useful overlap with Khotanese, either as examples to show you how the technical issues are handled, or as comparative material.

I know there are organizations that do religious studies in ancient languages and in which WordCruncher and other technologies are used and loved, but I don't know exactly what people to refer you to. You might try getting in contact with the Center for the Preservation of Ancient Religious Texts. In fact, depending on what texts you work with, you might keep an eye on or submit your work to their sister project, the Middle Eastern Texts Initiative.

OK, I'm done spamming for my alma mater now.
posted by eritain at 10:02 PM on April 17, 2007

