How do I categorize sentence structures?
September 13, 2011 8:31 AM

I'm looking for academic linguistic papers and/or books on classification of sentence structures. (Should I turn on the languagehat signal?)

I'm doing a PhD thesis on authorship attribution and Elizabethan drama, and I'm reaching a point where I need to start extracting and manipulating my data. What I'm trying to do is see if there's a stylistic difference between authors based on the grammatical patterns of the sentences they use. (E.g., "The king rises", determiner-subject noun-intransitive verb.)

The problem I foresee is that since sentences are such flexible modular things, especially in this sort of text, I could wind up with a number of patterns that's way too large for me to do anything statistically meaningful with. Has anyone done any work on categorizing this sort of linguistic data into broad-but-manageable areas for anything close to this type of research?

I'm trawling through the MLA Bibliography and JSTOR as usual, so this isn't (too much of) a "Do my homework for me!", but I thought I'd pick Metafilter's collective brain.

(Also, this isn't a "Did Shakespeare actually write Shakespeare?" thing; those people are nuts.)
posted by Mr. Bad Example to Writing & Language (9 answers total) 1 user marked this as a favorite
There might be *much* more usable and accessible stuff out there, but these two things came to mind:
posted by zeek321 at 8:38 AM on September 13, 2011

Rather than looking at grammar alone, you might think about looking more closely at the classical idea of "figures of speech." There is a fuzzy area where these things overlap and you might find that figures of speech (such as kennings in ee cummings, asyndeton in Hopkins, and, well, all of them in Shakespeare) are a useful means of classifying the stylistic features of an author (specifically of a well-known Elizabethan dramatist). It seems that the best Elizabethan dramatists have a peculiar and very sophisticated way of manipulating figures of speech that are revealing of a particular character rather than simply using them to be dazzling or clever.

Grammatical structures and their identification and classification are pretty fuzzy, especially once you start throwing in the fact that grammar as a formal discipline (in the way we think about it) didn't come about until after the Elizabethans.

I would choose a few patterns, identify the context, show that a pattern exists, and then identify places where the pattern *should be* but doesn't exist.
posted by madred at 8:52 AM on September 13, 2011

As stated, you're looking for tree-tagged or part-of-speech-tagged corpora of Elizabethan printed text. I suspect they exist and are documented, but I don't know of any off the top of my head. I suggest you identify some NLP researchers and write to them with this question.
posted by Nomyte at 8:55 AM on September 13, 2011

Two options:

If you're looking at the grammatical structure of corpus sentences, then you're probably going to need a syntactic parser, of the sort that people like John Hale work on. In this case it would be up to you to identify the categories in the syntax.

Alternatively, you could use something like a textual data mining approach to generate clusters. I know a great deal less about data mining, other than that it is a black, black art. But it gets results. You could adapt something like AGNES to determine how many categories you really have, and then something like k-means to figure out what's in them. A method like this will not necessarily focus on the grammar; it will just tell you how the texts are more alike or more different, depending on what you feed into it, which may hold some surprises for you. The hard part is that when you work with sentences, your initial dimensionality is going to be infinite unless you think of a clever way to encode them.
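As a rough illustration of one possible encoding (my assumption, not something any particular clustering package requires): count the part-of-speech bigrams in each sentence, so that every sentence becomes a fixed-length vector no matter how long it is. The tag names and the hand-tagged sentences below are made up for the example, and the function names are hypothetical.

```python
from collections import Counter

def tag_ngrams(tags, n=2):
    """Return the n-grams (as tuples) found in one tagged sentence."""
    return [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]

def encode(sentences, n=2):
    """Turn tag sequences into count vectors over a shared n-gram vocabulary."""
    counts = [Counter(tag_ngrams(s, n)) for s in sentences]
    vocab = sorted(set().union(*counts))
    vectors = [[c[g] for g in vocab] for c in counts]
    return vocab, vectors

# Hypothetical hand-tagged sentences (DET = determiner, N = noun, V = verb).
sentences = [
    ["DET", "N", "V"],                # "The king rises"
    ["DET", "N", "V", "DET", "N"],    # "The king draws his sword"
]
vocab, vectors = encode(sentences)
```

Vectors like these are the kind of input a clustering method could work on. The vocabulary still grows with the tag set and with n, so you would probably want to keep both small.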
posted by yeolcoatl at 8:58 AM on September 13, 2011

You need to read a syntax 101 style textbook. The reason linguists don't classify sentences quite in the way you are thinking of is exactly because there are too many possible options, as you point out.

You need to learn about phrase types so that you can break your sentences down into phrases (e.g. that's a NP + VP, and the VP consists of just a V, or the VP consists of V + NP). Tree diagrams will help you here, and a shortcut (to be implemented with caution if you don't know enough to check the accuracy of the diagrams) is to input sentences of interest into an NLP parser.

Alternatively you can use concepts of traditional grammar to identify a few major clause and phrase types (active vs passive, complex vs simple, coordination vs subordination) and search your text for frequencies of these.

IAAL (I am a linguist).
posted by lollusc at 8:58 AM on September 13, 2011 [4 favorites]

This is what a parser is like, by the way, although I'm not suggesting you necessarily use this one.
posted by lollusc at 9:01 AM on September 13, 2011 [1 favorite]

One more comment: you don't need to categorise every type of sentence that appears in the works you are interested in. You can do something like 'number of passive clauses per 100 clauses' or similar, and even that one feature will probably vary a lot between the authors you are interested in. To test statistical significance, I guess you would want to check whether it varies more between authors than across different works by a single author.
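A minimal sketch of that comparison, assuming you have already counted passive and total clauses per play by hand or with a parser. The authors and the counts are invented, and comparing two variances like this is only a first look, not a proper significance test.

```python
from statistics import mean, pvariance

def rate_per_100(passive, total):
    """Passive clauses per 100 clauses."""
    return 100.0 * passive / total

# Hypothetical counts: author -> (passive clauses, total clauses) for each play.
counts = {
    "Author A": [(12, 400), (20, 500), (6, 300)],
    "Author B": [(40, 400), (33, 300), (54, 600)],
}

rates = {a: [rate_per_100(p, t) for p, t in works] for a, works in counts.items()}
author_means = {a: mean(r) for a, r in rates.items()}

# Average spread across plays by the same author.
within = mean(pvariance(r) for r in rates.values())
# Spread of the per-author averages.
between = pvariance(list(author_means.values()))
```

If `between` comes out much larger than `within`, the feature is at least a candidate for telling the authors apart; a one-way ANOVA or a similar test would be the more rigorous follow-up.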
posted by lollusc at 9:03 AM on September 13, 2011

Instead of trying to tackle the whole language and all its grammatical possibilities, it's better to pick a syntactic variable that you think has some social or otherwise meaningful significance as to which variant (of your variable) is chosen – a micro feature, indicative of a macro process. Then you extract all variants of your variable from the texts, run tallies and stats, and come up with a plausible explanation as to why. Your variable becomes an indicator of a larger language trend. If you track several variables this way, you have a bundle, and that's even more interesting. Some possible variables for this sort of data might be the dative alternation (I gave him the book vs. I gave the book to him) or a short embedded phrase before a longer one in particular types of clauses.
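For what it's worth, the tallying step can be very simple once the variants are extracted. This sketch assumes each occurrence has already been coded by hand as one variant or the other; the observations and the variant labels are invented.

```python
from collections import Counter, defaultdict

# Hypothetical hand-coded observations: (author, variant of the dative alternation).
observations = [
    ("Author A", "double_object"),   # "I gave him the book"
    ("Author A", "double_object"),
    ("Author A", "prepositional"),   # "I gave the book to him"
    ("Author B", "prepositional"),
    ("Author B", "prepositional"),
    ("Author B", "prepositional"),
    ("Author B", "double_object"),
]

# Count each variant separately for each author.
tallies = defaultdict(Counter)
for author, variant in observations:
    tallies[author][variant] += 1

def proportions(counter):
    """Convert raw variant counts into shares of the total."""
    total = sum(counter.values())
    return {variant: n / total for variant, n in counter.items()}

shares = {author: proportions(c) for author, c in tallies.items()}
```

With real data you would want far more tokens per author than this before reading anything into the shares.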

Check out this article (and the actual journal article ref'd - and its references) here. It might be just what you're looking for, or at the very least, a good place to start.
posted by iamkimiam at 10:13 AM on September 13, 2011

How committed are you to focusing on syntax?

Sounds like you might be aware of this already, but (on preview, expanding on yeolcoatl's answer a bit) most of the statistical authorship attribution work I've seen has taken a 'bag of words' approach, where the goal is to identify the distribution of words (rather than sentence types) and compare this distribution across authors, genres, or individual works. It's a different problem than the one you pose, but it's no accident that bag-of-words models are where a lot of questions in natural language processing start: they can be much more tractable than parser-based approaches. This approach, by the way, may be better suited to answering a question like "texts/authors A and B are different" than to "text A is like THIS, text B is like THAT."
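To make 'bag of words' concrete, here is a toy sketch: turn each text into relative word frequencies and compare two texts with cosine similarity. Cosine is just one common choice I'm assuming for illustration, and the three snippets are made up to stand in for whole plays.

```python
import math
import re
from collections import Counter

def word_freqs(text):
    """Relative frequency of each word, ignoring case and word order."""
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words)
    if total == 0:
        return {}
    return {w: n / total for w, n in Counter(words).items()}

def cosine(p, q):
    """Cosine similarity between two frequency dictionaries (0.0 if either is empty)."""
    dot = sum(v * q.get(w, 0.0) for w, v in p.items())
    norm = math.sqrt(sum(v * v for v in p.values())) * math.sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0

# Made-up snippets standing in for whole plays.
text_a = "the king rises and the king speaks"
text_b = "the queen rises and the queen speaks"
text_c = "of love and of war I sing"

sim_ab = cosine(word_freqs(text_a), word_freqs(text_b))
sim_ac = cosine(word_freqs(text_a), word_freqs(text_c))
```

Here `sim_ab` is higher than `sim_ac` because the first two snippets share most of their words. Notice that it says nothing about why they are similar, which is the "A and B are different" versus "A is like THIS" limitation.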

On more preview, yes, echoing iamkimiam, if you really want to do syntax, picking a few variables of interest to track will be much more tractable.
posted by heyforfour at 12:50 PM on September 13, 2011
