Blundering around in statistics
March 24, 2017 6:44 PM

I am trying to understand some statistical data in a piece dealing with a historical text corpus and am in over my depth, which doesn’t take much doing. I’d be grateful for a simple explanation of what’s going on, plus the technical terms for this phenomenon, if it is one.

My author is analyzing a historical corpus of 20 texts of greatly varying lengths - the longest is almost 15 times longer than the shortest. They are counting the number of instances of a variable (let’s say rhymes, though it isn’t) in each text and figuring out which text has the most instances, both absolutely and in proportion to its length. Unsurprisingly, the absolute number of instances in the longer texts is greater than in the shorter ones: more syllables = more potential for rhymes. That I get, though it doesn’t seem all that useful a result. What surprises me and makes me wonder if something fishy, aka probability-related, is going on, is that the relative frequency is exactly the reverse: when you work out rhymes per word for each text, and multiply by 1000 to get a ‘rhymes per 1000 words’ figure, the shortest texts have the highest relative frequency.

Is there something going on here, probability-wise? Or am I just being a doofus and the shortest texts really do just so happen to all be more densely rhymey than the longer ones?
posted by ogorki to Science & Nature (9 answers total) 1 user marked this as a favorite
 
I can't imagine any statistical reason for the shorter texts having greater rhyme density. It seems the answer would probably lie in the nature of the texts. For example, the longer texts might be scholarly tomes which are less likely to contain short words that might inadvertently rhyme with other short words. The other possibility is that the finding is anomalous. After all, twenty is a really small sample size, making it easy for such an anomaly to occur, while the sample size for the absolute number of instances is far larger (i.e., all the syllables across the twenty texts).
posted by DrGail at 7:03 PM on March 24, 2017


It's certainly possible that there is a reason for this, but without knowing more about the texts, and exactly what is being counted, we have no way to explain this. Sorry.
posted by yeolcoatl at 7:47 PM on March 24, 2017 [1 favorite]


Best answer: Well, if it's something that's more likely to happen at the beginning or end of a text, it would make sense for the shorter ones to have more. For example, if you were counting phrases like "Once upon a time" or "The End," then one-page stories would have them on every page, while they would only show up on 10% of the pages in a 10-page story.
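
To put rough numbers on that, here is a minimal Python sketch of a toy model. Everything in it is an assumption made up for illustration: a fixed 4 hits near the start and end of every text, plus an even background of 1 hit per 1000 words. The names `expected_count` and `per_1000` are hypothetical helpers, not anything from the piece being discussed.

```python
def expected_count(n_words, edge_hits=4, background_per_1000=1.0):
    """Toy model (assumed): a fixed number of hits near the start and end,
    plus a low, even background rate through the whole text."""
    return edge_hits + background_per_1000 * n_words / 1000


def per_1000(count, n_words):
    """Convert an absolute count into a 'per 1000 words' figure."""
    return 1000 * count / n_words


# Invented lengths; the longest is 15 times the shortest, as in the question.
lengths = [1000, 5000, 15000]
rows = [(n, expected_count(n), per_1000(expected_count(n), n)) for n in lengths]

for n, count, rate in rows:
    print(f"{n:>6} words: {count:5.1f} hits, {rate:.2f} per 1000 words")
```

Under these made-up numbers the absolute count rises with length while the per-1000 figure falls, which is the same pattern described in the question.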

But for something like rhymes, where you'd expect it to be pretty evenly distributed through the text? Yeah, I think the shorter ones really are more densely rhymey.
posted by selfmedicating at 8:10 PM on March 24, 2017 [1 favorite]


If there are more rhymes per kilo-word in the shorter text, then they're more densely rhymey, period. This is like measuring how tall someone is; it just is.

I think what you're trying to get at, maybe, is whether they're more densely rhymey *because* they're shorter, so maybe the long texts are rhymier than they "should be," but the shorter texts are less rhymey than you would expect for works of that length. This isn't a straightforward statistics problem; you'd need to be able to clearly specify what you would expect if nothing interesting were happening.
posted by ROU_Xenophobe at 8:21 PM on March 24, 2017


Response by poster: Selfmedicating has it! It is indeed a feature (close enough to rhyme to make no difference) that is commoner at the beginning and end of the text. I should have thought of that... So (a follow-up, if I may) there is probably no meaningful way of comparing rhyme frequency across this group of texts, then?
posted by ogorki at 8:28 PM on March 24, 2017


You could, in theory, analyze just the middle of the documents - to continue Selfmedicating's example, discard the first four words and then count the "once upon a time"s, but it may be very difficult to do that in a standardized way.
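
A rough Python sketch of the "just the middle" idea, assuming the text is already split into a list of words and that you can write a test for whether a given word counts as a hit. The names `middle_rate_per_1000`, `is_hit` and `trim` are hypothetical, and how many words to trim from each end is exactly the part that is hard to standardize.

```python
def middle_rate_per_1000(words, is_hit, trim):
    """Hits per 1000 words after dropping `trim` words from each end.

    Returns None when trimming leaves nothing to measure.
    """
    if len(words) <= 2 * trim:
        return None
    middle = words[trim:len(words) - trim]
    hits = sum(1 for w in middle if is_hit(w))
    return 1000 * hits / len(middle)


# Toy 200-word text: "X" marks a hit. Two hits at each end, two in the middle.
words = (["X", "X"] + ["a"] * 48 + ["X"] + ["a"] * 98
         + ["X"] + ["a"] * 48 + ["X", "X"])


def is_x(word):
    return word == "X"


print("whole text:", middle_rate_per_1000(words, is_x, 0))
print("middle only:", middle_rate_per_1000(words, is_x, 10))
```

A fixed `trim` treats a short text and a long text the same way at the edges; trimming a fixed fraction of each text instead would be a different choice with different trade-offs.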
posted by Homeboy Trouble at 8:42 PM on March 24, 2017


Right, you can't have a meaningful comparison across the entire group of texts without further grouping them. For example, if half the documents were loan agreements of various lengths, and the other half were restaurant menus of various lengths, you could say that loan agreements are more likely to have this feature than restaurant menus. But then you would be tallying "does this document display this variable", not "over 1000 words, this variable is displayed X times."
posted by batter_my_heart at 10:27 PM on March 24, 2017 [1 favorite]


Best answer: You could divide the texts up into chunks of 20 or 100 or 1000 words (or whatever's a reasonable chunk for your text size) and make graphs of how often X occurs across chunks for each text. If the beginning-and-end explanation is correct, the charts will be higher at the beginning and end and sag in the middle, and it'll also be a nice visual explainer: long texts have more middle so their overall X frequency is lower. You might see unexpected patterns show up too.
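
The chunking idea might look something like this in Python. It is only a sketch: it assumes the text is a list of words and that `is_hit` (a hypothetical function you would write yourself) says whether a word counts as an instance of X. The text-mode bar chart is just a stand-in for a real graph.

```python
def chunk_counts(words, is_hit, chunk_size):
    """Count hits in consecutive, non-overlapping chunks of `chunk_size` words.

    The final chunk may be shorter than the others.
    """
    return [
        sum(1 for w in words[i:i + chunk_size] if is_hit(w))
        for i in range(0, len(words), chunk_size)
    ]


# Toy 20-word text: hits ("X") cluster at the beginning and the end.
words = ["X", "a", "X", "a"] + ["a"] * 12 + ["a", "X", "a", "X"]
counts = chunk_counts(words, lambda w: w == "X", 4)

for i, c in enumerate(counts):
    print(f"chunk {i}: {'#' * c}")
```

With this toy text the bars are tall in the first and last chunk and empty in between, which is the sag-in-the-middle shape described above.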

You could also try with rolling samples - instead of words 1-100, 101-200, etc. do it with 1-100, 2-101, 3-102 and so forth. And see what it looks like if the chunk size is proportional to the overall length of the text. Or chunks based on paragraphs or chapters, if those are appropriate divisions of your texts.
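
A rolling version could be sketched like this, again assuming a list of words and a hypothetical `is_hit` test. It keeps a running total and adjusts it as the window slides one word at a time, rather than recounting every window from scratch.

```python
def rolling_counts(words, is_hit, window):
    """Hits in every run of `window` consecutive words (1-100, 2-101, ...)."""
    flags = [1 if is_hit(w) else 0 for w in words]
    if window <= 0 or len(flags) < window:
        return []
    current = sum(flags[:window])
    out = [current]
    for i in range(window, len(flags)):
        # Slide by one word: add the word entering, drop the word leaving.
        current += flags[i] - flags[i - window]
        out.append(current)
    return out


words = ["X", "a", "X", "a", "a", "a", "a", "a", "X", "X"]
rolled = rolling_counts(words, lambda w: w == "X", 4)
print(rolled)
```

For a window proportional to the text's length, one option (an assumption, not a recommendation) is to pass something like `max(1, len(words) // 10)` as `window`.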

I do this kind of thing all the time, but for art, not scholarship, so it's possible my suggestions have no statistical interest! But they might still help you see patterns.
posted by moonmilk at 4:48 AM on March 25, 2017 [1 favorite]


Best answer: I'm late to the party, but wanted to comment that there is one other way to examine this. If you expect rhymes per 1000 words to fall as the text gets longer, you can control for that using a linear regression and look for books that deviate from the value you'd expect based on their overall length. That is like graphing the number of rhymes per kiloword against the total number of words. You'd expect the points to fall roughly along a downward-sloping line, but some of the books will fall above or below it; those are the books that might tell you something interesting.
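
As a rough illustration in Python, here is an ordinary least-squares line fitted by hand, with each text's distance above or below the line printed out. The lengths and rates are invented for the example and are not from the corpus in the question; `fit_line` and `residuals` are hypothetical helper names. It also assumes a straight line is a reasonable fit, which may not hold; if the rate really is inversely proportional to length, fitting on logged values might be a better choice.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def residuals(xs, ys):
    """Observed value minus the value the fitted line predicts, per text."""
    slope, intercept = fit_line(xs, ys)
    return [y - (slope * x + intercept) for x, y in zip(xs, ys)]


# Invented numbers, purely for illustration.
lengths = [1000, 2000, 4000, 8000, 15000]   # total words per text
rates = [5.0, 4.6, 4.9, 2.9, 1.2]           # hits per 1000 words

slope, intercept = fit_line(lengths, rates)
res = residuals(lengths, rates)

for n, r in zip(lengths, res):
    print(f"{n:>6} words: {r:+.2f} per 1000 relative to the line")
```

Texts with a large positive residual have more of the feature than their length alone would suggest, and those with a large negative residual have less.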
posted by agentofselection at 3:11 PM on April 13, 2017

