Letter Frequency: Which Letters for Which Letters?
May 21, 2011 4:22 PM

I would like to know how often one letter is followed by each of the other letters. For example: A is followed by R 20% of the time, while A is followed by Q only 0.1% of the time.

Surely someone has done research on this. I want to know the "odds" that a letter is followed by other letters. So, if I type A, what are the 5 most likely letters I will type after that?

I'm currently running a brute-force AppleScript to solve the problem, but I'm looking at 30-35 hours for 115,000 characters of randomly selected text. (It's not very efficient code.)

This is for a keyboard concept where the keyboard changes dynamically depending on previous input. (Yeah, not too original an idea, but, y'know…)

Thanks!
posted by 47triple2 to Writing & Language (15 answers total) 7 users marked this as a favorite
 
Equivalently, you want to know how frequently AQ, AR, etc. appear in text. So what you want is called, I believe, a table of digraph frequencies. The only complete one I could find was this one -- all the others just give lists of the most common digraphs -- but it's based on a pretty small sample. Who knows, you might be better off just creating your own table.
posted by madcaptenor at 4:30 PM on May 21, 2011


What you're looking for is a first-order Markov representation of English. The n-gram corpus from Google can likely be processed to yield an answer, but it's a huuuuuuuuge download.
posted by pwnguin at 4:41 PM on May 21, 2011 [1 favorite]


Even if you knew the percentages of the odds of one letter following the next letter, that surely wouldn't be the way to make an adaptive keyboard. Seemingly what matters is what letters people normally type. Which would be based on what words people commonly use NOT on the odds of using every word in existence.
posted by travis08 at 5:06 PM on May 21, 2011 [1 favorite]


This table seems to have what you're asking: linguistic digram frequency tables.
posted by jasonhong at 5:24 PM on May 21, 2011


Also, if you want to do a soft dynamic keyboard, I'd strongly advise you to read Scott MacKenzie's past work in the area, as well as Shumin Zhai's work on ShapeWriter. These two have done tons of work on text input and are among the most knowledgeable people on the planet about it. Reading their work can help you avoid common pitfalls and keep you from reinventing the wheel.

For example, one of the common pitfalls I've heard is that there is a tradeoff between changing the keyboard and predictability. That is, is the gain you get from changing the shape, size, or position of the next typed key far more than the time it takes for people to find and hit that key? Also, are the shape and size of keys hidden (as they are with iPhone and Android) or are they actually shown to users?
posted by jasonhong at 5:33 PM on May 21, 2011


Response by poster: travis08: "Even if you knew the percentages of the odds of one letter following the next letter, that surely wouldn't be the way to make an adaptive keyboard. Seemingly what matters is what letters people normally type. Which would be based on what words people commonly use NOT on the odds of using every word in existence."

Wouldn't "what letters people normally type" be the letters that most often follow other letters?

The basic concept for my theoretical keyboard is that if you type the letter "A", then the letters Q, R, P, and M show on screen most prominently because those are the most often "next-typed" letters.
posted by 47triple2 at 5:35 PM on May 21, 2011


Response by poster: jasonhong: "For example, one of the common pitfalls I've heard is that there is a tradeoff between changing the keyboard and predictability. That is, is the gain you get from changing the shape, size, or position of the next typed key far more than the time it takes for people to find and hit that key? Also, are the shape and size of keys hidden (as they are with iPhone and Android) or are they actually shown to users?"

This would be a keyboard that would completely ignore that pitfall. The keyboard would change dramatically from keypress to keypress. This, for me, is an experiment in programming/just having fun. In no way would I expect my keyboard to gain traction, as it would be WAY too dynamic for any percentage of users to pick up and use with speed.
posted by 47triple2 at 5:37 PM on May 21, 2011


Best answer: Here is a Python script that counts 500k characters/second. And here is a digram table based on the MetaFilter post title dump from the Infodump. Entries look like this:

a-l = 0.100338

Which means "The probability of a being followed by l is 10.0338%"
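
For anyone who wants to build a table like this from their own text, here is a minimal sketch of the counting step. It is not the script linked above; the function names `follower_probabilities` and `top_followers` are made up for illustration, and it assumes only adjacent pairs of plain a-z letters should count (case folded, everything else skipped).

```python
from collections import Counter, defaultdict


def follower_probabilities(text):
    """Return {first: {second: P(second | first)}} for adjacent a-z pairs."""
    counts = defaultdict(Counter)
    text = text.lower()
    for first, second in zip(text, text[1:]):
        # Assumption: count a pair only when both characters are ASCII letters.
        if "a" <= first <= "z" and "a" <= second <= "z":
            counts[first][second] += 1
    table = {}
    for first, followers in counts.items():
        total = sum(followers.values())
        table[first] = {second: n / total for second, n in followers.items()}
    return table


def top_followers(table, letter, n=5):
    """Return up to n (letter, probability) pairs, most likely first."""
    followers = table.get(letter, {})
    # Sort by descending probability, breaking ties alphabetically.
    return sorted(followers.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


if __name__ == "__main__":
    sample = "the quick brown fox jumps over the lazy dog and the cat sat"
    table = follower_probabilities(sample)
    # Print entries in the same "a-l = 0.100338" style as the table above.
    for second, p in top_followers(table, "a"):
        print(f"a-{second} = {p:.6f}")
```

Swap `sample` for the contents of whatever corpus you care about; `top_followers(table, "a")` then answers the "5 most likely letters after A" question directly.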

But to really do this right, you want to look at more than two-character sequences, you want to look at non-letter keyboard characters, and I'm not sure that analyzing an English corpus is the right approach because people type a lot of meta-characters like delete and escape that don't appear in a corpus. What about your keyboard learning directly from the user as the user types?
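
A rough sketch of what "learning directly from the user" could look like, assuming the keyboard can hand each keypress to a model as it happens. The class name `AdaptiveKeyModel` and the `'<delete>'` label are hypothetical; any hashable key label would work, which is how meta-keys get counted alongside letters.

```python
from collections import Counter, defaultdict


class AdaptiveKeyModel:
    """Learns which key tends to follow which, one keypress at a time."""

    def __init__(self):
        self.counts = defaultdict(Counter)
        self.previous = None  # last key seen; None before the first press

    def press(self, key):
        """Record a keypress. `key` is any hashable label, e.g. 'a' or '<delete>'."""
        if self.previous is not None:
            self.counts[self.previous][key] += 1
        self.previous = key

    def suggest(self, n=5):
        """Return up to n keys most often seen after the last key pressed."""
        if self.previous is None:
            return []
        followers = self.counts.get(self.previous, Counter())
        return [key for key, _ in followers.most_common(n)]


if __name__ == "__main__":
    model = AdaptiveKeyModel()
    for key in ["t", "h", "e", "t", "h", "<delete>", "t", "h"]:
        model.press(key)
    print(model.suggest())  # keys to show most prominently next
```

The counts start empty here, so in practice you would probably seed them from a corpus table and let the user's own typing take over from there.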
posted by qxntpqbbbqxl at 6:04 PM on May 21, 2011


You really should look at Dasher, which works on similar principles.
posted by suedehead at 6:07 PM on May 21, 2011


Best answer: Wouldn't "what letters people normally type" be the letters that most often follow other letters?

Depends on the corpus you pull from. There are a lot of words in Shakespeare I'm not likely to type, and words we don't type much in informal communication that will be better represented in, say, printed books. Basically, ask yourself how often you pull up dictionary definitions and thesauruses while typing, and how that compares to the corpus you're basing these statistics on.

There's a number of different corpora available; I know researchers use the Enron email archives for mining company network structures. Every half decent nerd has a few hundred megabytes of uncompressed IRC logs, and Twitter's becoming a popular source for analysis.

To my mind, you can easily improve the design you have now through a number of mechanisms, the simplest of which would be increasing the Markov order.
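
To make "increasing the Markov order" concrete, here is a small sketch that predicts from the last `order` characters instead of just the last one. The names `build_model` and `predict` are illustrative only, `order` is assumed to be at least 1, and no smoothing or fallback to shorter contexts is attempted.

```python
from collections import Counter, defaultdict


def build_model(text, order=2):
    """Map each `order`-character context to a Counter of what follows it."""
    model = defaultdict(Counter)
    for i in range(len(text) - order):
        context = text[i:i + order]
        model[context][text[i + order]] += 1
    return model


def predict(model, typed, order=2, n=5):
    """Return up to n likely next characters given the text typed so far."""
    context = typed[-order:]
    if len(context) < order:
        return []  # not enough characters typed yet to form a context
    return [ch for ch, _ in model.get(context, Counter()).most_common(n)]


if __name__ == "__main__":
    model = build_model("the then there", order=2)
    print(predict(model, "th"))   # what tends to follow "th"
    print(predict(model, "the"))  # what tends to follow "he"
```

Higher orders need far more text before the counts mean anything, since the number of possible contexts grows quickly with `order`.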
posted by pwnguin at 6:13 PM on May 21, 2011


Response by poster: suedehead,

Dasher is exactly the principle I'm working under! The only difference being that the buttons will be pressed rather than flying past.
posted by 47triple2 at 6:16 PM on May 21, 2011


A stupid C version, 6,551,666 chars/s, probably mucked up by such a small corpus. Includes 'space' with A-Z mapped to a-z for a 27x27 table.
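
For comparison, the same 27x27 layout is easy to sketch in Python (this is not the C code linked above, just an illustration). The name `count_pairs` is made up, and it assumes any character other than space or a letter is skipped and breaks the pair on both sides, which may differ from what the C version does.

```python
import string

ALPHABET = " " + string.ascii_lowercase  # index 0 is space, 1-26 are a-z


def count_pairs(text):
    """Return a 27x27 table where table[i][j] counts symbol i followed by symbol j."""
    index = {ch: i for i, ch in enumerate(ALPHABET)}
    table = [[0] * 27 for _ in range(27)]
    previous = None
    for ch in text.lower():  # lower() maps A-Z onto a-z
        current = index.get(ch)  # None for anything outside space and a-z
        if previous is not None and current is not None:
            table[previous][current] += 1
        previous = current
    return table


if __name__ == "__main__":
    table = count_pairs("Hello world")
    # Row for 'l', column for 'o': how often "lo" appears.
    print(table[ALPHABET.index("l")][ALPHABET.index("o")])
```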

Dasher was pretty neat last time I looked at it. I keep imagining that it could be pretty sweet with a big touch-screen if it could do some sort of polar layout so you could just poke the screen and make swirls. After training it would be like casting a spell or something.
posted by zengargoyle at 7:50 PM on May 21, 2011 [1 favorite]


I don't know if it would help any, but the Ward-Stone Ireland steno machine (used worldwide as a steno machine) follows the principles in your question.

http://en.wikipedia.org/wiki/Stenotype
posted by stenoboy at 7:17 AM on May 22, 2011


travis08: "Even if you knew the percentages of the odds of one letter following the next letter, that surely wouldn't be the way to make an adaptive keyboard. Seemingly what matters is what letters people normally type. Which would be based on what words people commonly use NOT on the odds of using every word in existence"

Further, it doesn't really matter what letter commonly follows A in the entire English corpus if the previous letters you've typed were CALANDA.
posted by turkeyphant at 10:59 AM on May 22, 2011


You could build a database by installing a keylogger on your own machine. That way you'd get every key hit, including action keys, but you'd also get mistakes/mis-hits too...
posted by trialex at 3:49 PM on May 22, 2011

