
June 20, 2011 8:19 PM

Is there a linguistics term for glued-together Twitter hashtags, such as #vacationwishlist, #isawesome, and #wordsthatdescribeme? Also, is there a good way of splitting these into individual words?

The closest I could find is agglutination, but that doesn't seem to quite capture these kinds of hashtags.

Also, are there any good software libraries (or robust algorithms) for splitting these kinds of hashtags into likely words (e.g. "wordsthatdescribeme" -> {words, that, describe, me})? I'm looking for better techniques than just brute-forcing it against a dictionary, or a smarter way of brute-forcing it.
posted by jasonhong to Computers & Internet (9 answers total) 3 users marked this as a favorite
I don't think there is a linguistics term that covers this exactly. (I am a linguist.)

The dictionary approach is the only obvious one that occurs to me, but one place to look for alternatives would be algorithms used in predictive texting.
posted by lollusc at 8:47 PM on June 20, 2011

(IANA computational linguist, but) N-gram probabilities? Using your example, no words end with -ordst, but plenty end with -ords, which suggests a word boundary right after "words". Since I am not a computational linguist, I don't know where you can find an existing set of n-gram probabilities, but one may well exist.
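
A rough sketch of that idea in Python, in case it helps. The word list here is a tiny made-up stand-in (an assumption, not a real n-gram table), and the function names are purely illustrative. It counts which character n-grams show up at the ends of known words, then flags positions in a hashtag where the preceding characters match one of those endings.

```python
from collections import Counter

# Tiny illustrative word list (an assumption); a real one would come from
# a dictionary file or a corpus-derived n-gram table.
WORDS = ["words", "that", "describe", "me", "birds", "cards", "towards", "at", "once"]

def suffix_counts(words, n):
    """Count how often each n-character suffix ends a word in the list."""
    return Counter(w[-n:] for w in words if len(w) >= n)

def plausible_boundaries(text, counts, n):
    """Return cut positions i where text[i-n:i] was seen ending some word."""
    return [i for i in range(n, len(text)) if counts[text[i - n:i]] > 0]

four_gram_endings = suffix_counts(WORDS, 4)
cuts = plausible_boundaries("wordsthat", four_gram_endings, 4)
```

This only proposes candidate boundaries; it would still need a dictionary or word counts to pick among them.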
posted by Nomyte at 10:04 PM on June 20, 2011

The problem is in some senses insoluble because "wordsthatdescribemeatonce" might either be "words that describe me at once" or "words that describe meat once". You would still need some human intervention to decide which of the possibilities was the right one.
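
To make the ambiguity concrete, here is a minimal sketch of the dictionary approach in Python. The vocabulary is a hypothetical toy set and `segmentations` is just an illustrative name; a real word list would be far larger and would produce many more candidate splits.

```python
from functools import lru_cache

# Hypothetical toy vocabulary (an assumption); swap in a real dictionary.
VOCAB = frozenset({"words", "that", "describe", "me", "meat", "at", "once"})

def segmentations(text, vocab=VOCAB):
    """Return every way to split text into a sequence of words from vocab."""

    @lru_cache(maxsize=None)
    def split_from(start):
        # Reaching the end of the string yields one (empty) completion.
        if start == len(text):
            return [[]]
        results = []
        for end in range(start + 1, len(text) + 1):
            word = text[start:end]
            if word in vocab:
                # Extend each split of the remainder with the current word.
                results.extend([word] + rest for rest in split_from(end))
        return results

    return split_from(0)

for candidate in segmentations("wordsthatdescribemeatonce"):
    print(" ".join(candidate))
```

With this toy vocabulary both readings come back, so something beyond dictionary lookup has to choose between them.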

You might be interested in this Stack Overflow question which addresses essentially the same problem.
posted by AmbroseChapel at 10:41 PM on June 20, 2011

I don't know anything about anything, but would concatenation be closer to what you're looking for?
posted by Comic Sans-Culotte at 10:41 PM on June 20, 2011

"wordsthatdescribemeatonce" would very likely not be used for this very reason in a "true" Twitter hashtag.
posted by Precision at 10:45 PM on June 20, 2011

Another complication is txt-speak and even misspellings in hashtags. A current trending tag: #signsuasidechick. I don't even know how that's supposed to be read.
posted by WasabiFlux at 10:48 PM on June 20, 2011

I think I would just call it "compounding".
posted by cider at 3:30 AM on June 21, 2011

Another complication is txt-speak and even misspellings in hashtags. A current trending tag: #signsuasidechick. I don't even know how that's supposed to be read.
posted by WasabiFlux at 1:48 AM on June 21

Signs (that) U (You are) A Side Chick
posted by Julnyes at 10:44 AM on June 21, 2011

The computational task you want to perform is called "word segmentation". If you don't care too much about handling ambiguous cases correctly, the brute-force approach from StackOverflow is probably your best bet.

The next best thing would be to use 1-gram counts from the Google 5-gram Corpus to resolve ambiguities. Take all your possible segmentations, find the counts of all the words, and multiply the word counts to get a score for each segmentation. The highest-scoring segmentation will usually be the best.
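
Here is a hedged Python sketch of that scoring idea. The counts below are made up for illustration and are not taken from the Google corpus, and `best_segmentation` is a hypothetical name. One deliberate variation: instead of multiplying raw counts, it divides each count by the total and sums log probabilities, since a product of raw counts tends to favor splits with more words. It also uses dynamic programming rather than listing every segmentation first, which gives the same top-scoring split.

```python
import math

# Made-up counts for illustration only (an assumption); real values would
# come from a unigram table such as one derived from a large corpus.
COUNTS = {"words": 500, "that": 9000, "describe": 300, "me": 4000,
          "meat": 200, "at": 8000, "once": 600}
TOTAL = sum(COUNTS.values())

def best_segmentation(text, counts=COUNTS, total=TOTAL):
    """Return the highest-scoring split of text, or None if no split exists."""
    # best[i] holds (log score, words) for the best split of text[:i].
    best = [None] * (len(text) + 1)
    best[0] = (0.0, [])
    for end in range(1, len(text) + 1):
        for start in range(end):
            word = text[start:end]
            if best[start] is None or word not in counts:
                continue
            # Adding logs is equivalent to multiplying the probabilities.
            score = best[start][0] + math.log(counts[word] / total)
            if best[end] is None or score > best[end][0]:
                best[end] = (score, best[start][1] + [word])
    return best[-1][1] if best[-1] else None

print(best_segmentation("wordsthatdescribeme"))
```

Which reading wins for an ambiguous tag like the "me at" versus "meat" case depends entirely on the counts you plug in, so treat the toy numbers as placeholders.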

If you need to do the best possible job, the LingPipe toolkit has state-of-the-art performance for Chinese text, and I suspect you could make it work for English text by replacing the Chinese data files with data derived from the Google Corpus. The downside of something like LingPipe is that the learning curve is likely to be steep if you're not already well-versed in natural language processing.
posted by shponglespore at 2:43 PM on June 21, 2011
