Word parts and/or rules for combining them (morphemes?)
November 21, 2015 9:09 AM

I want to write a program to generate new, realistic-sounding and -looking words. I want to programmatically create strings like 'bik', 'clible', 'aunstic', and 'cranoak' (if these words don't already exist), and avoid strings like 'bblejkm', 'aunstrbl', and other things that don't look pronounceable. I'm looking for a database of word parts to feed into this program, possibly with a set of accompanying rules. English or any other language (ideally with phonetic representations).

I do not mean combining existing words with known word endings (so, "eat" + "en" = "eaten" isn't interesting, and neither is "cat" + "s" = "cats").

Does there exist anywhere
- a list/database of word parts - phonemes, maybe morphemes - word roots, endings, prefixes, etc.
- and perhaps a companion rule set for combining these into realistic-sounding/spelling words

My examples (and thinking) are English-centric, but I'd love to have this for other languages.


I don't quite know enough about morphology/linguistics to be able to ask for this as intelligibly as I'd like, but here are some made-up examples that might communicate what kind of thing I'm hoping to uncover.


'spl' [appropriate at syllable beginnings, not appropriate at syllable endings]
'ee' [pronounced /i/] [appropriate at syllable beginnings only when spelled 'ea'] [appropriate mid-syllable when spelled 'i', 'ea', 'ee']
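
For illustration only, here is one way the two made-up entries above could be stored as data. This is a minimal sketch: the field names (`sound`, `spellings`, `start`, `middle`) and the helper functions are hypothetical, not taken from any existing database.

```javascript
// Hypothetical records modelled on the two made-up examples above.
// Each part lists the spellings it may take at each position in a syllable;
// a missing position means the part is not allowed there.
const wordParts = [
  { sound: 'spl', spellings: { start: ['spl'] } },
  { sound: 'i', spellings: { start: ['ea'], middle: ['i', 'ea', 'ee'] } },
];

// Spellings a part may take at a given position (empty if not allowed there).
function spellingsAt(part, position) {
  return part.spellings[position] || [];
}

// Every spelling of a syllable-initial part followed by a mid-syllable part.
function combine(startPart, middlePart) {
  const results = [];
  for (const a of spellingsAt(startPart, 'start')) {
    for (const b of spellingsAt(middlePart, 'middle')) {
      results.push(a + b);
    }
  }
  return results;
}

console.log(combine(wordParts[0], wordParts[1])); // [ 'spli', 'splea', 'splee' ]
console.log(combine(wordParts[1], wordParts[0])); // [] because 'spl' has no mid-syllable spelling
```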


I'm not 100% opposed to trying to manufacture such a list -- although I'm woefully unqualified to do so -- or to making some kind of genetic algorithm to create something like this itself -- although, again, I have no experience with this. However, if some awesome linguists somewhere have figured out something clever already, I'd love to use that. It seems likely to me that this already exists, I just don't know how to find it.

I have some training in French and Russian. I have a vague understanding of how tonal languages work, but tonal languages are probably too complex for me to handle now.

My efforts to search for "how to make new words" and similar efforts have yielded information on compound words, ways that words are combined with endings and prefixes to change meaning in predictable ways, and sentence-level grammar rules -- these are not what I'm looking for.

Thank you in advance! This is part of a project that I've been interested in pursuing for years, and I'm finally taking it seriously; any and all advice, or suggestions for other resources (including specific linguistics departments with research in the neighborhood of this) would be extremely gratefully received.
posted by amtho to Writing & Language (19 answers total) 9 users marked this as a favorite
 
The canonical tool for this is the Markov chain generator, which analyzes a corpus (list of known words) for frequencies of letter pairs, and then randomly generates new words based on these frequencies. Here is an example: http://max.marrone.nyc/Markov-Word-Generator/. Can also be used to generate sentences: see Mark V Shaney; Garkov.
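
A minimal sketch of the letter-pair idea in JavaScript, assuming a tiny hand-typed corpus; a real run would use a full word list. The function names and the `^` and `$` start/end markers are illustrative choices, not the linked generator's code.

```javascript
// Count how often each letter follows each other letter in the corpus.
// '^' marks the start of a word and '$' marks the end.
function buildTable(words) {
  const table = {};
  for (const word of words) {
    const chars = ['^', ...word.toLowerCase(), '$'];
    for (let i = 0; i < chars.length - 1; i++) {
      const cur = chars[i];
      const next = chars[i + 1];
      table[cur] = table[cur] || {};
      table[cur][next] = (table[cur][next] || 0) + 1;
    }
  }
  return table;
}

// Pick a key from a {key: count} object, with probability proportional to count.
function weightedPick(counts, rng) {
  const entries = Object.entries(counts);
  const total = entries.reduce((sum, [, n]) => sum + n, 0);
  let r = rng() * total;
  for (const [key, n] of entries) {
    r -= n;
    if (r < 0) return key;
  }
  return entries[entries.length - 1][0]; // guard against rounding error
}

// Walk the table from the start marker until the end marker or maxLength.
function generateWord(table, rng = Math.random, maxLength = 12) {
  let cur = '^';
  let out = '';
  while (out.length < maxLength) {
    const next = weightedPick(table[cur], rng);
    if (next === '$') break;
    out += next;
    cur = next;
  }
  return out;
}

const table = buildTable(['crane', 'cloak', 'stick', 'table', 'bike', 'austere']);
console.log(generateWord(table));
```

Because only adjacent pairs are tracked, the output can still drift into awkward clusters; counting letter triples instead of pairs is the usual next step.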
posted by kindall at 9:16 AM on November 21, 2015 [2 favorites]


In linguistics we talk about the difference between "impossible non-words" and "possible non-words" and there has been a lot of interesting study on this (i.e. different parts of the brain "light up" when presented with these different sorts of words) which might be a good place to read up on it.
posted by jessamyn at 9:18 AM on November 21, 2015 [9 favorites]


Conlang ("constructed language") is the term for the practice of making up languages from scratch for fun or for artistic purposes (the Klingon language in the Star Trek franchise, for example) which you could use as a search term to look for things like this.

Most of the software I found when I went looking for conlang-generator tools used English as a usage example and hence included the sort of database you're talking about. There were similar tools for generating character names in books and games. (IIRC I found a Windows 3.0 application, of all things, and managed to get it running.) I should note that none of the ones I found appeared to be the work of linguists; they were made by amateurs and enthusiasts taking a shot at it.

The blog and web site of MeFi's own zompist are great resources for conlang-related stuff.
posted by XMLicious at 9:39 AM on November 21, 2015 [1 favorite]


Is your goal (1) figuring out for yourself how to make such a program, or (2) having the program itself?

Because if it's (2), there are quite a few databases and tools out there already; just google "Non-word generator".
posted by damayanti at 9:42 AM on November 21, 2015


"Phonotactics" is what you're looking for, I think--the rules for allowable combinations of phonemes in a language. And if you google phonotactics + generator, you'll find a lot of resources (zompist is first in my results).
posted by wintersweet at 9:42 AM on November 21, 2015 [3 favorites]


(I'll also note that looking into English phonotactics will only get you part of the way there, because English orthography doesn't have nice one-to-one correspondences between phonemes and graphemes.)
posted by damayanti at 9:44 AM on November 21, 2015 [2 favorites]


Another linguistics keyword to search on is "syllable structure."

Basically, every language has its own set of rules for what can make up a syllable. In some languages syllables are tiny: there are a lot of languages where all syllables are a single consonant followed by a single vowel and that's it. English is really unusual among languages for how big and complicated its syllables can get — like "strengths" or "fifths."

In a lot of languages, generating possible non-words is just a matter of generating a bunch of possible syllables and sticking them together. (In some languages it gets more complicated, because there are additional rules against putting some kinds of syllables next to each other. English is pretty relaxed about what kinds of syllables can go together, though, so making possible non-words by generating random English-y syllables and sticking them together is pretty reasonable.)
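
As a rough sketch of "generate syllables and stick them together" in JavaScript: the onset, nucleus, and coda lists below are a small hand-picked assumption, nowhere near a full inventory of English syllable parts.

```javascript
// Tiny, hand-picked lists (an assumption for illustration, not a real inventory).
const ONSETS = ['', 'b', 'k', 'cl', 'cr', 'st', 'spl', 'n', 'm'];
const NUCLEI = ['a', 'e', 'i', 'o', 'u', 'oa', 'au'];
const CODAS = ['', 'k', 'n', 'st', 'm', 'ns'];

// Pick one element of a list uniformly at random.
function pick(list, rng) {
  return list[Math.floor(rng() * list.length)];
}

// A syllable is an optional onset, a vowel nucleus, and an optional coda.
function randomSyllable(rng = Math.random) {
  return pick(ONSETS, rng) + pick(NUCLEI, rng) + pick(CODAS, rng);
}

// A word is just a fixed number of syllables stuck together.
function randomWord(syllableCount, rng = Math.random) {
  let word = '';
  for (let i = 0; i < syllableCount; i++) {
    word += randomSyllable(rng);
  }
  return word;
}

console.log(randomWord(2));
```

Keeping the three slots as separate lists makes it easy to swap in another language's syllable shape, for example by emptying `CODAS` for a strictly consonant-plus-vowel language.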
posted by nebulawindphone at 9:54 AM on November 21, 2015


That's ridiculously easy to do in hiragana. All you have to do is make sure the word doesn't start with ん. Aside from that, any string of hiragana morae will be pronounceable.
posted by Chocolate Pickle at 9:55 AM on November 21, 2015 [1 favorite]


(Oops, one other rule: you can't have more than two in a row of あいうえお, and it's probably safer to even avoid doubling them. So あい is legal but probably should be avoided on general principle.)
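
A small JavaScript sketch that encodes only the two rules stated in these comments: no leading ん, and no more than two bare vowels in a row. The mora list is a partial, hand-typed subset chosen for brevity, and the function names are made up for this example.

```javascript
const VOWELS = ['あ', 'い', 'う', 'え', 'お'];

// Partial mora list (an assumption for brevity; voiced and small kana are omitted).
const MORAE = [
  ...VOWELS,
  'か', 'き', 'く', 'け', 'こ',
  'さ', 'し', 'す', 'せ', 'そ',
  'た', 'ち', 'つ', 'て', 'と',
  'な', 'に', 'ぬ', 'ね', 'の',
  'ま', 'み', 'む', 'め', 'も',
  'ら', 'り', 'る', 'れ', 'ろ',
  'ん',
];

// True if the mora sequence is non-empty, does not start with ん,
// and never has more than two bare vowels in a row.
function isAllowed(morae) {
  if (morae.length === 0 || morae[0] === 'ん') return false;
  let run = 0;
  for (const m of morae) {
    run = VOWELS.includes(m) ? run + 1 : 0;
    if (run > 2) return false;
  }
  return true;
}

// Build a word one mora at a time, discarding any pick that breaks a rule.
function randomHiragana(length, rng = Math.random) {
  const out = [];
  while (out.length < length) {
    const candidate = MORAE[Math.floor(rng() * MORAE.length)];
    if (isAllowed([...out, candidate])) out.push(candidate);
  }
  return out.join('');
}

console.log(randomHiragana(4));
```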
posted by Chocolate Pickle at 12:04 PM on November 21, 2015


Response by poster: To clarify, in case anyone is still interested:

My goal is to have this be part of a larger system to find words to represent to-be-defined new concepts.

I'd like to have a program that I can modify. I hope to figure out how to generate a set of possible non-words (thanks Jessamyn) that are likely to be appropriate to a given concept.

The Markov chain examples are promising and may be a good starting point. I'd hoped to find something more intentional.

What I'd love to find is a database of word parts (not a database of already-generated non-words) that I can somehow connect to existing defined words. Maybe something like:

- Define concept [e.g. the specific quality of movement of cats that triggers phobic or disgust responses in some people - there would be a lot more text and possibly videos to make it clear what was intended here and what wasn't];

- Get an input list of related words; they don't have to precisely match the concept [like: cat-like, slithery, agile, repulsive, creepy, silent, spider, sneaky, triangle head];

- Use some kind of meaning network to find more words connected in meaning (maybe some kind of Roget's thesaurus database): [arachnoid, hissing, herpetid, graceful, scaled, underhanded, wedge];

- Match word lists to database of word parts [this would make a big data structure of syllables and syllable parts; cat-like and slithery would make: cat, at, ca, like, lie, ike, sli, ith, ther, er, i, etc.]

- combine to generate a list of possible non-words;

- score non-words according to their similarity to existing words, length, complexity of spelling, and other criteria;

- present candidate non-words for evaluation by humans.
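
The last few steps of that list could be roughed out as below. This is a sketch under loud assumptions: word parts are plain letter chunks of 2 to 4 characters rather than real syllables, the known-word list is passed in by the caller, and the score is a placeholder that only looks at length. It does nothing about pronounceability.

```javascript
// Collect every contiguous chunk of minLen to maxLen letters from each word.
// Input words are split on non-letters, so 'cat-like' is treated as 'cat' and 'like'.
function fragments(words, minLen = 2, maxLen = 4) {
  const found = new Set();
  for (const raw of words) {
    for (const word of raw.toLowerCase().split(/[^a-z]+/)) {
      for (let start = 0; start < word.length; start++) {
        for (let len = minLen; len <= maxLen && start + len <= word.length; len++) {
          found.add(word.slice(start, start + len));
        }
      }
    }
  }
  return [...found];
}

// Join every ordered pair of fragments, skipping anything in the known-word list.
function candidates(frags, knownWords) {
  const known = new Set(knownWords);
  const out = new Set();
  for (const a of frags) {
    for (const b of frags) {
      const word = a + b;
      if (!known.has(word)) out.add(word);
    }
  }
  return [...out];
}

// Placeholder score: candidates closer to the target length score higher.
function score(word, targetLength = 6) {
  return -Math.abs(word.length - targetLength);
}

// Best-scoring candidates first, for a human to review.
function rank(words, limit = 10) {
  return [...words].sort((a, b) => score(b) - score(a)).slice(0, limit);
}

const frags = fragments(['cat-like', 'slithery']);
console.log(rank(candidates(frags, ['like', 'cat'])));
```

Swapping `score` for something that measures similarity to existing words or spelling complexity would not change the rest of the pipeline.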


I have some education in linguistics, but it was a long time ago. Since many of the students with me were studying computer science at the time, the linguistics folks focused a lot on grammar, voice recognition, and speech generation, rather than phonetics and morphology, so my vocabulary is a bit lacking.


(I chose the cat-movement example because it's relatively easy for me to focus it; I'm actually interested in other things that are difficult to describe.)
posted by amtho at 12:29 PM on November 21, 2015


Hmm. I'm not sure I really understand what you're up to here, but if you want "word chunks" with specific sensory connotations it sounds like you might be interested in phonaesthemes.
posted by nebulawindphone at 1:17 PM on November 21, 2015 [1 favorite]


Research derivational morphology, phonotactics, and sound changes. Check out software tools for conlanging. Pay particular attention to Gen and SCA^2. Read comprehensive grammars for other existing languages, especially ones that you are not familiar with.
posted by Sticherbeast at 1:17 PM on November 21, 2015


"Sound symbolism" is also a big thing here, i.e. the idea that many people would think "slibipip" sounds wet and slithery, whereas "badumba" sounds heavy and awkward.
posted by Sticherbeast at 1:24 PM on November 21, 2015


Another important thing, especially for English: consider how many science/technology words tend to comprise bits of Latin/Greek roots, whereas words of Anglo-Saxon derivation often seem more earthy. Compare "feline feces" with "cat poop". What's going on here is not just about sounds qua sounds.
posted by Sticherbeast at 1:35 PM on November 21, 2015


Hey, I wrote a series of utilities that go as far as:

– Analyzing words and noting how often phonemes follow each other in the CMUdict corpus. You can use this to get a sense of what sequences of phonemes are common in real language.

– Then, I have another Node module that uses that information to complete a phoneme sequence, given a partial sequence. It completes the sequence by taking the last phoneme then picking from the phonemes that have been observed following the last phoneme. Its choice is random, but weighted by the observed following frequencies. It keeps doing that until it finds a phoneme that is likely to end the word. So, here's an example program that uses the module:
var sequencer = require('phoneme-sequencer');

var seq = sequencer.completeSequence(
  {
    base: ['START', 'L'], // Does not have to include 'START' or 'END'
    boundary: 'syllable', // or 'word'
    seed: 800
  }
);

console.log(seq);
The output here is going to be something like `['START', 'L', 'EH', 'K', 'S', 'END']`. So we started with "L" and ended up with a sequence of phonemes that reads something like "leks".

The part I have not written is something that maps a sequence of phonemes back to a spelling. I do have something that can map a sequence of phonemes back to real words, but what you're looking for is the complement of that – mapping sequences back to non-existent words.
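
For what it's worth, the crudest possible version of that missing piece is a fixed lookup from each phoneme to one default spelling. The sketch below is not part of the modules described here; the `SPELLINGS` table is a hand-picked, partial assumption, and a real mapper would need context-dependent choices.

```javascript
// One default spelling per phoneme, using CMUdict-style symbols.
// Partial and hand-picked for illustration (an assumption, not a real mapping).
const SPELLINGS = {
  AA: 'o', AE: 'a', AH: 'u', EH: 'e', IH: 'i', IY: 'ee', OW: 'oa', UW: 'oo',
  B: 'b', D: 'd', F: 'f', G: 'g', K: 'k', L: 'l', M: 'm', N: 'n',
  P: 'p', R: 'r', S: 's', T: 't', V: 'v', Z: 'z', SH: 'sh', CH: 'ch', TH: 'th',
};

// Turn a phoneme sequence into a spelling, ignoring START/END markers
// and any stress digits attached to vowels (e.g. 'EH1').
function spell(phonemes) {
  return phonemes
    .filter((p) => p !== 'START' && p !== 'END')
    .map((p) => {
      const spelling = SPELLINGS[p.replace(/[0-9]/g, '')];
      if (typeof spelling !== 'string') {
        throw new Error('No spelling for phoneme ' + p);
      }
      return spelling;
    })
    .join('');
}

console.log(spell(['START', 'L', 'EH', 'K', 'S', 'END'])); // leks
```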

You could write that using these modules, or you can use them as a reference if the abstractions are not to your liking, or simply take an entirely different approach!

If you do have questions about these modules, feel free to ask. I favored getting an answer out over completeness, so there could be a lot of essential details left out.
posted by ignignokt at 1:50 PM on November 21, 2015 [2 favorites]


You might do some reading on the methods and thought processes of Lewis Carroll and other inspired nonsense poets from the days before algorithmic generation was practical, at least to inform your own algorithms. Here is a short discussion on the classic Jabberwocky.

In fact, just read everything ever written by Lewis Carroll, if you haven't already. ;)
posted by MoTLD at 3:31 PM on November 21, 2015


Response by poster: Thank you - these are really interesting directions for me to explore. I'm going to take some time and follow all this info, while refining my own plan for how my project is actually going to work.

ignignokt - What was the motivation for making your utilities? It's an intriguing set of tools, and I'm wondering what you are using them for. (Are they in Javascript?)


About conlang - I did look into this stuff a little, but it doesn't seem to fit my needs. Conlang seems to be about you, a new language's creator, coming up with a set of rules and inputs that you like (for whatever reason), and then working with those rules to make a new whole language. I'm more interested in identifying pre-existing (and probably bigger and more complex than I alone could invent) construction rules and inputs, or perhaps a complete set of possible rules.

I looked briefly at some of the conlang links in the (very kind) answers above, but didn't yet see stuff exactly right for my needs. People seem so enthusiastic about recommending conlang, though, that I'm going to keep looking in case I'm missing something.


Thanks for the vocabulary, too, everyone!
posted by amtho at 3:37 PM on November 21, 2015


Take a closer look at some of the conlang materials, especially the books by Mark Rosenfelder. Trying to create new, realistic-seeming words that hew to a set of preexisting rules is exactly what one does in any halfway decent conlang. To do that, you need an understanding of how derivational morphology, phonotactics, etc. actually work across languages. That's why so many of the better tools, books, etc. about conlangs incorporate so much real-world information about linguistics. The fact that the "rules" of English were not invented by you does not matter very much.
posted by Sticherbeast at 7:51 PM on November 21, 2015


What was the motivation for making your utilities? It's an intriguing set of tools, and I'm wondering what you are using them for. (Are they in Javascript?)

They are in JavaScript.

I used them to make a module for finding loose rhymes (where "loose rhymes" is sorta explained in that README), a module that uses phonemes to find homophones, and a (work in progress) bot that mishears famous quotes by finding soundalikes.

I have a lot of ideas that depend on finding similar or related phoneme sequences. (This reminds me that I should actually get back to working on them instead of stopping in the middle of the marathon!)

Basically, my motivation is similar to yours and that of others in this space who want to see strange transformations of language. Come to think of it, you may want to get in touch with the author of Every Non-Word, which recombines real syllables to make new words.
posted by ignignokt at 12:25 PM on November 22, 2015

