What is the best (human) language for text-to-speech algorithms?
March 2, 2010 3:18 PM

What is the best (human) language for text-to-speech algorithms, that is, the one that sounds least obviously robotic and artificial when synthesized? Are some languages regarded as qualitatively easier to make text-to-speech sound good?

I imagine that such a language would have some of the following characteristics:
- highly regularized spelling (or ideograms)
- short words (so they're easy to piece together from syllables)
- the expected inflection (via tone or pauses) of a speaker over the course of a sentence is minimal

Has any research been done in this area? It would be cool if there were some language in which synthesized voices sounded practically as good as people.
posted by dfan to Technology (17 answers total) 5 users marked this as a favorite
 
My understanding is that Apple's Alex is the top of the line in TTS today. It sounds much more natural than any other synthesized voice I've heard. I don't have any links to research but you can find plenty of samples of Alex speaking online, and probably some papers about how it works.
posted by The Winsome Parker Lewis at 3:33 PM on March 2, 2010


I would bet that English has probably had the most research and work done on it, and probably sounds best.

I don't think short words and syllables would make things too much easier, because a computer can just store a complete pronunciation 'guide' for every word in the language.

The real trick isn't making the words sound correct; it's the way the words are enunciated: the cadence, the emphasis, the pauses. Where and how to draw out words, where to raise and lower pitch. (Even in tonal languages people still use pitch for extra information, I'm sure.)

What makes humans sound human isn't pronouncing the words correctly; it's the way they convey an emotional connection. A person could literally say "blah Blah BLAH blah BLAH" and have it mean something to a listener.
posted by delmoi at 3:37 PM on March 2, 2010


More about Alex: This page explains how the algorithm analyzes text a paragraph at a time, instead of on a per-word basis. Alex also sings Depeche Mode. Since he lost the ability to speak, Roger Ebert has been using Alex to communicate through his laptop, but here's some late-breaking news (as of this afternoon)... a company called CereProc has developed a new custom voice for him, based on audio from his many years of recordings for TV. It's really Roger Ebert's voice, and it sounds great! Exciting stuff.
posted by The Winsome Parker Lewis at 4:02 PM on March 2, 2010 [1 favorite]


I must've put too many links in one comment. The bold "a new custom voice" text is supposed to link here.
posted by The Winsome Parker Lewis at 4:04 PM on March 2, 2010


Response by poster: The Alex stuff is interesting, but what I specifically want to know is what human language works best for text-to-speech algorithms, not what text-to-speech algorithm is best for English.

The real trick isn't making the words sound correct; it's the way the words are enunciated: the cadence, the emphasis, the pauses. Where and how to draw out words, where to raise and lower pitch. (Even in tonal languages people still use pitch for extra information, I'm sure.)

Right, that's what I meant by my third bullet point - are there languages where that is less necessary, and thus a synthesized voice would sound less artificial?
posted by dfan at 4:46 PM on March 2, 2010


TWPL: I had Alex speak your answer. "He" got uncannily excited about saying "Depeche Mode".

(On a Mac: select the paragraph, right-click, Speech -> Start Speaking. The last two sentences also stood out, coming close to emulating emotion.)
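
(If you'd rather script it than dig through the menu, something like this works too - a minimal Python sketch, assuming the stock macOS `say` command-line tool and the Alex voice are installed:)

    import subprocess

    # macOS ships a command-line front end to its speech synthesizer;
    # -v selects the voice.
    subprocess.run(["say", "-v", "Alex", "Alex also sings Depeche Mode."])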

dfan: Morse code? Seriously, though, on a grand spectrum, I wonder how far English really is from the "totally atonal" extreme. IANALinguist; I'm sure obscure languages with two vowels and fifteen consonants are closer.
posted by supercres at 4:52 PM on March 2, 2010


(I'm taking for granted -- erroneously? -- that "totally atonal" would be easier to synthesize, and that more tonal would mean that pitch would be more likely to change literal meaning instead of just feeling and emphasis.)
posted by supercres at 4:55 PM on March 2, 2010


short words (so they're easy to piece together from syllables)

The trouble with this is that neighboring sounds interact even across word boundaries. The second "d" in "did you" sounds a little different than the second "d" in "did he," because in the former your tongue is already moving into position to pronounce the "y." Piecing together a quick sequence of short words can be just as difficult as piecing together a sequence of syllables within a word.
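
To make that concrete, here's a toy sketch of the diphone idea in Python. The inventory and file names are invented; a real system stores actual recorded audio for every transition:

    # Toy diphone lookup: each stored unit spans the *transition*
    # between two sounds, so a word-final 'd' depends on the next word.
    diphone_inventory = {
        ("d", "ih"): "d-ih.wav",
        ("ih", "d"): "ih-d.wav",
        ("d", "y"): "d-y.wav",  # 'd' already gliding toward 'y' ("did you")
        ("d", "h"): "d-h.wav",  # the same 'd' releasing into 'h' ("did he")
        # ...thousands more in a real inventory
    }

    def units_for(phones):
        """Pick one recorded unit per sound-to-sound transition."""
        return [diphone_inventory[(a, b)] for a, b in zip(phones, phones[1:])]

    # "did you" and "did he" share the word "did", but its final 'd'
    # comes from a different recorded unit in each phrase:
    print(units_for(["d", "ih", "d", "y"]))  # ['d-ih.wav', 'ih-d.wav', 'd-y.wav']
    print(units_for(["d", "ih", "d", "h"]))  # ['d-ih.wav', 'ih-d.wav', 'd-h.wav']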

the expected inflection (via tone or pauses) of a speaker over the course of a sentence is minimal

Changes in pitch are actually not very hard to synthesize if you know where they ought to go. The trouble, in English, is getting your computer to guess where the changes in pitch ought to go. The intonation of English sentences is very important, and very easy for human speakers to produce, but very hard for a computer to predict with any degree of accuracy. This is something I'm doing research on right now, actually. It's fascinating, but it's also a royal pain in the ass.
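
To give a sense of how crude the starting point is, here's a cartoon of a rule-based pitch contour in Python - every number is invented, and real intonation models are enormously more involved:

    # Cartoon intonation rule: pitch drifts downward over the sentence
    # ("declination"), with a final rise tacked on for questions.
    def pitch_contour(words, question=False, start_hz=200.0, end_hz=140.0):
        n = len(words)
        fall = [start_hz + (end_hz - start_hz) * i / max(n - 1, 1)
                for i in range(n)]
        if question:
            fall[-1] = start_hz * 1.1  # crude final rise
        return list(zip(words, [round(hz) for hz in fall]))

    print(pitch_contour("did you see that".split(), question=True))
    # [('did', 200), ('you', 180), ('see', 160), ('that', 220)]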

So you're not necessarily looking for an "atonal" language. (I don't think such a language exists anyway — not in the sense of "language which doesn't use pitch for any purpose at all.") You're looking for a language where pitch is highly predictable from the words on the page. A language like Chinese or Yoruba with lexical tone might well be easier in this particular area, because in languages like that pitch is closer to being predictable.

In fact, what you're really looking for is a language where everything about the sound system is highly predictable from the words on the page. My suspicion is that you're gonna hit a wall pretty quickly on that search. Once you eliminate languages with stupid unnecessary obstacles to phonological predictability (like the broken orthography we've got for English), you'll be left with the necessary unpredictability that shows up in any language due to factors like the ones delmoi points out — emotion, emphasis, the ebb and flow of conversation, and so on. You're not gonna find a language without those.
posted by nebulawindphone at 5:57 PM on March 2, 2010


I would guess Korean, since a lot of the pronunciation is built into the way it's written. Written Korean is unique in the world in that it was invented by scholars & not something that evolved through cultural exchange & incremental innovation.

I heard (rumored, semi-confirmed by a Korean girl I know) that a foreign word written in Korean can be pronounced almost flawlessly by a Korean seeing that word for the first time. The only problem is Korean doesn't have an "f" sound (just as Japanese doesn't distinguish "l" from "r"), and a lot of their consonants are softer/different than ours. Which is why the common last name "Park" is sometimes romanized as "Bak."
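
For what it's worth, Hangul is so systematic that you can take a syllable apart with plain arithmetic - a rough Python sketch using the standard Unicode decomposition formula (the letter labels below are approximate romanizations):

    # A Hangul syllable decomposes algorithmically into its letters (jamo),
    # which is part of why Korean writing maps so directly onto pronunciation.
    INITIALS = ["g","kk","n","d","tt","r","m","b","pp","s","ss","",
                "j","jj","ch","k","t","p","h"]
    MEDIALS  = ["a","ae","ya","yae","eo","e","yeo","ye","o","wa","wae",
                "oe","yo","u","wo","we","wi","yu","eu","ui","i"]
    FINALS   = ["","g","kk","gs","n","nj","nh","d","l","lg","lm","lb",
                "ls","lt","lp","lh","m","b","bs","s","ss","ng","j","ch",
                "k","t","p","h"]

    def decompose(syllable):
        """Split one Hangul syllable into initial + vowel + final letters."""
        code = ord(syllable) - 0xAC00
        initial, rest = divmod(code, 21 * 28)
        medial, final = divmod(rest, 28)
        return INITIALS[initial] + MEDIALS[medial] + FINALS[final]

    print(decompose("박"))  # 'bag' - the surname usually written Park or Bak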

I would think Chinese would be difficult because the written language is symbolic & not phonetic at all. I heard (rumored) that there are so many dialects in China that

a) the newscasters speak a special newscaster dialect
b) television shows there have Chinese subtitles so people can understand what's being said on the TV.

English is largely phonetic, but with a bunch of legacy rules that don't make sense. Most European languages are in the same boat as English - sort of phonetic, but with a lot of legacy rules.
posted by MesoFilter at 11:27 PM on March 2, 2010


Some dialects of Chinese rely on pitch so precisely that almost all the people who speak it fluently have perfect pitch.

You can get a dozen people into a studio on separate days to say a phrase & they'll match pitch exactly. You can get them in a year later and they'll match pitch exactly.

If Chinese had a phonetic written version (maybe it does, I don't know - I thought the Japanese were the ones with multiple writing systems), then perhaps it could be produced highly reliably, provided the pitch isn't too dependent on context.
posted by MesoFilter at 11:29 PM on March 2, 2010


(Another anecdote - linguists who travel to these regions of China almost always insult people unwittingly, because the words for "mother" and "dog" have the same pronunciation but different pitches... I don't think pitch is notated in the guidebooks, since there is no common system for notating pitch beyond the musical one.)
posted by MesoFilter at 11:31 PM on March 2, 2010


Awesome question. I'm curious if there's really a true answer to this. The popularity of English means that it has probably had the most work put into it and can be spoken best by text-to-speech programs, as delmoi suggests. But from my limited experience with other languages and their corresponding representational systems, I would guess that it may be one of the WORST to try and do right.

Even Japanese, with its insane writing system, would be easier, because the written language is fundamentally more regular and you just don't have the kinds of written exceptions that you do in English. Kanji can be converted into Hiragana, which is consistent. Hiragana contains the same sounds as Katakana. So you're really only dealing with one simple set of consistent sounds, and the Kanji in fact make it easier to determine differences between homonyms, word boundaries and whatnot. I wouldn't be surprised if Japanese were one of the simpler languages, in the end, to produce a text-to-speech program for.
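
As a toy illustration of that pipeline (the dictionaries here are tiny stand-ins, not a real lexicon or morphological analyzer):

    # Toy kanji -> kana -> sounds pipeline. The point is that once
    # you're in kana, each symbol is (nearly) one sound.
    kanji_to_kana = {"日本": "にほん", "語": "ご"}
    kana_to_romaji = {"に": "ni", "ほ": "ho", "ん": "n", "ご": "go"}

    def sounds(text):
        # step 1: swap kanji words for their kana readings
        for word, kana in kanji_to_kana.items():
            text = text.replace(word, kana)
        # step 2: read the kana off one sound at a time
        return [kana_to_romaji[k] for k in text]

    print(sounds("日本語"))  # ['ni', 'ho', 'n', 'go']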

Another random note: my co-worker who speaks Hungarian, just today, read some Hungarian out loud for me, and told me that each vowel sound is notated exactly in Hungarian. She told me her father used to complain about the irregularities in written English all the time. I mean, "ought" vs. "enough"? WTF. Insert obligatory George Bernard Shaw reference here.

MesoFilter, your point about Chinese is well taken, but I wonder: once you decide that you are going to interpret the written language in front of you as Mandarin or Cantonese or whatever, doesn't your point become moot?

Um... sorry I'm not providing anything definitive here. I'd love to know the answer to this though.
posted by dubitable at 9:33 AM on March 3, 2010


Edit:

...make it easier to determine differences between homonyms, word boundaries and whatnot.

...probably it would be more accurate to say...

...make it easier to determine differences between homophones, word boundaries and whatnot.
posted by dubitable at 9:35 AM on March 3, 2010


I would think Chinese would be difficult because the written language is symbolic & not phonetic at all. I heard (rumored) that there are so many dialects in China that

There's a phonetic system for Chinese called Pinyin, and it's completely phonetically stable (a word is pronounced as it's spelled, using Roman letters, even). And actually Chinese would be pretty easy, probably easier than English. With English, there are lots of ways to pronounce a given string of letters. In Chinese, there's really only one way to pronounce a character, although sometimes the tone will differ depending on what it means in that sentence; only very rarely do you get different pronunciations. For example, the character '大', which means 'big', sounds like 'da', but the word for 'doctor' is '大夫', which is pronounced 'dai fu'.

But anyway, it's rare. Almost always 1 symbol = 1 sound.
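
To sketch what that means for a program (tiny invented dictionaries here, nothing like a real lexicon):

    # Toy character-to-pinyin lookup: per-character defaults, plus a
    # word-level table for the rare exceptions like 大夫 'doctor'.
    char_pinyin = {"大": "da4", "夫": "fu1", "狗": "gou3"}
    word_exceptions = {"大夫": ["dai4", "fu5"]}  # 5 = neutral tone

    def to_pinyin(text):
        out, i = [], 0
        while i < len(text):
            if text[i:i+2] in word_exceptions:  # longest match first
                out += word_exceptions[text[i:i+2]]
                i += 2
            else:
                out.append(char_pinyin[text[i]])
                i += 1
        return out

    print(to_pinyin("大狗"))  # ['da4', 'gou3'] - default readings
    print(to_pinyin("大夫"))  # ['dai4', 'fu5'] - word-level exception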

(People who don't know anything about Chinese probably shouldn't speculate about how easy or difficult different things are -- recognizing the characters wouldn't be hard for a computer at all. And if you know a character, you can say it)
posted by delmoi at 6:55 PM on March 6, 2010


Another anecdote - linguists who travel to these regions of China almost always insult people unwittingly, because the words for "mother" and "dog" have the same pronunciation but different pitches... I don't think pitch is notated in the guidebooks, since there is no common system for notating pitch beyond the musical one

Okay, and that's completely insane. First of all, the method for notating pitch is totally standardized. There are 4 tones and they're numbered. So for "mother" you write "ma1" or "mā".

Also the word for "dog" in Chinese is "Gou" with the third tone. Which, by the way, sounds nothing like the word for mother.
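
And converting between the numbered form and the diacritic form is mechanical - a quick Python sketch (placement rules simplified a bit):

    # "ma1" -> "mā": put the tone mark on the right vowel.
    # Rule of thumb: mark 'a' or 'e' if present; in "ou" mark the 'o';
    # otherwise mark the last vowel.
    TONE_MARKS = {"a": "aāáǎà", "e": "eēéěè", "i": "iīíǐì",
                  "o": "oōóǒò", "u": "uūúǔù"}

    def add_tone_mark(syllable):
        base, tone = syllable[:-1], int(syllable[-1])
        if tone == 5:  # neutral tone: no mark
            return base
        if "a" in base:
            target = "a"
        elif "e" in base:
            target = "e"
        elif "ou" in base:
            target = "o"
        else:
            target = [c for c in base if c in "aeiou"][-1]
        return base.replace(target, TONE_MARKS[target][tone], 1)

    print(add_tone_mark("ma1"))   # mā (mother)
    print(add_tone_mark("ma3"))   # mǎ (horse)
    print(add_tone_mark("gou3"))  # gǒu (dog)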
posted by delmoi at 7:03 PM on March 6, 2010


Okay, and that's completely insane. First of all, the method for notating pitch is totally standardized. There are 4 tones and they're numbered. So for "mother" you write "ma1" or "mā".

I know you were responding to MesoFilter, but I don't understand... aren't there a number of different types of Chinese which all use the same writing system, and isn't it the case that they have different numbers of tones and whatnot, and that the same writing can sound different depending on whether we're talking about Cantonese or Mandarin? Aren't you describing Mandarin? Also... more questions: simplified Chinese writing vs. traditional? Discuss (please).
posted by dubitable at 11:24 AM on March 8, 2010


dubitable: There are lots of different dialects of Chinese which all use the same characters. In fact, that was common in a lot of countries, including Japan, Korea, Vietnam, Laos, etc., just like lots of European countries use Cyrillic or Latin characters. Korea switched to Hangul, Vietnam was switched to Roman letters by the French, and Japan has the katakana and hiragana syllabaries.

Anyway, different dialects are different languages as far as a computer is concerned. Programming a computer to speak "Chinese" means making it able to speak Mandarin or Cantonese or whatever other dialect you want. You can't say a computer can't speak English because it can't also speak Welsh or Scots or whatever.
posted by delmoi at 11:59 AM on March 9, 2010


This thread is closed to new comments.