How slow is my voice computer?
March 15, 2010 4:11 AM Subscribe
geek alert! If two humans were to speak to each other using only 1's and 0's, is it possible to determine a baud rate?
Back in the old days I had a 300bps cradle modem, now we have broadband. I am just curious as to how ineffective human voice communication is.
Literally just saying the words, I can speak about four "one"s and "zero"s per second, which I believe is simply 4Bd.
It looks like the world record holder for fastest speaker is Steve Woodmore (near the bottom of the page), who speaks at about 600 words per minute, or 10Bd.
posted by lucidium at 4:49 AM on March 15, 2010
20.3 syllables per second is the current record held by rapper Rebel XD. I'll leave it to someone else to figure out the average ratio of 1's (1 syllable) to 0's (2 syllables) in binary translations of English words in common usage. For the sake of this hypothetical, let's replace the word "zero" with "naught" and skip right past that problem. We'll also skip right past the idea that someone can comprehend that many syllables per second, because after all, they're already speaking binary so why not?
Answer: roughly 20 bits per second, optimally.
posted by empyrean at 5:05 AM on March 15, 2010
Literally just saying the words, I can speak about four "one"s and "zero"s per second, which I believe is simply 4Bd.
But then you wouldn't choose to use burdensome words like "one" and "zero". Simple vowels ("ee" for one and "o" for zero) might be faster. 6 kbaud for Speex seems too much for just conveying the textual content -- 5 words/sec at 8 chars/word, 8 bits/char would be 320 Bd.
posted by gijsvs at 5:10 AM on March 15, 2010
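The arithmetic behind that 320 figure can be checked directly; the 8 chars/word and 8 bits/char are the comment's assumptions, not measured averages:

```python
# Plain-text throughput at gijsvs's assumed speaking rate.
words_per_second = 5
chars_per_word = 8   # assumption from the comment, not a measured average
bits_per_char = 8    # 8-bit characters
print(words_per_second * chars_per_word * bits_per_char)  # 320 bits/s
```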
There are two distinct problems here.
The first is to do with the voice as a means of transmitting arbitrary digital information. You could speak binary, hex, decimal, a specific set of made-up sounds, or whatever you like, and the human modem at the other end of the conversation would receive that and pass it on to be decoded into text, a GIF image, or whatever. For that a baud rate is calculable.
The second problem is to do with the efficiency of 'human voice communication'. As teraflop hints, there is much more information in human conversation than plain text. And how you measure that information in terms of bits and bytes is probably somewhat contentious. Anyway, to get all of that information across your network, you'd either have to use some data format that can accurately encode every nuance of speech along with the plain text, or else just send the compressed audio.
posted by le morte de bea arthur at 5:50 AM on March 15, 2010
So to figure this out, we really just want to measure how much information two people can communicate through voices, and then convert that into some number of bits of information per second.
floam is on the right track. You asked about baud rate, you got answers about bit rate.
Baud rate is defined as symbols per second, not bits per second. Symbols are the phase and/or amplitude changes (i.e., the modulation) of the communication channel.
Symbols represent bit(s), which in turn represent the underlying information, so it would depend on how you define human speech in terms of information rate and how many bits are needed to precisely convey that information.
posted by three blind mice at 5:52 AM on March 15, 2010 [2 favorites]
Best answer: I think the answer is "as low a bitrate as you can encode decipherable video."
Human communication is insanely efficient, and what you say is widely reputed to be a mere fraction of what you communicate. Most of that comes from how you say things, which includes not only tone of voice and inflection, but a host of other subtle body language cues and even where the conversation is happening. The idea that the full richness of human communication can be effectively reduced to a purely digital medium is just silly. Even the written word lacks some of the vitality of face-to-face interaction, though it picks up a few nuances of its own.
I'm not just talking about adding "channels" to the main data stream. I'm talking about shades of meaning, combinations of emotion, uncertainty, trepidation, irony, intimacy, the whole gamut of personal touches that make personal interaction, indeed, friendship itself, as amazing as it can be. None of these things can be reduced easily to a digital format, if they can be reduced at all.
And the thing of it is, a lot of this depends on the listener, not the speaker; communication is, as might be guessed from the word itself, something of a communal activity. Exactly the same speech can produce markedly different reactions in different places at different times with exactly the same audience. How something affects you is based partly on what the speaker is saying, but also on your own experience. An offhand remark not especially significant in its own right can have significant emotional repercussions if it was a favorite remark of a lost loved one, for example. And words and phrases take on cultural baggage over time, which a skillful communicator can not just tap into, but signal to his audience that he is tapping into this potentially vast store of unspoken meaning without actually saying anything to that effect. How do you encode that?
So I come again to my original answer: to truly digitize all that is being communicated in human speech requires a video camera. The lowest H.264 bitrate is 64kbps, but I'd think some multiple of that would be a better guess if you wanted to be sure you were getting everything. Even then you're probably going to be missing subtleties you'd pick up if you were in the same room with someone.
posted by valkyryn at 6:01 AM on March 15, 2010
Best answer: le morte de bea arthur is exactly right. Baud rate is the number of symbols per second. "Symbol" is a bit of an arbitrary definition; for voice we can define it to be words, syllables, or even phonemes.
A simplified example of calculating the bit rate of a "speech modem"*: Take the 20.3 syllables per second rate empyrean referenced. The next step is to determine the number of syllables available, which from here* is more than 170,000. If we're not worried about words, and can arbitrarily assign syllables to bit representations, then we can just take the log base 2 of 170,000, which is about 17. In other words, each symbol can encode 17 bits, which gives you about 345 bits per second.
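A quick sketch of that calculation (the 170,000-syllable inventory is the unverified WikiAnswers figure, so treat it as illustrative):

```python
import math

# "Syllable modem" estimate: bits per symbol is log2 of the symbol inventory.
syllable_inventory = 170_000  # unverified WikiAnswers figure
bits_per_syllable = math.floor(math.log2(syllable_inventory))  # 17
syllables_per_second = 20.3   # the speed-rap record cited earlier

bitrate = bits_per_syllable * syllables_per_second
print(bits_per_syllable, round(bitrate))  # 17, about 345 bits/s
```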
So the original post is about 263 bytes (uncompressed), which is 2,104 bits, so it would take about 6 seconds to transmit this post in speech binary -- roughly the same as speaking the post in regular English (albeit very quickly). It's not efficient because we haven't compressed anything.

In other words, the phrase "two to too to too geek" has 5 syllables that sound like "to" but only 1 matching "geek", yet each syllable gets an equal number of bits, so the phrase requires 102 bits. We could do better by using a smaller bit size for the "to" symbol -- say 3 bits for "to" because it occurs so often, and 25 bits for "geek" because it's relatively rare. Now the phrase can be transmitted in 40 bits. This is what compression algorithms do, and there are plenty out there.

Let's say we compress the original post text to 10% of its original size before using our voice modems to send it. Now we have 210 bits of information, which can be sent in 0.6 seconds. Now we're talking about some efficiencies. So if we had a compression algorithm that could compress text by 90% (not a stretch at all) and then used the syllable modem to transmit our text, we could achieve around 3,450 bits per second of *information* (text). Our "modem" speed is still just 345 bits per second, though, so it would take quite a while to transfer a JPEG image or other already-compressed data.
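That variable-length trick is essentially Huffman coding. A minimal sketch, using a toy token stream rather than real English frequencies, shows frequent tokens earning shorter codes:

```python
import heapq
from collections import Counter

def huffman_codes(tokens):
    """Build a prefix code where frequent tokens get shorter bit strings."""
    freq = Counter(tokens)
    # Heap entries: (frequency, unique tiebreak, {token: code-so-far}).
    heap = [(f, i, {tok: ""}) for i, (tok, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    if len(heap) == 1:  # degenerate case: only one distinct token
        return {tok: "0" for tok in heap[0][2]}
    tiebreak = len(heap)
    while len(heap) > 1:
        f1, _, lo = heapq.heappop(heap)
        f2, _, hi = heapq.heappop(heap)
        merged = {t: "0" + c for t, c in lo.items()}
        merged.update({t: "1" + c for t, c in hi.items()})
        heapq.heappush(heap, (f1 + f2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

# Toy stream where "to" dominates, so it earns the shortest code.
codes = huffman_codes("to to to to to to geek two".split())
assert len(codes["to"]) < len(codes["geek"])
```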
You'd get different answers for words and phonemes; you would probably get the best result using phonemes, but I don't know how to determine how many phonemes one could utter in a second.
* I have no idea if the number of syllables that WikiAnswer had is accurate or not, so don't use the information here as authoritative, only illustrative.
posted by forforf at 7:41 AM on March 15, 2010
Assuming that you don't mean "how much meaning could we pack into the words 'one' and 'zero' using tone, inflection, timing, etc." and really mean how quickly people could verbally communicate using only two symbols... I think the transmitting end would be limited by the frequency and voice skills used (an opera soprano could probably modulate more data than me). The receiving end, if it had to be realtime, would depend on the transcribing ability of the receiver - I think that would be your limiting factor. (I believe it's possible to speak faster than people can write/type/etc.) Assuming some type of savant ability on the receiving end...
Hm. I have to go to work now - this is a fun question.
posted by TravellingDen at 8:04 AM on March 15, 2010
Would Morse Code count as two humans "speaking" to each other in a binary way? Because that can go really fast between trained operators.
posted by Eyebrows McGee at 8:36 AM on March 15, 2010
"Symbol" is a bit of an arbitrary definition; for voice we can define it to be words, syllables, or even phonemes.
"symbol" is not an arbitrary definition. In communication theory it is the rate of the phase and/or amplitude changes of a carrier signal(s) as I explained above. What those phase and amplitude changes represent depends on the modulation chosen. In a communication channel, symbols per second are generally less than bits per second (although the opposite can be true).
"so it would take about 6 seconds to transmit this post in speech binary,"
What do you mean by speech binary? And using what sort of modulation? At the same baud rate, i.e., the same number of symbols per second, a QPSK modulation would transmit 2 bits of information for each symbol, while a higher-order modulation like 16-QAM would transmit 4 bits.
posted by three blind mice at 8:45 AM on March 15, 2010
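The bits-per-symbol figures for standard modulation orders follow directly from the constellation size; a quick table (general textbook values, not tied to any particular system mentioned in the thread):

```python
import math

# Bits carried per symbol = log2(number of constellation points).
for name, points in [("BPSK", 2), ("QPSK", 4), ("8-PSK", 8), ("16-QAM", 16)]:
    print(f"{name}: {int(math.log2(points))} bits/symbol")
```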
"symbol" is not an arbitrary definition. In communication theory it is the rate of the phase and/or amplitude changes of a carrier signal(s) as I explained above. What those phase and amplitude changes represent depends on the modulation chosen. In a communication channel, symbols per second are generally less than bits per second (although the opposite can be true).
"so it would take about 6 seconds to transmit this post in speech binary,"
What do you mean by speech binary? And using what sort of modulation? At the same baud rate, i.e., the same number of symbols per second, a GMSK modulation would transit 2 bits of information for each phase change. At the same baud rate, i.e., the same number of symbols per second, a higher order modulation like 16-QAM quadrature modulation (such as is used in EDGE) would transmit 4 bits of information.
posted by three blind mice at 8:45 AM on March 15, 2010
Best answer: I am just curious as to how ineffective human voice communication is.
Well, it's inefficient because you're using an inefficient system in your hypothesis. We have language and words with meaning because we're not computers. We don't need to spell out each letter; we say an entire word in one or two syllables.
For instance, let's say I say "fish." This is one syllable. You instantly understand what I am saying.
For a modem to pass that info in binary it's:
01100110011010010111001101101000
Which is 32 characters. So let's say you do "ee" and "o" -- it's 32 syllables when the English encoding was just 1. Why use non-language encodings? We're not robots.
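That bit string is just the 8-bit ASCII codes of the four letters; a quick check (note it comes to 32 bits):

```python
# "fish" spelled out as 8-bit ASCII, one spoken syllable per bit.
bits = "".join(f"{ord(c):08b}" for c in "fish")
print(bits)       # 01100110011010010111001101101000
print(len(bits))  # 32
```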
Also, a modem does its own compression. Substituting x for 100 and y for 101 (scanning left to right), the 32-bit string becomes:

01x1x110xy1x1y10x0

That's 18 symbols. Now substitute b for 01 and a for 110:

bx1xaxy1x1y10x0

Ta da! Now we're down to 15 symbols, less than half the original 32. Still nowhere near as efficient as just saying "fish." So as you can see, you'll really need to dig deeper into the idea of language as compression to get the proper bit rate. It turns out human language is very data dense -- around 30 bits per syllable, judging by the one-syllable word "fish" against its 32-bit uncompressed binary encoding. Regardless, this is a silly question. How humans and computers communicate is very, very different.
posted by damn dirty ape at 9:07 AM on March 15, 2010
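The dictionary-substitution idea above can be sketched with plain left-to-right replace passes. The substitution letters follow the comment; exact intermediate strings depend on scan order, so treat them as illustrative only:

```python
# Repeatedly replace a common substring with a single new symbol.
bits = "".join(f"{ord(c):08b}" for c in "fish")  # 32 raw bits

rules = [("100", "x"), ("101", "y"), ("01", "b"), ("110", "a")]
packed = bits
for pattern, symbol in rules:
    packed = packed.replace(pattern, symbol)

# Fewer symbols than raw bits, but still far more than the one syllable "fish".
print(len(bits), len(packed))  # 32 -> 15
```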
Best answer: tbm, in communication theory symbols are completely arbitrary. A symbol is nothing more than something that is agreed upon (between sender and receiver) to hold a certain amount of information (which can include bits).
In many communication schemes you are correct; notice that in 16-QAM you have 16 possible symbols, and log2(16) = 4 bits of information for each symbol.
We could create an encoding scheme out of anything; say I choose A, $, *, and p as my symbols.
Using this method, "A*$$pA*$" is a valid encoding, and one would be able to decode exactly what I sent. So although you are right that carrier-based encoding methods (GMSK, QPSK, 16-QAM, etc.) use phase/amplitude changes to create a constellation of symbols, with each symbol providing the encoding for the digital signal, that is not the entire universe of encoding. High-compression speech codecs, for example, use other techniques (like mapping speech patterns to a code-book of patterns).
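A sketch of that four-symbol alphabet: with four symbols, each carries log2(4) = 2 bits. The particular bit-pair assignments below are arbitrary illustration; any mapping agreed between sender and receiver works:

```python
# Four arbitrary symbols, two bits each.
ENCODE = {"00": "A", "01": "$", "10": "*", "11": "p"}
DECODE = {v: k for k, v in ENCODE.items()}

def encode(bits):
    assert len(bits) % 2 == 0, "pad to an even number of bits"
    return "".join(ENCODE[bits[i:i + 2]] for i in range(0, len(bits), 2))

def decode(symbols):
    return "".join(DECODE[s] for s in symbols)

msg = "0110100111"
assert decode(encode(msg)) == msg  # round-trips exactly
```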
Getting back to the OP question and what I meant by "speech binary": by "speech binary" I only meant the agreed-upon mechanism that would be used to encode digital data as speech. As I've tried to explain, there is not only one way to do this; the only criterion (to dive deeper than is necessary) is that the decoding mechanism can reliably decode the original signal. So, hypothetically one could use the phase and frequency of the voice to encode the digital data into the typical constellation diagram, but I don't know that it would be the most efficient method *for human vocals/speech*. Using pitch and volume is an interesting way of encoding, but it seems to me decoding it would be awfully error prone (i.e., low SNR). A better method (from the standpoint of decoding and maintaining a reasonably high SNR) would be to use the mechanisms that humans are already wired for (words, syllables or phonemes). Those would seem to me to be much more robust for decoding without error by a human receiver.
I found this question fun since I've had to deal with communication theory for most of my life, and this was kind of a neat way to think about all the different ways of approaching and solving the problem. I happen to like my approach; it's not the only one, but it was one I could quantify and contemplate end to end.
On preview: I don't think it's a silly question at all. It may not be the most efficient method, but it's a fun thought experiment and it could even be fun to implement. It's also an excellent question for bringing out some of the issues and confluences of digital communication theory, information theory and speech coding.
posted by forforf at 10:00 AM on March 15, 2010
Response by poster: Thanks everyone for the awesome answers! I learned more about communication and compression and had fun while reading through. :)
posted by Funmonkey1 at 10:04 PM on March 15, 2010
Would Morse Code count as two humans "speaking" to each other in a binary way?
My dad did Morse in the Navy (in the secret parts) and he can speak it pretty fast, as well as listen to it. Similar to the talk above about just using vowels, but it was "da" and "di".
Morse is binary with a slight encoding to improve the speed.
posted by smackfu at 8:54 AM on March 16, 2010
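Spoken Morse can be sketched directly; the code table below is standard International Morse for the four letters of our running example, with "di"/"dah" as the conventional spoken forms of dot and dash:

```python
# Speak a word as Morse syllables: "di" for dot, "dah" for dash.
MORSE = {"f": "..-.", "i": "..", "s": "...", "h": "...."}

def speak(word):
    return " / ".join(
        " ".join("di" if mark == "." else "dah" for mark in MORSE[ch])
        for ch in word
    )

print(speak("fish"))  # di di dah di / di di / di di di / di di di di
```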
This thread is closed to new comments.
posted by teraflop at 4:28 AM on March 15, 2010