information theory
June 5, 2005 5:41 PM
Any information theorists out there? How do I measure the information content/entropy of the output of a hash algorithm?
Right now what I'm doing is, for each bit (a) of the output, finding the probability that every other bit (b) is 1. Now in theory I should be able to calculate the information content of b given a. Then I just add all of those up and get the total information of all of the bits that are not a, given a.
Now is there a way to calculate the information value of the entire hash from this information?
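A minimal sketch of that pairwise measurement (Python), with a hypothetical `my_hash()` standing in for the hash under test; here it is truncated MD5 just so the snippet runs, and the 0.05 threshold is an arbitrary illustration:

```python
import hashlib
import os
from collections import Counter

N_BITS = 32          # assumed output width of the hash under test
N_SAMPLES = 100_000  # number of random inputs to hash

def my_hash(data: bytes) -> int:
    """Hypothetical stand-in for the hash under test: low 32 bits of MD5."""
    return int.from_bytes(hashlib.md5(data).digest()[:4], "big")

def bit(value: int, i: int) -> int:
    return (value >> i) & 1

count_a = Counter()   # how often bit a was 1
count_ab = Counter()  # how often bits a and b were both 1
for _ in range(N_SAMPLES):
    h = my_hash(os.urandom(16))
    ones = [i for i in range(N_BITS) if bit(h, i)]
    for a in ones:
        count_a[a] += 1
        for b in ones:
            count_ab[(a, b)] += 1

# p(bit_b = 1 | bit_a = 1); for a good hash every value should sit near 0.5
for a in range(N_BITS):
    for b in range(N_BITS):
        if a != b and count_a[a]:
            p = count_ab[(a, b)] / count_a[a]
            if abs(p - 0.5) > 0.05:
                print(f"suspicious: p(bit{b}=1 | bit{a}=1) = {p:.3f}")
```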
(because a perfect hash is as dense as possible, right?)
sorry if this doesn't help.
posted by andrew cooke at 5:58 PM on June 5, 2005
oh! are you talking about hash like hash table? i'm thinking of crypto hashes.... duh!
posted by andrew cooke at 5:59 PM on June 5, 2005
Response by poster: I'm wondering how close an actual algorithm is to theoretical perfection. Here at work we have this goofy little hash algorithm that sucks, and I'm trying to figure out exactly how bad it sucks, basically. (And this is for a hashtable application; we're not worried about how hard it is to reverse the hash.)
When I generate data using an md5 hash of random input data, I get very close to a 0.5 probability for each bit, so it's clear that md5 is working "very well" and no further tests are needed there.
(In other words, given n bits of output, we get n*n probability values, p(bit_a | bit_b) for all a and b, and all of those are about 0.5.)
On the other hand, when I tried this other hash algorithm, those numbers were a bit off, like between 0.39 and 0.57.
---
So, the way I figure it: if you have 32 bits of md5 hash, you've got 32 bits of "information", meaning that the probability of a collision is as low as possible, 1/2^32.
On the other hand, how many bits of "information" do I have with the output of the other hash algorithm? Maybe I have something like 29 bits of information, meaning the probability of a collision is more like 1/2^29.
posted by delmoi at 6:21 PM on June 5, 2005
"Information content" is a poorly defined term that means different things to different people. "Entropy" is somewhat better defined, but can still be interpreted a number of ways (1st order entropy? Conditional entropy?). And, as andrew cooke is asking, what do you mean by hash?
Your problem needs to be stated more clearly. Preferably with some mathematics.
posted by Galvatron at 6:21 PM on June 5, 2005
Response by poster: Huh. Now that I think about it, if there are n bits in a hash, there are 2^n possible output values (obviously).
The way to calculate the amount of information would be to sum up, over each distinct output value, the probability of that value times the log base two of one over that probability (i.e. sum(i = 0 -> 2^n, p(i) * log2(1/p(i)))). Is there a way to arrive at that sum without knowing the exact probability of each distinct output (which would require gigs of test data)?
Now I’m starting to think that it isn’t possible unless we know each bit is statistically independent from each other bit, which I doubt is true.
Anyway, if anyone could enlighten me, that would be helpful.
posted by delmoi at 6:37 PM on June 5, 2005
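That sum is the plain "plug-in" entropy estimate, which can be computed directly from observed outputs whenever the sample is large enough to cover the output space. A sketch (the helper name is mine, and the estimate is biased low when samples are scarce):

```python
import math
import random
from collections import Counter

def empirical_entropy(outputs):
    """Plug-in estimate of sum_i p(i) * log2(1/p(i)) over the observed output values."""
    counts = Counter(outputs)
    total = len(outputs)
    return sum((c / total) * math.log2(total / c) for c in counts.values())

# Sanity check: a fair 8-bit value should measure close to 8 bits of entropy.
samples = [random.getrandbits(8) for _ in range(1_000_000)]
print(empirical_entropy(samples))  # close to 8.0
```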
Response by poster: Yes, and by "information content" I do in fact mean entropy.
posted by delmoi at 6:42 PM on June 5, 2005
Best answer: OK, I think I follow you now.
If n is sufficiently large, you're going to have a hard time getting a good estimate--as you say, gigs of test data. But you could maybe get a bound on the entropy, sufficient to at least demonstrate that the hash sucks. You can use the chain rule for entropy:
H(b1, b2, b3, ... bn) = H(b1) + H(b2, b3, ... bn | b1)
H(b1, b2, b3, ... bn) = H(b1) + H(b2 | b1) + H(b3, b4, ... bn | b1, b2)
H(b1, b2, b3, ... bn) = H(b1) + H(b2 | b1) + H(b3 | b1, b2) + H(b4, b5, ... bn | b1, b2, b3)
...
You can condition on any of the bits, I just chose b1 and b2 for convenience. Now H(b1) can be easily computed, and H(b2 | b1) can be computed just as you did above, and you could probably go and compute 3rd or 4th order terms without too much difficulty. The remaining term, H(bi, ... bn | b1, b2, ..., bi-1), is the difficult one. You could estimate that by its maximum (i.e. assume independent equiprobable bits), and you'd end up with an upper bound on the entropy. If the upper bound is small relative to a good hash function then you have proven that your hash function sucks. The upper bound will get tighter if you can compute more terms of the chain rule expansion.
You can get n or more different upper-bounds by conditioning on different bits; choose the smallest upper-bound.
posted by Galvatron at 7:02 PM on June 5, 2005
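A sketch of that bound, conditioning (arbitrarily) on the k lowest-order bits: the first k chain-rule terms are estimated from data, and the unestimated tail is replaced by its maximum of n - k bits. Function names and the default k are illustrative only.

```python
import math
from collections import Counter

def entropy_from_counts(counts):
    total = sum(counts.values())
    return sum((c / total) * math.log2(total / c) for c in counts.values())

def chain_rule_upper_bound(hashes, n_bits, k=3):
    """Upper bound on H(b1..bn): H(b1) + H(b2|b1) + ... + H(bk|b1..bk-1) + (n_bits - k)."""
    bound = 0.0
    for i in range(k):
        joint = Counter(h & ((1 << (i + 1)) - 1) for h in hashes)  # first i+1 bits
        prefix = Counter(h & ((1 << i) - 1) for h in hashes)       # first i bits
        # H(b_{i+1} | b1..bi) = H(b1..b_{i+1}) - H(b1..bi)
        bound += entropy_from_counts(joint) - entropy_from_counts(prefix)
    return bound + (n_bits - k)  # tail estimated by its maximum: independent, fair bits
```

Conditioning on a different subset of bits (or permuting the bit order first) gives the family of bounds mentioned at the end of the comment, and the smallest one wins.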
Best answer: that looks right.
i don't think the following is particularly rigorous, but it might be useful in practice. what you're trying to do is measure the "quality" of the hash, and so you're choosing the entropy of the hashed values to do so. which, to my limited knowledge, is correct, but expensive. but you could also look at the entropy of each bit.
on the one hand, that's naughty, because there's nothing to say that your bits are independent (except that if it's a perfect hash they should be - which might be good enough justification, actually, since that is the null hypothesis). on the other hand, it's more likely to detect problems, since in practice you will have "sticky" bits. and it is easy to measure.
so i'd suggest running a bunch of random data through your hash and measuring the probability for each bit. once you've done that, p log(p) for a particular bit tells you how good that bit is. and the sum tells you something about the hash as a whole. if we're very very lucky, it's the same as what you wanted, but i doubt it...
oh, on preview, i think i'm just saying a stupid version of galvatron. i'm tempted to delete this, but i guess maybe he(?) can explain how you get from what he said to what i said (am i assuming H(bi|bj) is zero?) and maybe i'll learn something...
posted by andrew cooke at 7:14 PM on June 5, 2005
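andrew cooke's per-bit shortcut, sketched out (using the full binary entropy -p·log2(p) - (1-p)·log2(1-p) for each bit rather than just p·log(p); the names are mine):

```python
import math

def binary_entropy(p):
    """Entropy, in bits, of a single 0/1 variable with P(1) = p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def per_bit_entropy_sum(hashes, n_bits):
    """Sum of individual bit entropies; equals the joint entropy only if the bits are independent."""
    total = 0.0
    for i in range(n_bits):
        p_one = sum((h >> i) & 1 for h in hashes) / len(hashes)
        total += binary_entropy(p_one)
    return total  # at most n_bits; a good hash should land very close to n_bits
```

As it happens this per-bit sum is itself an upper bound on the joint entropy (entropy is subadditive), which is roughly the connection Galvatron draws next.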
If all the bits are independent, then H(b1, b2, ... bn) = H(b1) + H(b2 | b1) + H(b3 | b1, b2) + ... + H(bn | b1, b2, ..., bn-1) = H(b1) + H(b2) + H(b3) + ... + H(bn). So there's your connection, andrew. But if the problem is related to "sticky bits" then the independence assumption isn't going to get you anywhere. The upper bound I proposed is capable of capturing some of those correlation effects, as long as only small numbers of bits are highly correlated.
posted by Galvatron at 7:27 PM on June 5, 2005
thanks!
("sticky" bits can still be independent, i think, but i see your wider point).
posted by andrew cooke at 7:41 PM on June 5, 2005
("sticky" bits can still be independent, i think, but i see your wider point).
posted by andrew cooke at 7:41 PM on June 5, 2005
Yeah, I assumed by "sticky bits" you meant "bits that are stuck together," i.e. correlated. But now that I read you again it sounds like you had something else in mind.
posted by Galvatron at 7:48 PM on June 5, 2005
Response by poster: Galvatron: Thanks a lot, I didn't know that there was a chain rule for entropy. I can probably just write my program to compute a tighter and tighter bound until I get bored :P
Just to clarify the notation: "H(a, b, c | x, y, z)" means the same as "H(a,b,c | x y z)" and the same as "H(a,b,c | x = 1 AND y = 1 AND z = 1)", or "the entropy of bits a, b, and c given that bits x, y and z are set to one". Or does it mean something else? I've always been confused by the whole "given (comma separated list)" notation.
--
andrew cooke: what Galvatron did was explain a way that you can reduce the problem from assuming that all the bits are independent to assuming that only some of the bits are independent. The fewer bits you assume are independent, the more accurate your result, but the longer it takes to get that result.
Basically what you do (and hopefully Galvatron can confirm this :) is find the entropy of one particular bit, b1, giving H(b1), then do what you suggest to find the entropy of the rest of the bits when b1 is turned on, which is the same as H(b2 ... bn | b1).
So then you add those two values together and get a slightly more accurate answer.
Then you can take it another step and find H(b1), then H(b2 | b1), and then, after you've found those two values, assume independence on the last n - 2 bits given that b1 and b2 are turned on. Then add all those values together and you have an even closer approximation.
You can keep doing that, but each step requires twice as much sample data.
---
After I posted my last comment I rode my bike around for a while and came up with a next step: trying to figure out which bits are dependent. After all, you could assume pretty safely, I think, that if two bits appear to be independent (i.e. if p(a | b) = p(a | !b) and p(b | a) = p(b | !a)) then that would, um, simplify things.
Combining this with the chain rule, we order the ‘most independent’ bits on the end and improve (by an unmeasurable amount) the accuracy of our estimate.
hmm. Math is fun :)
posted by delmoi at 7:56 PM on June 5, 2005
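A sketch of the pairwise independence screen delmoi describes, flagging pairs where p(a=1 | b=1) and p(a=1 | b=0) differ noticeably (the tolerance and the names are arbitrary choices):

```python
def dependent_bit_pairs(hashes, n_bits, tolerance=0.02):
    """Flag ordered pairs (a, b) where P(bit a = 1) seems to depend on bit b."""
    rows = [[(h >> i) & 1 for i in range(n_bits)] for h in hashes]
    flagged = []
    for a in range(n_bits):
        for b in range(n_bits):
            if a == b:
                continue
            a_when_b1 = [r[a] for r in rows if r[b] == 1]
            a_when_b0 = [r[a] for r in rows if r[b] == 0]
            if a_when_b1 and a_when_b0:
                p1 = sum(a_when_b1) / len(a_when_b1)  # p(a=1 | b=1)
                p0 = sum(a_when_b0) / len(a_when_b0)  # p(a=1 | b=0)
                if abs(p1 - p0) > tolerance:
                    flagged.append((a, b, p1, p0))
    return flagged
```

Bits that never get flagged are natural candidates for the tail of the chain-rule expansion.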
Response by poster: oops. At least I didn't leave one of my <sub> tags open this time :P
posted by delmoi at 7:57 PM on June 5, 2005
sorry - i just meant they don't change so easily ("stuck at 1").
on preview (my god, the attack of the bold tag! :o) - yes (and the optimisation idea is a good one).
posted by andrew cooke at 8:01 PM on June 5, 2005
Response by poster: wish I had spent a little time proofing that post before I posted it, there are several mistakes. Oh well.
posted by delmoi at 8:04 PM on June 5, 2005
what i was trying to get at with my "null hypothesis" comment was that presumably there's some known distribution of the measured entropy for a perfect hash over a finite stream. so that, the longer the stream you feed in, the closer you get to the asymptotic value.
what delmoi is trying to do is prove that his hash is not perfect, and get some measure of how bad it is.
the normal way to do this is to assume that things are "perfect" (the null hypothesis), and then show that, under that assumption, the results are unbelievable. and how unbelievable they are is a measure of how bad the situation is (roughly).
so in this case, the null hypothesis is that the hash function is perfect. in which case we can add the bits up as if they are independent. and then we can compare the difference between what we get and what the perfect value should be, and see how likely that would be, given the known distribution i referred to earlier. which would give you a measure of how bad the hash is.
now at some point that seems flakey. aren't you getting something for nothing? the flakey bit is that you only have a conservative estimate - it may be much worse than you actually measured. but in most cases, that is acceptable.
does that make sense? if so, what's the distribution?
posted by andrew cooke at 8:10 PM on June 5, 2005
(you see, without some kind of distribution, just measuring this doesn't tell you much).
posted by andrew cooke at 8:14 PM on June 5, 2005
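One practical way to get the distribution andrew cooke is asking about is to simulate it under the null hypothesis: run the very same estimator on genuinely random n-bit values (same sample size) many times and look at the spread. A sketch, with the hypothetical per_bit_entropy_sum above standing in for whichever estimator is actually used:

```python
import random

def null_distribution(estimator, n_bits, n_samples, n_trials=200):
    """Monte Carlo distribution of an entropy estimator under the 'perfect hash' null hypothesis."""
    results = []
    for _ in range(n_trials):
        ideal = [random.getrandbits(n_bits) for _ in range(n_samples)]
        results.append(estimator(ideal, n_bits))
    return results

# If the measured value for the real hash falls far below this spread,
# the null hypothesis (a perfect hash) can be rejected.
# spread = null_distribution(per_bit_entropy_sum, 32, 100_000)
```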
Best answer: Just to clarify the notation: "H(a, b, c | x, y, z)" means the same as "H(a,b,c | x y z)" and the same "H(a,b,c | x = 1 AND y = 1 AND z = 1)". or "the entropy of bits a, b, and c given that bits x y and z are set to one". Or does it mean something else?
H(X1, X2 | X3, X4) means \sum_{x3, x4} p(x3, x4) H(X1, X2 | X3=x3, X4=x4), where the summation is taken over all values of the string x3x4; x3x4 = "11" is just one of four possibilities. In words, "the average entropy of bits X1 and X2 conditioned on bits X3 and X4." The commas are not important, as long as it's clear that (A B) and (A, B) both represent a variable that takes on ordered tuple values.
Combining this with the chain rule, we order the ‘most independent’ bits on the end and improve (by an unmeasurable amount) the accuracy of our estimate.
Yes, conditioning on the most highly correlated bits will significantly improve the bound.
posted by Galvatron at 8:33 PM on June 5, 2005
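Spelled out as a computation over samples, that definition of conditional entropy might be estimated like this (bit positions and names are illustrative):

```python
import math
from collections import Counter, defaultdict

def conditional_entropy(hashes, target_bits, given_bits):
    """Estimate H(targets | givens) = sum over conditioning values of p(value) * H(targets | value)."""
    groups = defaultdict(list)
    for h in hashes:
        cond = tuple((h >> i) & 1 for i in given_bits)
        tgt = tuple((h >> i) & 1 for i in target_bits)
        groups[cond].append(tgt)
    total = len(hashes)
    result = 0.0
    for targets in groups.values():
        counts = Counter(targets)
        n = len(targets)
        h_given_cond = sum((c / n) * math.log2(n / c) for c in counts.values())
        result += (n / total) * h_given_cond
    return result

# H(X1, X2 | X3, X4), numbering bits from 0:
# conditional_entropy(hashes, target_bits=(0, 1), given_bits=(2, 3))
```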
so in this case, the null hypothesis is that the hash function is perfect. in which case we can add the bits up as if they are independent. and then we can compare the difference between what we get and what the perfect value should be, and see how likely that would be, given the known distribution i referred to earlier. which would give you a measure of how bad the hash is.
Well, if everything's independent and equally probable (which I guess is what you want from an ideal hash), the joint entropy is H(b1, b2, ... bn) = H(b1) + ... + H(bn) = n H(b1) = n. So the question is, how far is the true entropy from that value.
Then again, I have no idea whether any real hash comes close to maximizing entropy.
posted by Galvatron at 8:44 PM on June 5, 2005
Response by poster: Andrew: I'm not quite sure what you're getting at.
What I'm doing is generating hundreds of thousands of distinct chunks of data and then hashing them, resulting in hundreds of thousands of data items generated by the hash function.
Actually you could think of them as just being randomly generated bit strings and imagine that we are just trying to come up with a good random number generator.
What I want to do is figure out the probability of a collision. With a 'perfect' hash n bits long, the probability of a collision where hash(x) = hash(y) is 1/2^n. But what about an imperfect hash? Well, what I'm assuming is that we can replace n, the actual number of bits of output, with the number of bits of entropy. So if we have 29.9435 bits of entropy, the probability of a collision is 1/2^29.9435.
Think of the simplest degenerate case: all of the bits are truly independent and each has a 0.5 probability of being turned on, except for the first one, which is always turned on. We have one useless bit, and so we have 31 bits of information for our 32 bits of data.
Does that answer your question?
posted by delmoi at 8:44 PM on June 5, 2005
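delmoi's degenerate case, checked numerically: 31 fair independent bits plus one bit stuck at 1 should measure about 31 bits of entropy (a throwaway sketch, not tied to any real hash):

```python
import math
import random

N_BITS, N_SAMPLES = 32, 200_000

# 31 fair random bits, with the lowest bit always set to 1.
stuck = [(random.getrandbits(N_BITS - 1) << 1) | 1 for _ in range(N_SAMPLES)]

total = 0.0
for i in range(N_BITS):
    p = sum((h >> i) & 1 for h in stuck) / N_SAMPLES
    if 0.0 < p < 1.0:
        total += -p * math.log2(p) - (1 - p) * math.log2(1 - p)

print(total)  # close to 31: the stuck bit contributes nothing
```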
it doesn't matter really. i was guessing at what you were trying to do - i assumed that you wanted to go to someone and say "this hash is no good". how do you prove it is no good? you're using an approximation to measure the entropy - that's going to result in a certain amount of noise in the value. so how do you know whether you got a low entropy value because it was a bad hash, or because you were just unlucky in your (approximate) measurement? that was the question i was addressing, but i don't think it's the question you care about (perhaps because it's so bad it's obvious).
posted by andrew cooke at 8:54 PM on June 5, 2005
Response by poster: Then again, I have no idea whether any real hash comes close to maximizing entropy.
Well, from my experimentation md5 is as good at creating random-looking strings as the binary strings generated by Java's built-in "cryptographically secure" random number generator.
This is feeding it random strings of ASCII digits, as well as integers such that the first hash is the hash of the binary representation of '0', the second hash is the hash of the binary representation of '1', and so on.
posted by delmoi at 8:57 PM on June 5, 2005
Response by poster: H(X1, X2 | X3, X4) means \sum_{x3, x4} p(x3, x4) H(X1, X2 | X3=x3, X4=x4), where the summation is taken over all values of the string x3x4; x3x4 = "11" is just one of four possibilities.
Ah. That makes sense. Thanks.
posted by delmoi at 9:01 PM on June 5, 2005
the standard secure java random number generator implementation uses a hash, iirc. crypto hashes are pretty much perfect.
posted by andrew cooke at 9:02 PM on June 5, 2005
Since a hash function in counter mode is essentially a pseudo-random number generator (PRNG), you might be interested to know that the U.S. National Institute of Standards and Technology (NIST) Computer Security Division has developed a battery of statistical tests to evaluate PRNGs. See NIST Special Publication 800-22, "A Statistical Test Suite for the Validation of Random Number Generators and Pseudo Random Number Generators for Cryptographic Applications." The publication, as well as code and documentation for the test suite, are available online. I realize that this isn't really what you are looking for but I figured it might be of related interest to those reading this thread.
posted by RichardP at 9:57 PM on June 5, 2005
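For a flavor of what that suite checks, the first test in SP 800-22 is the frequency (monobit) test; a rough paraphrase (not the NIST reference code) looks like this:

```python
import math

def monobit_p_value(bits):
    """Frequency (monobit) test: p-value for the hypothesis that 0s and 1s are equally likely."""
    n = len(bits)
    s = sum(1 if b else -1 for b in bits)  # map 0/1 to -1/+1 and sum
    s_obs = abs(s) / math.sqrt(n)
    return math.erfc(s_obs / math.sqrt(2))

# Typical usage: concatenate the hash outputs into a bit stream and test it.
# A p-value below about 0.01 is treated as evidence of non-randomness.
# bits = [int(c) for h in hashes for c in format(h, "032b")]
# print(monobit_p_value(bits))
```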
This thread is closed to new comments.