Reality check, only ten years late
May 18, 2010 1:16 AM Subscribe

I dreamed up this hash function during a manic episode ten years ago. How good is it, really?

I'll call it manic512() for want of any other name. Its input[] parameter is an array of 32768 bits, and its result is an array of 512 bits.

Internal to manic512() is a pre-defined, read-only 32768 by 512 array of noise bits. Compatible implementations share the same noise[][] values, which should be derived from some physical equivalent of 16,777,216 fair coin tosses, not from any form of PRNG; there should be no algorithm at all capable of deriving any 512-bit element of noise[] from any of the others.

The output of manic512() is simply the XOR of all noise[i] for which input[i] is 1.

Could some kind mefite with actual crypto skills satisfy my now mostly idle curiosity about how collision-resistant this thing is?

It's clearly not very reversal-resistant for input[] values containing small numbers of 1-bits. This is absolutely unimportant for the application I originally thought of it for, but I would be interested to know whether something cleverer than brute force search and lookup would work to reverse manic512() once the number of 1-bits present in input[] is more than, say, fifty.

I'd also like to know who else thought of it before I did (which seems to be the way with all my best inventions) and what they called it.

posted by flabdablet to Computers & Internet (24 answers total) 6 users marked this as a favorite

there should be no algorithm at all capable of deriving any 512-bit element of noise[] from any of the others

On reconsideration, that's overambitious. What I'm trying to get at is that there should be no such algorithm that wouldn't be busted simply by generating a fresh set of noise[][] values.
posted by flabdablet at 1:39 AM on May 18, 2010

Why would the number of 1-bits matter?
posted by phrontist at 1:40 AM on May 18, 2010

For an input[] containing a single 1-bit somewhere amongst 32767 0-bits, manic512() will match one of the noise[] values and all you need to do is search noise[] to find which one.

For two 1-bits, you generate all the possible pairs of noise[] values (there will be 32768 * 32767 of those) and search them. For three, you search 32768 * 32767 * 32766 values and so on.

I pulled the figure of 50 bits out of my arse, but consideration reveals that once the search space grows to 2²⁵⁶ the thing will start generating collisions via the birthday paradox, effectively becoming irreversible. Since 2²⁵⁶ is roughly 32768¹⁷, we hit irreversibility at about 17 1-bits in input[] - unless somebody thinks of something cleverer than brute force search.
posted by flabdablet at 1:51 AM on May 18, 2010

I have a feeling that LtU is a better forum for this.
posted by jrockway at 2:47 AM on May 18, 2010

Seems like it would be easy to come up with a 32768 bit string that would collide with your hash, because all the bits are orthogonal.

Here's how I would try to solve it. Build a 2¹⁵x(2¹⁵ + 1) matrix M, where each row and each column represent one of the keys, and so each element represents a pair of keys. The last column represents an 'unpaired' key.

The idea is, a pair of keys would produce the output we want when XOR'd together.

So, if the first bit in our target T₀ is a zero, we would fill each element M_ij with a 1 if i₀ XOR j₀ = 0. (and if T₀ = 1 then, of course M_ij = 1 if i₀ XOR j₀ = 1).

Now we have a matrix with about half of the elements either one or zero. That's 2¹⁵⁺¹⁵/2 = 2²⁹ possible pairs to explore in the next phase. Of course, we can't go and build a 'pair of pairs' table with 2²⁹⁺²⁹/2 = 2⁵⁷ elements, and keep expanding it until we get 2^{(2⁵¹²)+15-512} (or so) elements, that's obviously not practical.

But I don't think we need too. We have a TON of candidates, 2²⁹. So we just use the first 2¹⁵ so we can re-use the same matrix, Except now each row and column represent a PAIR in the previous matrix, and obviously we have to keep track of those somehow. But anyway, on to the next bit T₁

We do the same thing, filling out the 'matching' pairs of pairs. Again, or matrix should be 15% full,

So we just loop 512 times, each time picking out a pair of pairs from our truncated space. Seems like we should be able to reverse the hash in 512 steps. Obviously we can't get the original string back, but we've got a string that matches it.

(I think)

This would work if you were only worried about random collisions, not intentional ones (so for a hashtable, it would be fine I think) But it's probably slower then MD5.
posted by delmoi at 3:13 AM on May 18, 2010

Again, or matrix should be 15% full,

Ugh, that should read our matrix should be 50% full
posted by delmoi at 3:14 AM on May 18, 2010

Oh, and how do you keep track of which pairs are which? Just keep a bunch of 2¹⁵ bit buffers. They start out with just a single bit set. Then, if it turns out that two of the buffers AND'd together will hash together will set the bit you are currently looking at to the correct value, you AND (Or XOR) them together. Then you only keep the first 2¹⁵ buffers, (or you could do even fewer, since that would take up a lot of memory) and then you loop again with the next bit. Eventually, you go through all the bits and have a bunch of candidate strings.

In fact, you could actually keep your list sorted, and use that to find the smallest (i.e. the most leading zeros) candidate string possible.
posted by delmoi at 3:29 AM on May 18, 2010

So, if the first bit in our target T0 is a zero, we would fill each element Mij with a 1 if i0 XOR j0 = 0. (and if T0 = 1 then, of course Mij = 1 if i0 XOR j0 = 1).

Sorry, I'm lost. Does i₀ refer to bit 0 of the index i itself, or bit 0 of the 512-bit value at noise[i]?
posted by flabdablet at 3:32 AM on May 18, 2010

bit 0 of the 512-bit value at noise[i]
posted by delmoi at 3:42 AM on May 18, 2010

We do the same thing, filling out the 'matching' pairs of pairs.

Sorry again; don't understand how this is accomplished. If I understand the first step correctly, we initially fill M_ij with bits that say whether or not noise[i][0] matches noise[j][0]. How are the M values derived on the second step? Are we working with noise[i][1] and noise[j][1] now, or what? How do the second-step i and j indexes relate to the original M array, to the bit string whose hash we're trying to duplicate, and to the bit string we're going to generate to do that?

The idea is, a pair of keys would produce the output we want when XOR'd together.

This is the part I don't really get. There are only 2³⁰ pairs of noise[] values. XORing those pairs will produce a list of 2³⁰ 512-bit values, each of which corresponds to the hash of some input[] containing 32766 zeroes and two ones.

2³⁰ is vastly smaller than the 2²⁵⁶ possibilities we'd need in order to see a collision via the birthday paradox, so provided only that manic512() doesn't generate horribly clumpy results (and I can't see why it should), isn't it overwhelmingly unlikely that we could find a pair of keys that produce a predefined output value when XOR'd together? Wouldn't we need to allow groups of up to 17 or 18 keys (see earlier calculation) to make that work?
posted by flabdablet at 4:33 AM on May 18, 2010

Hmm, I just figured out a problem with my method actually, which is that if two of the target sequences/sets we are building share a common bit, they can't actually be added together to create a new target sequence that matches more bits of the hash. However, I think that this problem can be avoided when picking new targets to keep at each step.

(Sorry if this is somewhat inscrutable :P) When i say "target sequences/sets we are building" I'm talking about the "pair of pairs of pairs of..." Basically a collection of sequences that when hashed will match the first N bits of the hash at step N. And at each step we expand them by ANDing them together, and getting rid of the ones that no longer match. The problem is that if the AND is not equal to the XOR they won't actually work)

Anyway, I think the problem can be resolved by making sure we pick elements that overlap as little as possible in each step. It does complicate things a little.
posted by delmoi at 4:38 AM on May 18, 2010

Alright let me try to explain this again. We actually don't need M, here. That was how I first visualized it, but we needed a way to keep track of which pairs of sets we were picking, and actually it turns out that the way we would keep track of those sets does the job of the matrix.

So let me try to explain the simplified algorithm. It basically uses a big list of 2¹⁵ bit strings. (that's 32768 bits, of course)

How big is the list? It doesn't matter, but probably more then 1000 maybe a million or so. Let's call that number Q.

When we initialize Q we the first entry is all zeros, and after that Q[i][i-1] = 1. So the second entry in Q (Q[1], since these are zero based) has bit #1 (Q[1][0]) set to 1 all others are zero.

Basically it looks like this
...0000000000000
...0000000000001
...0000000000010
...0000000000100

After the first 32786 entries, we don't worry about initializing them. Maybe set them to all ones. Anyway they'll be ignored in the first step, and filled in later steps (of course, that's not an issue if we use less then 32k)

So we have our initialized list.

Lets start with the first step, let's call the step number N, and right now we're on N = 0. We loop through i and j ∈ Q (it would take N² steps to do all of them, but we can use a heuristic to pick good i's and j's and make it run quicker if Q is large)

Now, what we do is check to see if hash((Q[i] AND Q[j]))[0] matches T[0] where is the target sequence we are trying to match. we know half will match So we have Q²/2 = 2^lg(Q+Q)-1. That is a LOT of matches, so we throw out almost all of them, and just keep Q. In fact, the heuristic we use should help us avoid computing most of them.

Now lets move on to step two.

Our Q now contains vectors that would mostly contain 2 bits. But, as I'll explain later, vectors with fewer bits are better. Since half the strings will match T[0] when hashed, we know that for all i > 0 hash((Q[i] AND Q[0]))[0] = T[0] about half the time. Those will have one bit set. So 50% of our strings in Q have just one bit set, and 50% of our strings now have two bits set.

Okay, so we do the same thing again. loop through all i and j ∈ Q. Again, half the strings, when hashed will match part of T, in step 2, they'll match the first two bits. We don't need all of them, so we want to pick 'good' ones for the next step.

Here's where the problem I mentioned above comes in. If we have two seequences like
...0000100001000 and
...0000000001010

ANDing them together will no longer give us the result we want because the result would give us noise[A] ^ noise[B] ^ noise[C], and what we actually know at this step is that both (noise[A] ^ noise[B]) and (noise[B] ^ noise[C]) would match the first N bits of the target.

--

So, here's where the importance of our heuristic comes in. We want to pick elements that are least likely to collide in the next round, and of course at each round the number of bits increases. If we were just picking randomly, the number of set bits would double at each step. But of course we want to pick the elements with the fewest active bits at once. We also want to pick ones that don't overlap the least, which should be possible. We always add the zero vector back in, because half of our candidates from the last round will match the next bit of the hash too. So at each step, half of our candidates keep their current number of bits.

This is the part I don't really get. There are only 2³⁰ pairs of noise[] values. XORing those pairs will produce a list of 2³⁰ 512-bit values, each of which corresponds to the hash of some input[] containing 32766 zeroes and two ones.

That's correct.

2³⁰ is vastly smaller than the 2²⁵⁶ possibilities we'd need in order to see a collision via the birthday paradox

We only need 3 pairs for a collision, because we are only comparing one bit at a time. Obviously, we want to keep more then three because we want a lot of variety at each step.
posted by delmoi at 5:20 AM on May 18, 2010

(er, 3 vectors would have at least one collision between them, but they could all not match. There would be a 1-1/2³ chance of a match and in general 1-1/2^Q chances of having at least one matching pair)
posted by delmoi at 5:31 AM on May 18, 2010

Well, I think that totally reverse-engineering it would require 32769 tries; I'm not sure anyone can do better than that.

Initialize X with 32768 random bits (to meet the requirements of your question)
Initialize O with hash(X)
Initialize Internal as a 32768 array of 512 bit entries
For each index I in 1...32768,
   Flipper = 1 left shift I
   X' := X XOR Flipper
   O' := hash(X')
   Internal[I] :=  O XOR O'
   X := X'

Now Internal stores the 32768-by-512 array internal to your hash function.

---------------

Regarding creating nontrivial collisions without much effort, I believe this can be done within 513 evaluations of your hash function. These 513 evaluations can be used to reconstruct 512 of your internal 512-bit array entries. We don't need all 32768 just to generate collisions, I think.

I don't have a proof for this and I have to go very soon so I will have to return to this later unless I am refuted in the interim, but it is my strong suspicion that there are not more than 512 "x-independent" columns of a 512-row binary matrix, where "x-independence" is what I'm calling the property that no row can be produced by XORing any two rows and no column can be produced by XORing any two columns.

If this is true, then knowing 512 512-bit entries in your hash's internals should be sufficient to generate collisions. More clarification later if it is desired- gotta go.
posted by a snickering nuthatch at 7:08 AM on May 18, 2010

I see. You have 32,768 random patterns, and the difficulty comes from determining the precise combination of inputs that, when XORed together, lead to a particular output.

Here's a basic preimage attack: Find n elements of the noise[] array that XOR to 0, where the identifier of the element matches a 0 bit in the original plaintext. Set those bits to 1 in the original plaintext, and you will have a new plaintext with the same hash as the original (since the contribution of the added bits will have been defined to have been 0).

The above can of course be inverted to set 1s back to 0.

Collisions are even easier in your scheme. Simply find *any* subsets among the random data that XOR to the same value, not necessarily 0. If noise[10]^noise[100] == noise[20] ^ noise[200], you're done. You're not constrained to two elements.

There's very likely much easier attacks, and I am indeed handwaving on the difficulty of finding these subcollisions, but I've gotta get some sleep :)
posted by effugas at 7:16 AM on May 18, 2010

Note that I'm assuming the 32,768 random patterns are public.
posted by effugas at 7:19 AM on May 18, 2010

Corrections to my post: the algorithm for extracting a 512-bit entry in your hash's internal state should include the assignment

O := O'

at the end of the loop. Broadly, we're saying "I'm just going to change one bit in the input; I'll find out what bits changed in the output as a result of my change; that tells me which bits correspond to the internal state that are used for the one bit that we changed".

Where I'm talking about "x-independence", the constraint is not that no row can be produced by XORing two rows; it's really that no combination of rows can be XORed to produce a row. That is, what effugas said, with the added conjecture that at most 512 entries of 512 bits are sufficient.
posted by a snickering nuthatch at 8:04 AM on May 18, 2010

Internal to manic512() is a pre-defined, read-only 32768 by 512 array of noise bits. Compatible implementations share the same noise[][] values

I'll assume that noise[][] is public and known. If it's meant to be secret, then it won't be for long since you can infer it by calling manic512() with all the weight-one inputs (i.e.: the 32,767 input vectors with exactly a single one).

If I've understood your description correctly, the j^th output bit of manic512() (for j=1,2,3,..,512) can be computed as follows:

output[j] = input[1]*noise[1][j] + input[2]*noise[2][j] + input[3]*noise[3][j] + ... + input[32767]*noise[32767][j]

where + is XOR and * is AND. The notation is chosen thusly because XOR and AND are simply addition and multiplication mod two. So, what you have is the output is simply a linear (mod 2) function of input[] and noise[][] -- essentially, 512 equations in 32767 unknowns. Given output[], you can use basic Gaussian elimination to express 512 bits of input[] as a linear function of the 32255 other bits of input[]. In other words, it's trivial to mount a preimage attack on manic512() and hence collision attacks are also trivial.

Aside from the fact that the manic512() is purely linear, the main indication that this function is not suitable as a hash is the poor avalanching. If noise[i][j] is zero then the j^th bit of the output is independent of the i^th bit of the input. Every bit of the output should be dependent on every bit of the input.
posted by mhum at 8:52 AM on May 18, 2010 [2 favorites]

32255 other bits of input[]

Sorry, make that 32256 other bits.
posted by mhum at 8:54 AM on May 18, 2010

... So, what you have is the output is simply a linear (mod 2) function of input[] and noise[][] -- essentially, 512 equations in 32767 unknowns. Given output[], you can use basic Gaussian elimination to express [any arbitrarily selected] 512 bits of input[] as a linear function of the 32255 other bits of input[].

Put whatever you like in the 32256 other input bits, fill in the selected 512 with the results of your Gaussian elimination ...aaaand, busted!

Thanks, mhum. Nice to be able to let go of another bit of useless tat, even if it is ten years too late.
posted by flabdablet at 7:49 PM on May 18, 2010

What would have been wrong with MD5 or SHA or some other common cryptographic hash function? (I realize MD5 is considered a little weak, but you could combine it with something else to make it more future proof against new techniques)
posted by delmoi at 11:27 PM on May 18, 2010

Nothing is badly wrong with any of the standard crypto hash functions, as far as I know; that's why they're standard. But mania is a funny thing. Everything invented while manic is instantly perceived as not only shining and perfect and therefore inherently superior to any conceivable alternative, but potentially world-shatteringly significant. The pleasure of manic invention is so extreme that it pretty much completely shuts out any chance that reality might intrude.

Having recovered from mania many years ago, I had pretty much forgotten about manic512() until reminded of it in the course of the discussion with jlind0 over in Projects. So I asked this question partly to satisfy my own curiosity (because even ten years later I could still feel some of the original delight associated with its invention) and partly to show him what an actual reality-checking conversation looks like.
posted by flabdablet at 2:34 AM on May 19, 2010

Flab,

If it's any consolation, ain't nothing harder -- *nothing* -- than generating a strong crypto function. It is truly the sine qua non of defensive engineering, which is already a big pain in the butt.
posted by effugas at 7:12 AM on May 21, 2010

I think being ten years down the line has put me safely past the point of requiring consolation, but thanks anyway for the kind thought.

And just to round this out, here's the Hash Function Lounge which appears to be a one-stop shop for information on what's available and what's still considered safe.
posted by flabdablet at 3:47 AM on May 22, 2010

« Older Know anything about mid-century California... | Good running programs? Newer »

This thread is closed to new comments.

Ask MetaFilter

Reality check, only ten years late
May 18, 2010 1:16 AM Subscribe

Tags

Share

Reality check, only ten years late May 18, 2010 1:16 AM Subscribe

Tags

Share

Reality check, only ten years late
May 18, 2010 1:16 AM Subscribe