Name that algorithm
May 26, 2011 5:33 AM

Do you know of a hashing or encryption function that produces results of varying length depending on the frequency of the input?

This is part of a wider problem, in which I'm trying to reduce unwieldy, long character identifiers to shorter, more humane names. (This is just background - there are many constraints that may form a subsequent ask.metafilter question.) While pondering this problem, I've been nagged by a dim memory of a hash function that reduced input to a value that varied depending on the frequency of the input. That is, common values might produce the numbers 1, 12, 25, while uncommon inputs might hash to 23456, 3456789, etc. Such a solution would be perfect for my problem, giving the "frequent few" easy, short ids.

Does this strike a chord with anyone? Alternatively, are there keywords I could use to research this further? My Google-fu has so far produced nothing.
posted by outlier to Computers & Internet (14 answers total) 2 users marked this as a favorite
 
Best answer: You could do some kind of Huffman coding on the symbol table. Plenty of sample implementations out there.
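
Something like this, for instance - a quick, untested Python sketch with made-up identifiers and frequencies, just to show the shape of it:

    import heapq
    from itertools import count

    def huffman_codes(freqs):
        """Build a prefix-free code from a {symbol: frequency} table;
        more frequent symbols end up with shorter codewords."""
        tiebreak = count()  # keeps heap entries comparable when frequencies tie
        heap = [(f, next(tiebreak), {sym: ""}) for sym, f in freqs.items()]
        heapq.heapify(heap)
        while len(heap) > 1:
            f1, _, codes1 = heapq.heappop(heap)
            f2, _, codes2 = heapq.heappop(heap)
            # Prepend a bit to every codeword in each of the two merged subtrees.
            merged = {s: "0" + c for s, c in codes1.items()}
            merged.update({s: "1" + c for s, c in codes2.items()})
            heapq.heappush(heap, (f1 + f2, next(tiebreak), merged))
        return heap[0][2]

    # Hypothetical identifiers and counts:
    freqs = {"someVeryLongIdentifierA": 900, "someVeryLongIdentifierB": 400,
             "rareIdentifierC": 20, "rareIdentifierD": 5}
    print(huffman_codes(freqs))  # the frequent ones get 1-2 bits, the rare ones more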

It sounds like you don't really want a hash, because you need guaranteed uniqueness: the potential for hash collisions on program character identifiers sounds like a bad idea!
posted by pharm at 5:40 AM on May 26, 2011


In order to know the frequency of a given input, your hash function would need to have seen the universe of inputs prior to generating any hashes. Does that work for your problem?
posted by TruncatedTiller at 6:15 AM on May 26, 2011


Well, a Huffman coding is sort of what you want, but that's a lossless compression algorithm and it would give results that are way longer than what I think you want.

It would also be reversible; I don't know if you want that.

If you're using this to encode relatively short strings like identifiers (short relative to complete files), then the Huffman code table will not be any shorter than, or perform any better than, a lookup table with arbitrarily assigned IDs.

You'd have to generate the Huffman table once from your universal data set and then distribute it to each node that needs to understand it, so there doesn't seem to be a reason why you couldn't just assign arbitrary IDs with varying lengths to your identifiers (in descending order of frequency).
posted by atrazine at 6:48 AM on May 26, 2011


Response by poster: Thanks to all. Some clarifying details, with the caveat that the original problem I'm trying to solve is (1) fairly tangled and (2) currently not well formed.

It sounds like you don't really want a hash, because you need guaranteed uniqueness

Well, maybe. Guaranteed uniqueness would be great, but "mostly unique" / "unique for most common cases" / "unique for a group of related cases" might be an acceptable tradeoff for short friendly names.

Your hash function would need to have seen the universe of inputs prior to generating any hashes. Does that work for your problem?

Yes. The set of inputs will change slightly (i.e. be added to), but mostly we'll be working with the pre-existing set. So poor performance for the rare new input is acceptable.

If you're using this to encode relatively short strings like identifiers ...

24 characters. Currently. And arbitrary IDs are off the table due to other constraints - we need an algorithmic, non-centralized solution. Anyway, I'm off to read up on Huffman coding ...
posted by outlier at 8:16 AM on May 26, 2011


If by "algorithmic, non-centralized solution" you mean that each node should be able to generate the correct short name just from the 24-character identifier, then Huffman codes aren't going to work. Huffman coding is a way to guarantee some desirable properties in the final lookup table, but it's really just a special way to assign arbitrary IDs.
posted by d. z. wang at 8:36 AM on May 26, 2011


This is really, really from left field, and I think incorrect, but does the Levenshtein distance help you? It's used to determine how many changes are required to transform one string into another, where a transformation is any of insertion, deletion, or substitution.

That way the Levenshtein distance between a common string and an input, if small, would indicate a "common" input, and a large distance would indicate a "novel" input.

Along similar lines, and even less likely to be useful, consider n-gram matching. It's an interesting, context-free manner in which to determine if two large bodies of text are similar.
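
For reference, the textbook dynamic-programming version of Levenshtein distance is short enough to paste here (plain Python, nothing specific to your identifiers):

    def levenshtein(a, b):
        """Count the single-character insertions, deletions, and
        substitutions needed to turn string a into string b."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                # deletion
                                curr[j - 1] + 1,            # insertion
                                prev[j - 1] + (ca != cb)))  # substitution
            prev = curr
        return prev[-1]

    print(levenshtein("kitten", "sitting"))  # 3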
posted by asymptotic at 8:37 AM on May 26, 2011


Also, could you be more specific about your requirements? "Shorter" and "more humane" can mean a number of different things depending on your project. Shorter could mean fewer characters, or fewer bytes (in the case of multibyte character sets), or fewer syllables when pronounced by a speaker of some language. More humane could mean shorter, or only printable characters, or pronounceable (which may require increasing the length to insert things like vowels).

So if it turns out you're looking for a short name that someone can easily read out over a telephone, then proquints might be up your alley.
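
If that appeals, the encoding itself is tiny. This is my reading of the proquint proposal (the consonant and vowel alphabets come from the spec; double-check against the reference implementation before relying on it):

    CONSONANTS = "bdfghjklmnprstvz"  # 16 consonants -> 4 bits each
    VOWELS = "aiou"                  # 4 vowels -> 2 bits each

    def proquint16(word):
        """Encode one 16-bit integer as a pronounceable five-letter 'proquint'."""
        return (CONSONANTS[(word >> 12) & 0xF] +
                VOWELS[(word >> 10) & 0x3] +
                CONSONANTS[(word >> 6) & 0xF] +
                VOWELS[(word >> 4) & 0x3] +
                CONSONANTS[word & 0xF])

    def proquint32(n):
        """Encode a 32-bit integer as two proquints joined by a dash."""
        return proquint16((n >> 16) & 0xFFFF) + "-" + proquint16(n & 0xFFFF)

    print(proquint32(0x7F000001))  # "lusab-babad", the proposal's 127.0.0.1 example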
posted by d. z. wang at 8:43 AM on May 26, 2011


I think you need two algorithms here: one for hashing, one for shortening or "humanifying" the result.

I would take a look at Google's hash function CityHash. It is not a cryptographic hash but a hash developed for hashing long strings. Then shorten the hash with Huffman coding (I don't think that will work for you as expected; for short values you can actually end up with a longer result than the value you started with) or with proquints, which look interesting.
posted by Ad hominem at 9:04 AM on May 26, 2011


Doh, I typed that in a hurry. Please don't let that put you off checking out Google's hash function.
posted by Ad hominem at 9:19 AM on May 26, 2011


Currently. And arbitrary IDs are off the table due to other constraints - we need an algorithmic, non-centralized solution.

The problem is that you'd have to precompute the Huffman coding table based on your knowledge of the statistics of the universe of possible inputs and then distribute that coding table to each node. So you're already distributing a key:value store, and you might as well distribute a fully arbitrary one or not do it at all.

Can you use any insight into the internal structure of the identifiers? If the 24-character string has any predictable properties, you can potentially build a very short permanent dictionary that is pre-distributed to all the nodes and combine that with a hash on the unpredictable part. This might have the effect of shortening many of your IDs, though it is not as conceptually elegant as a fully independent hash calculation at each node.
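
As a rough illustration of that split - the prefix table, the identifier layout, and the choice of SHA-1 below are all invented for the example, not anything from your actual problem:

    import hashlib

    # Hypothetical pre-distributed table for the predictable part of the ID.
    PREFIX_TABLE = {"EUR-PLANT": "a", "EUR-ANIMAL": "b", "ASIA-PLANT": "c"}

    def short_name(identifier):
        """Short code = tiny dictionary lookup on the predictable prefix,
        plus a few hex digits hashed from the unpredictable remainder."""
        prefix, _, rest = identifier.partition(":")
        suffix = hashlib.sha1(rest.encode("utf-8")).hexdigest()[:4]
        return PREFIX_TABLE.get(prefix, "?") + suffix

    print(short_name("EUR-PLANT:XK42-QR88-ZL19"))  # 'a' plus four hex characters

Collisions are possible in the hashed part, of course, which may or may not be acceptable given your "mostly unique" comment above.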

I think you might be up against a fundamental limit here because your algorithm:
1) Needs to depend on the frequency distribution of a parent population of IDs (because otherwise there is no way to make the most common entries consistently map to shorter IDs), and
2) Needs to be an algorithmic hash that can be run by decentralised nodes (if I understand correctly) without any shared state like an arbitrary dictionary.

These two requirements seem incompatible to me: the first implies shared state, which is exactly what the second rules out. Maybe someone with a more comprehensive CS education could correct me, though; maybe there is a way to do this.
posted by atrazine at 11:35 AM on May 26, 2011 [1 favorite]


You are working with 24 chars, so I don't think you need to hash anything (I should have read your follow-up).

If these are random IDs or something, then atrazine is correct: you will need a shared frequency table for Huffman coding. If it is English or another natural language, you can use a common frequency table based on the inherent letter-frequency bias of natural language.
posted by Ad hominem at 2:49 PM on May 26, 2011


Does it have to be re-creatable later? Seems like it wouldn't be that hard to write a little perl script (for someone who knew what they were doing, i.e., not me) that:

Read an input line
Checked if it matched one it already read
Incremented the count of that particular string

Until there was no more input, then:
resorted the list by frequency, most frequent first
numbered the lines monotonically
printed that list out as your key.

(Even better, save the numbered list as a key/value hash table so you could automate lookups later.)

Then, your line numbers would be your key, with "1" being your string for the most frequent input. 1 is pretty short, ya?
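
In Python, say, that whole recipe is only a few lines (untested sketch; it reads identifiers one per line from standard input):

    import sys
    from collections import Counter

    # Read the input lines and count how often each distinct identifier appears.
    counts = Counter(line.strip() for line in sys.stdin if line.strip())

    # Number them monotonically, most frequent first: 1 = the most common input.
    key = {ident: rank for rank, (ident, _) in enumerate(counts.most_common(), 1)}

    # Print the numbered list out as your key (or save the dict for lookups later).
    for ident, rank in key.items():
        print(rank, ident)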

Or is that not what you're trying to do at all?
posted by ctmf at 4:33 PM on May 26, 2011


Response by poster: Thanks for all the feedback. As predicted by some, Huffman coding is not the way to go - it's solving a different problem and is a symbol-by-symbol solution. Still, useful to know for later work.

My initial interest - as indicated by the title - was trying to recall the name of the algorithm. I'm reluctant to go too much into the details of the wider problem, because it's currently very tangled, not well formed, and I'm trying to tease out the details. But here's the broad outline: we have a very large (10K+) group of FOO, each of which can be identified by 24 numbers (24 qualities measured by 7 integers). However it's cumbersome to refer to them as "1-12-4-7-8-2 ..." and a (say) 7-character name would be very useful. There are lots of complications and constraints, but (1) the data is very clustered and (2) some non-uniqueness might be OK.

This is really, really from left field, and I think incorrect, but does the Levenshtein distance help you?

Something like that might be helpful as a way of showing divergence from a set of central, named cases. Interesting.

I would take a look at Google's hash function CityHash.

A cool tool. Thanks.

These two requirements seem incompatible ...

Well, yes. I'm beginning to think I have an over-constrained problem here. Time for more pondering.

Again, thanks all.
posted by outlier at 1:10 AM on May 27, 2011


This could be entirely off-base, but when you say "very clustered" and "non-unique", that screams dimension reduction to me.

So, your original IDs are linear combinations over a 24-element basis B. You could do a principal-component analysis (this can be as simple as pulling the singular value decomposition out of your favorite linear algebra / scientific computing library) to find a new 24-element basis B'. It's also possible to represent each ID uniquely as a linear combination over B', but with the additional property that most of the variation occurs in the first few elements. So you can then truncate B' to produce a new basis B'' which has fewer elements, and construct some sort of short name by projecting each ID down into B''.
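
A hedged sketch of that with NumPy - the data below is a random stand-in for the real 24-number IDs, and keeping 7 components is an arbitrary choice for the example:

    import numpy as np

    rng = np.random.default_rng(0)
    ids = rng.integers(1, 13, size=(10000, 24)).astype(float)  # stand-in data

    # Principal-component analysis via the SVD of the centered data matrix.
    centered = ids - ids.mean(axis=0)
    U, s, Vt = np.linalg.svd(centered, full_matrices=False)

    k = 7                            # keep the first k directions (B'')
    reduced = centered @ Vt[:k].T    # each row: a k-number "short" coordinate

    # Fraction of the total variation captured by the first k components:
    print((s[:k] ** 2).sum() / (s ** 2).sum())

You'd still need to round or bucket those k coordinates and encode them into something printable to get actual names, but that part is mechanical.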

With things like grocery store receipts (each axis is the number of a particular item), this can get you from 150 axes down to like 20.

Does this make sense? You can google "high-dimensional data reduction" and "principal-component analysis" for more information.
posted by d. z. wang at 10:11 AM on May 29, 2011

