How can I quickly match one of 150 million names to a code?
September 3, 2009 10:10 AM

I have up to 150 million product names, and I need to match each of them to one or more codes. How can I do this very quickly and with minimal infrastructure requirements?

Basically, up to 100,000 names will be checked against the master list of 150 million to see if they are on that list (they may not be). If they are, the system will return one or more codes. While I'm sure I could use a database for this, I'd love to hear suggestions for alternatives such as a disk-backed HashMap or similar technology. A Java-based solution, or one with Java bindings, would be preferred. The system could possibly dedicate up to 1 GB of RAM in a transient fashion to this lookup (though much less would be preferred).

The names and mappings to code are relatively static, changing once a month at their maximum. I can pre-generate a digested format if that is required for the library to work.

A name is a string of up to 255 characters; a code is a Java int.

This would need to run on a single system with 64-bit Java JDK 1.5.0, and I'd rather the solution use in-process communication instead of network communication.

A solution using a single digested custom data file and a single Java jar library would be very nice.

Thanks in advance.
posted by bottlebrushtree to Computers & Internet (6 answers total) 2 users marked this as a favorite
 
Derby for your inproc DB and very simple SQL.
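Roughly what the embedded side would look like (just a sketch; the database, table, and column names here are made up, and you'd bulk-load and index the 150M rows ahead of time):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class DerbyLookup {
        public static void main(String[] args) throws Exception {
            // Embedded Derby driver -- runs inside your JVM, no network involved.
            Class.forName("org.apache.derby.jdbc.EmbeddedDriver");
            Connection conn = DriverManager.getConnection("jdbc:derby:namecodes");

            // Assumes a pre-built table NAME_CODE(NAME VARCHAR(255), CODE INTEGER)
            // with an index on NAME.
            PreparedStatement ps = conn.prepareStatement(
                    "SELECT CODE FROM NAME_CODE WHERE NAME = ?");
            ps.setString(1, "some product name");

            ResultSet rs = ps.executeQuery();
            while (rs.next()) {
                System.out.println("code: " + rs.getInt(1));
            }
            rs.close();
            ps.close();
            conn.close();
        }
    }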
posted by cmm at 10:19 AM on September 3, 2009


100,000 names how often? Every second, hour, day, month?

Do you expect most of the lookups to succeed, or just a small fraction?
posted by effbot at 11:12 AM on September 3, 2009


Response by poster: Most lookups should fail; some fraction (thousands to tens of thousands) will succeed.
This would be done once per day.
posted by bottlebrushtree at 11:58 AM on September 3, 2009


It wouldn't be all of the solution, but you could use a Bloom filter as an optimization. This is a very space-efficient structure that would let you test quickly and purely in-memory whether a name was likely to be a member of the set. It's probabilistic, giving false positives but not false negatives, so if a name is rejected by the Bloom filter you can discard it immediately, but if it's accepted then you can do a more expensive database lookup to confirm. You should be able to get fairly good accuracy if you devote a gig of memory to it. It's fairly simple to write one, and I've come across Java implementations before.
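Here's roughly what one looks like in plain Java (just a sketch; the bit-array sizing and hash mixing are illustrative rather than tuned for your data):

    /** Minimal Bloom filter sketch. Add all 150M names up front, then use
        mightContain() as a cheap pre-filter before the real lookup. */
    public class BloomFilter {
        private final long[] bits;
        private final long numBits;
        private final int numHashes;

        public BloomFilter(long numBits, int numHashes) {
            this.numBits = numBits;
            this.numHashes = numHashes;
            this.bits = new long[(int) ((numBits + 63) / 64)];
        }

        public void add(String name) {
            long h1 = hash(name, 0x9E3779B97F4A7C15L);
            long h2 = hash(name, 0xC2B2AE3D27D4EB4FL);
            for (int i = 0; i < numHashes; i++) {
                setBit(((h1 + i * h2) & Long.MAX_VALUE) % numBits);
            }
        }

        public boolean mightContain(String name) {
            long h1 = hash(name, 0x9E3779B97F4A7C15L);
            long h2 = hash(name, 0xC2B2AE3D27D4EB4FL);
            for (int i = 0; i < numHashes; i++) {
                if (!getBit(((h1 + i * h2) & Long.MAX_VALUE) % numBits)) {
                    return false;  // definitely not in the set
                }
            }
            return true;  // probably in the set -- confirm with the real lookup
        }

        private void setBit(long index) {
            bits[(int) (index >>> 6)] |= 1L << (index & 63);
        }

        private boolean getBit(long index) {
            return (bits[(int) (index >>> 6)] & (1L << (index & 63))) != 0;
        }

        // Simple FNV-style string hash with a seed, so we get two
        // independent-ish base hashes to combine.
        private static long hash(String s, long seed) {
            long h = seed;
            for (int i = 0; i < s.length(); i++) {
                h ^= s.charAt(i);
                h *= 0x100000001B3L;
            }
            return h;
        }
    }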
posted by siskin at 12:37 PM on September 3, 2009


Assuming that you want the results pretty quickly, siskin's approach should work just fine. And the Bloom filter doesn't have to be that big; a 200 MB table should give you about a 1% false positive rate, which is a mere 1,000 extra hits in your case. That leaves you 800 MB for the (in-process) database cache, which is way more than you need.
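If you want to check the arithmetic, the standard Bloom filter sizing formulas work out like this (numbers in the comments are what this prints):

    public class BloomSizing {
        public static void main(String[] args) {
            long n = 150000000L;  // number of names in the master list
            double p = 0.01;      // target false positive rate

            // m = -n * ln(p) / (ln 2)^2 bits, k = (m/n) * ln 2 hash functions
            double bits = -n * Math.log(p) / (Math.log(2) * Math.log(2));
            double hashes = (bits / n) * Math.log(2);

            System.out.println("bits:   " + (long) bits);                      // ~1.44 billion
            System.out.println("MB:     " + (long) (bits / 8 / 1024 / 1024));  // ~170 MB
            System.out.println("hashes: " + Math.round(hashes));               // ~7
        }
    }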

And yes, if you can spend the whole day generating the results, you can skip the Bloom filter even if you have a steam-powered 1.5 QPS database engine. :)

(There are fancy database engines that can do Bloom filtering all on their own, but I'm not up to date on the Java universe so I'm not sure how common that is.)
posted by effbot at 1:17 PM on September 3, 2009


I think you are overthinking this. 150M records isn't that much data unless the records themselves are large. A balanced binary tree would only be about 28 nodes deep; a real-world balanced tree would be worse, say 50 nodes. That's only 50 comparisons per record. Even with disk access, you could run thousands of lookups per second. Unless you plan to scale this up, a simple sorted database would be fine.
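If you wanted to skip the database entirely, a binary search over a pre-sorted fixed-width file would do it. A sketch (the 260-byte record layout is just one possible choice, and the file would need to be sorted the same way String.compareTo sorts):

    import java.io.IOException;
    import java.io.RandomAccessFile;

    /** Binary search over a pre-sorted, fixed-width data file.
        Record layout (an assumption): 256 bytes for the name, padded with
        zero bytes, followed by a 4-byte int code. */
    public class SortedFileLookup {
        private static final int NAME_BYTES = 256;
        private static final int RECORD_BYTES = NAME_BYTES + 4;

        private final RandomAccessFile file;
        private final long recordCount;

        public SortedFileLookup(String path) throws IOException {
            file = new RandomAccessFile(path, "r");
            recordCount = file.length() / RECORD_BYTES;
        }

        /** Returns the code for a name, or -1 if the name isn't in the file. */
        public int lookup(String name) throws IOException {
            long lo = 0;
            long hi = recordCount - 1;
            byte[] buf = new byte[NAME_BYTES];
            while (lo <= hi) {
                long mid = (lo + hi) / 2;
                file.seek(mid * RECORD_BYTES);
                file.readFully(buf);
                String candidate = new String(buf, "US-ASCII").trim();
                int cmp = name.compareTo(candidate);
                if (cmp == 0) {
                    return file.readInt();  // the code follows the name bytes
                } else if (cmp < 0) {
                    hi = mid - 1;
                } else {
                    lo = mid + 1;
                }
            }
            return -1;
        }
    }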
posted by chairface at 9:30 AM on September 4, 2009

