relational database alternatives
October 19, 2010 10:10 PM   Subscribe

What is the best method to accomplish the following: Assume a database of every American currently alive. For reasons beyond the scope of this question, assume they are in a relational database and have record numbers 1 through 310,520,158.

I want to create attributes for each person, such as gender, birth date, number of pet dogs, eye color, etc. I want to be able to create new ones on the fly as well as remove them (realizing full well that additions could take a long time to index).

My number one concern is retrieval speed. For example, I want to be able to list all one-eyed, red-haired women between the ages of 33 and 35 who live in Colorado and have more than 10 pet cats. This would give a list of record numbers that satisfy my search.

My other concern is expense. I want to be able to put this on one powerful computer with a few hard drives in it at most. (Unrelated, but 3-terabyte hard drives came out recently!)

What technologies should I investigate other than a relational database to accomplish this feat? Techniques or open source suggestions are greatly appreciated.

Thanks in advance!
posted by santogold to Computers & Internet (10 answers total) 2 users marked this as a favorite
If you use Amazon S3, you can store key-value pairs in a cloud environment. You store and retrieve data via HTTP requests.

Depending on how you craft the HTTP requests and the key naming scheme, you can use customized requests as an ersatz database query.

Or you could look at their SimpleDB service, which provides something more SQL-like — you're not reinventing the wheel.

You wouldn't need a powerful computer or lots of storage because Amazon takes care of the capital investment for you. You pay for data upload and download requests and ongoing storage.
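To make the "key naming scheme as ersatz query" idea concrete, here's a sketch in plain Python. The layout and names are entirely illustrative (this is string manipulation, not an S3 API call): encoding queryable attributes into the key means a prefix listing can act as a crude index.

```python
# Hypothetical key layout: put filterable attributes into the key itself,
# so listing keys under a prefix approximates a conjunctive query.
# All names and values here are illustrative assumptions.
def make_key(person_id, state, hair, eyes):
    return f"state={state}/hair={hair}/eyes={eyes}/id={person_id}"

key = make_key(310520158, "CO", "red", "one")
print(key)  # state=CO/hair=red/eyes=one/id=310520158

# Listing everything under this prefix would return red-haired Coloradans:
prefix = "state=CO/hair=red/"
print(key.startswith(prefix))  # True
```

The obvious limitation: you can only query efficiently along the attribute order baked into the key, which is part of why SimpleDB's more query-like interface is attractive.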
posted by Blazecock Pileon at 10:27 PM on October 19, 2010

Assuming that you're comfortable working directly with the table definition, meaning that you don't flinch at adding columns to an existing table, then your best bet for speed and compactness will be a single table with a row for each American. You can add and drop columns easily with any competent RDBMS like MySQL, it's just generally not considered good practice to be continually mucking with the database that way. But if you're not in a production environment--if this is a data repository, not the engine of a business app or something--then there's no reason not to.

The only major drawback to this is if you have data that should be normalized. It's not wrong to create a table with columns like 'address', 'city', 'zip code', etc., but if you have a particular type of data that has an arbitrary number of entries (such as 'employers'), then it's poor design to have columns like 'employer_1', 'employer_2', etc.

In other words, if you just want to have a bunch of attributes that stand in a one-to-one relation with the basic entity defined for each row (i.e., a U.S. citizen), then go ahead with one table. If you can keep it to one table, then you're working with an RDBMS's strength in terms of querying the data and selecting arbitrary sets from it. I doubt you'll find anything faster.
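As a sketch of the single-wide-table approach, here's the shape of it using SQLite (Python's stdlib) in place of MySQL; the SQL is the same in spirit, and all column names are illustrative:

```python
import sqlite3

# One row per person, one column per attribute (names are illustrative).
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE person (
    id INTEGER PRIMARY KEY, gender TEXT, birth_date TEXT,
    eye_count INTEGER, hair_color TEXT, state TEXT, cat_count INTEGER)""")

# Adding an attribute "on the fly" is just an ALTER TABLE:
db.execute("ALTER TABLE person ADD COLUMN dog_count INTEGER DEFAULT 0")

db.execute("INSERT INTO person VALUES (1,'F','1976-06-01',1,'red','CO',12,0)")

# A composite index covering the common filter columns keeps queries fast.
db.execute("CREATE INDEX idx_search ON person (state, hair_color, eye_count)")

rows = db.execute("""SELECT id FROM person
    WHERE eye_count = 1 AND hair_color = 'red' AND gender = 'F'
      AND state = 'CO' AND cat_count > 10""").fetchall()
print(rows)  # [(1,)]
```

On a 300-million-row table the ALTER TABLE and index builds are where the "could take a long time" caveat bites, but the query pattern itself is exactly what an RDBMS is built for.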
posted by fatbird at 10:29 PM on October 19, 2010 [1 favorite]

Something like CouchDB might work well for your stated purposes. It is a JSON document-oriented database, and it is open-source.
posted by k8lin at 10:39 PM on October 19, 2010

What fatbird says. This is bread and butter for a relational db. Not sure why you ask for solutions "other than relational database" -- any alternative will basically be replicating well-established relational db functionality, and there are plenty of open source solutions in this space.

Indexing is critical, but it will blow out your db size and slow down your inserts (which you seem to anticipate). 300 million rows is substantial, but searches are O(log n) -- a perfect binary tree is about 30 levels deep, so it's definitely manageable. 3TB leaves 10kB per person, which should be plenty unless you're storing everyone's life history. (OTOH, how/whether something like MySQL can operate on a 3TB table is beyond my experience.)
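The "about 30 levels deep" figure is easy to sanity-check — the depth of a balanced binary search tree over the full population:

```python
import math

# Depth of a perfectly balanced binary tree over ~310 million records.
n = 310_520_158
depth = math.ceil(math.log2(n))
print(depth)  # 29
```

So an index lookup touches on the order of 30 nodes regardless of table size, which is why searches stay manageable.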

I think you'd want to evaluate normalization with a critical eye, though. The commonly replicated values (home state, hair color, etc) can be stored in very small columns, and even though you say you're optimizing for query speed it would be a real pain to manage the tables of peripheral data. (For every person you add, for example, you'd have to do a find-or-create on separate tables for employer, hair color, home state, etc etc -- and ideally do a delete-if-last on those tables when removing a person as well. Ugh.)
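The find-or-create chore described above looks like this in miniature, again using SQLite for illustration (table and column names are assumptions):

```python
import sqlite3

# A normalized lookup table, e.g. for hair color (names are illustrative).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE hair_color (id INTEGER PRIMARY KEY, name TEXT UNIQUE)")

def find_or_create(table, value):
    # Return the id for this value, inserting it first if it's new.
    row = db.execute(f"SELECT id FROM {table} WHERE name = ?", (value,)).fetchone()
    if row:
        return row[0]
    return db.execute(f"INSERT INTO {table} (name) VALUES (?)", (value,)).lastrowid

a = find_or_create("hair_color", "red")
b = find_or_create("hair_color", "red")  # second call finds, doesn't insert
print(a == b)  # True
```

Multiply that by every normalized attribute on every insert (plus delete-if-last on removal) and the "ugh" above is easy to sympathize with.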
posted by bjrubble at 11:16 PM on October 19, 2010 [1 favorite]

I have no idea what platform you're running this on, but one way you can get really quick queries on a data set like that is with an ADTree; that page is a brief introduction with a few pointers to additional information. An ADTree is a way of precaching statistics about a data set, so the more variables you have, the bigger the tree will be - theoretically they can grow exponentially in size (and also take a correspondingly long time to build), but in a lot of real-world cases (like ones where most arbitrary combinations of variables never happen), they're very efficient, and even the huge ones, once they're built, take practically no time to query.
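To give a flavor of the precaching idea (this is a naive toy, not a real ADTree — a real one prunes aggressively, while this cache genuinely grows exponentially in the number of attributes): count every conjunction of attribute=value pairs up front, so any later conjunctive count is a single dictionary lookup.

```python
from collections import Counter
from itertools import combinations

# Toy records; attributes and values are illustrative.
records = [
    {"hair": "red", "state": "CO", "eyes": "one"},
    {"hair": "red", "state": "CO", "eyes": "two"},
    {"hair": "brown", "state": "WA", "eyes": "two"},
]

# Precompute counts for every conjunction of (attribute, value) pairs.
cache = Counter()
for rec in records:
    items = sorted(rec.items())
    for r in range(1, len(items) + 1):
        for combo in combinations(items, r):
            cache[combo] += 1

# "How many red-haired Coloradans?" is now one lookup, no table scan.
query = (("hair", "red"), ("state", "CO"))
print(cache[tuple(sorted(query))])  # 2
```

The build cost is paid once; after that, queries over the cached conjunctions take practically no time, which is the property wanderingmind describes.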
posted by wanderingmind at 11:18 PM on October 19, 2010 [1 favorite]

If you have normalized fields then use a relational database. A semi-structured database, or document-based database, doesn't fit the data as you've described it. 500 million rows is commonly done in relational databases. Use PostgreSQL, not MySQL.

More importantly, software alone isn't a magic bullet and you'll need someone who knows what they're doing. I'm always wary of giving advice for massive projects on a web forum, and a big project like this needs an expert familiar with your concerns who can tweak the database as it goes.
posted by holloway at 12:57 AM on October 20, 2010 [1 favorite]

Benchmark, benchmark, benchmark. Then, when you have further questions, benchmark. Every single person who says "Oh, but X technology is out of date, use Y instead!" is bringing their biases and perspectives to the table. Why trust hypotheticals when you can put together a benchmark meeting your requirements?

What you've described exactly matches the use case of a relational database. What benchmark did you use to disqualify e.g. Oracle or PostgreSQL?

Regardless, you're also asking the wrong questions. You're concerned with a) lookup speed and b) price, but you've jumped straight to technologies rather than the methods and trade-offs behind them. You'll need to consider:
  • Horizontal sharding: Using more than one server to transparently represent a data set, e.g. PL/Proxy.
  • ACID (atomicity, consistency, isolation, durability) requirements: Are you willing to sacrifice ACID in exchange for performance? Most NoSQL databases sacrifice one or more of ACID (oh I know your dirty secrets MongoDB and redis).
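In the "benchmark, benchmark, benchmark" spirit, a minimal harness looks like this — synthetic data at toy scale (row counts and column names are illustrative; a real benchmark would use production-scale data on your actual hardware):

```python
import sqlite3
import time

# Load synthetic rows, then time the actual query you care about.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, state TEXT, cats INTEGER)")
db.executemany("INSERT INTO person (state, cats) VALUES (?, ?)",
               [("CO" if i % 50 == 0 else "TX", i % 20) for i in range(100_000)])
db.execute("CREATE INDEX idx_state ON person (state)")

start = time.perf_counter()
n = db.execute(
    "SELECT COUNT(*) FROM person WHERE state = 'CO' AND cats >= 10"
).fetchone()[0]
elapsed = time.perf_counter() - start
print(n, f"{elapsed * 1000:.2f} ms")
```

Swap in each candidate engine and the same query shape, and you have numbers instead of hypotheticals.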

posted by asymptotic at 5:01 AM on October 20, 2010

Maybe the keyword you are looking for is NoSQL. Either a relational db or any of the existing key-value (S3, Hbase, others), Document-oriented (MongoDB, Couch), or, if you want to get really fancy... some sort of map-reducey thing....

So, wins and losses:

SQL/RDBMS: if it's all one big table (and 300 million is biggish), then indexing after changes might be slowish. All rows need the same structure. Normalization and joins on all the geo stuff might make sense. Wins: SQL exists, and the tech is very well understood. A solution to consider is partitioning on name letter, or something like that.

Document-oriented: Easy data insertion, searching, indexes (particularly on Mongo). CAP theorem. Hard to 'normalize' out the geo stuff. Win: different fields can be filled for different people... for example if you want to add fields for just people in MN or WA it is easy to do.

Other Key-Value stores: you have an obvious id (key). unfortunately, it's not one where the values really mean anything, so ordering on it doesn't help much.

(see other answers I have made about this topic for more details)

Me: I would partition on first two letters of last name (500 tables), and do it in Postgres, and store city and state normalized properly.
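The routing logic for that kind of name-based partitioning is trivial — a sketch, with a hypothetical naming scheme and a crude pad for one-letter surnames:

```python
# Route each person to a table named by the first two letters of their
# last name (table-name prefix and padding rule are illustrative choices).
def partition_table(last_name):
    prefix = (last_name[:2].upper() + "A")[:2]  # pad one-letter names
    return f"person_{prefix}"

print(partition_table("Smith"))  # person_SM
print(partition_table("O"))      # person_OA
```

Postgres can also do this natively with table partitioning, so the routing doesn't have to live in application code.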

(other questions: do you want to store history? Changesets? If so, some of the K-V/doc stores make that easy... (hbase, couch, mongo).)
posted by gregglind at 7:36 AM on October 20, 2010

I just want to caution you on your idea that you'll want to have a small number of physical disks supporting this database. Generally you will get much better performance if your data is spread among a large number of small disks than a small number of large disks.
posted by Maisie at 10:35 AM on October 20, 2010

You do not want to be regularly adding and dropping columns to/from a 300 million row table.

Look into using a linked EAV (entity/attribute/value) table, keyed off of the primary key in your person table, to store your attributes.
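A sketch of that EAV layout, using SQLite (Python's stdlib) for illustration — table and column names are assumptions:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE person (id INTEGER PRIMARY KEY)")
# One row per (person, attribute) pair, keyed off the person table.
db.execute("""CREATE TABLE person_attr (
    person_id INTEGER REFERENCES person(id),
    attribute TEXT, value TEXT,
    PRIMARY KEY (person_id, attribute))""")
db.execute("CREATE INDEX idx_attr ON person_attr (attribute, value)")

db.execute("INSERT INTO person (id) VALUES (1)")
db.executemany("INSERT INTO person_attr VALUES (1, ?, ?)",
               [("hair_color", "red"), ("state", "CO"), ("cat_count", "12")])

# Adding or dropping an attribute is now just inserting or deleting rows --
# no ALTER TABLE against the 300-million-row person table.
rows = db.execute("""SELECT person_id FROM person_attr
    WHERE attribute = 'hair_color' AND value = 'red'""").fetchall()
print(rows)  # [(1,)]
```

The trade-off: multi-attribute queries become self-joins on person_attr, so the flexibility on writes is paid for in query complexity.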
posted by coolgeek at 9:43 PM on October 25, 2010
