Join 3,441 readers in helping fund MetaFilter (Hide)


How do I NoSQL?
July 10, 2010 11:42 AM   Subscribe

My bread and butter is and always has been data stored in relational databases. I keep seeing a lot of material on key-value storage (also known as the NoSQL movement) but haven't had as much success finding resources on how to move from relational to non-relational. Any suggestions?

I know that some of this involves denormalization, but how? Step-by-step guides where someone takes a heavily relational database and moves it to a key-value system would be appreciated, as well as gentle guides for the non-relationally challenged.
posted by Deathalicious to Computers & Internet (10 answers total) 19 users marked this as a favorite
 
Well at least with Cassandra you can group key/value pairs into column groups which lets you create structured rows. So instead of having a names table joined on an addresses table joined on a phone numbers table, you'd just define all your desired fields as a column group that is collectively accessed at once.
posted by Rhomboid at 12:36 PM on July 10, 2010


Why are you doing this? What bottlenecks are you hitting? A key/value approach can often be a solution in search of a problem. Having said that, you could try Building Scalable Websites by Cal Henderson.
posted by Leon at 12:59 PM on July 10, 2010


As I understand it, the NoSQL movement isn't aiming to replace relational data modeling with something that's superior; it's a reaction to using MySQL for every damned website out there as the default because a lot of data-driven stuff isn't usefully normalized along the relational model--simple key/value modeling is more than sufficient (and often more performant), but there haven't been a lot of options besides MySQL in general, and BerkelyDB for key/value mdoels (which isn't the best for performance reasons).

In other words, you're not likely to find someone deconstructing a heavily normalized database into a key/value model because no one thinks that's appropriate--data that's susceptible to heavy normalization is the best use case for a relational system.

To see NoSQL in action, look at memcached. It's a distributed key/value database that's used heavily by sites like LiveJournal and Facebook to provide a very performant caching layer. That's a strong use case in the NoSQL movement.
posted by fatbird at 2:15 PM on July 10, 2010 [2 favorites]


As others have pointed out, NoSQL is meant to supplement RDBMS's and only supplant them where it makes sense to do so. I happen to be very partial to MongoDB, which (like CouchDB) is a document-oriented db. What does that mean? Basically, if you serialize your *thing to store* to JSON, or the like (i.e., it's just strings, ints, lists, hashmaps/dicts), then you can just store it *as is* in the db, and search on it directly.


Contrast this to some of the other options presented like Cassanda, BigTable, HBased, BerkeleyDB which are more "key-value" stores, OR things like Neo4j, HyperGraphDB, which are graph databases. If you want a Graph database, you'll know it :) General graph traversal in RDBMS's tends to be, um, *hairy*.

---
More concrete: imagine a user profile with this stuff -> name, password, some email addresses, some phone numbers, avatar image etc.

In an RDBMS, the ones with "multiples", like email, phone numbers, etc should really be stored in different tables, and pulled back together using joins.

In a document-oriented db, (such as Mongo), you just stick the *whole thing* in as a 'row'. And retrieve it as a whole row, no joins.

----
There are a lot of trade-offs.. your framework of choice probably expects SQL (unless it doesn't!); RDBMSs have lots of tools. People know sql!

But....

Often NoSQL solutions can get around all the things you need those extra tools for already!
In general, NoSQL stuff tend to put much of the validation / constraint checking back on the App rather than in the db.

(I consult on this stuff, so let me know if you have more questions!)
posted by gregglind at 2:25 PM on July 10, 2010 [1 favorite]


I've done some projects in both relational and non-relational databases. I'm not an expert in either, but I do know my way around both. I haven't found any good guides either for transitioning from relational to key-based, but I have seen a growing number of questions like yours, so it's quite likely something is out there, or soon will be. Having said that, here's my thoughts based on my workings with both types of data structures:

Short Version* (which tries to answer your question as direct as possible)
1) In your database, you have tables and relations between those tables (one-one, one-many, many-many). Those relationships are not going to be part of your key-value datastore. Those relationships would be defined by logic that exists outside of the datastore. In other words, in a key-value datastore, data relationships are part of the business logic, not the datastore logic. This is a subtle but important point.

2) In a relational database the tables themselves are structured. They (usually) have a particular field to act as a id or key field, and then a set number of "columns" for actual data content. Key-Value storage is unstructured, there is just the key that points to "data". This data *may* have structure, but if it does it's up to the logic outside of the datastore to make sense of that structure.

3) Migrating from a relational database to a key-value datastore. First, identify all relationships between tables and understand that those relationships are going to have to be modeled in the code managing the datastore. The second step is more tricky, as it depends on why you want to migrate from a relational to key-value model, but essentially it requires you to model the data in the tables into key-value relationships. Note that these can take the form of key => { subkey1: value, subkey2: value } where you can create an arbitrarily deep structure of nested key-values (though keeping it as flat as possible is usually best for performance).

Like the others have mentioned, whether you use a RDBMS or a key-value datastore depends on what it is you want to do. If you have static, complex relationships between data that are handled readily by an RDBMS then you should definitely use an RDBMS. If your data structures change dynamically based on business logic or data content (my reason for going NoSQL), or (as others have also pointed out) there are no complex relationships than a key-value flavored datastore might work for you.

Here's some rules of thumb.
* If your app can keep the data it in a flat file (think JSON) and work fine, use a key-value datastore (better scaling and performance).
* If you can model your datastore needs in an RDBMS. Then use an RDBMS.
* If you can't model your datastore needs in an RDMBS model, then you might have better luck with a key-value datastore (but you would need to code the model in addition to having the key-value datastore.

*There is no long version, because the short version was already so long.
posted by forforf at 2:51 PM on July 10, 2010 [1 favorite]


How one migrates data from relational database to a k-v store is very much dependent on both the schema and the use cases. There are certainly patterns, many of which, like you mentioned, utilize denormalization. So a for instance, tags on blog posts:
In a relational database you end up with something like:
Table posts
mediumint id (pk)
varchar title
varchar descr

Table tags
mediumint id (pk)
varchar tag_name
mediumint post_id (fk)

You then join the two tables matching the tag's post_id to the posts' id.
In a non-relational database you end up with something much more denormalized:
in a document db, you would have a document for each post, which in turn would have a list of tags. If you don't have good secondary indexing, you're going to denormalize by creating a document for each tag, which would have a list for each post with that tag. So:
Document posts
int id (key)
string title
string descr
string[] tags

Document tags
int id (key)
string tag
int[] posts // This is not fully dernormalized. If it were this would be posts[] posts, but let's not get wacky.
A k-v store would be more or less the same. The basic idea would be, without joins and secondary indicies, you are going to make multiple writes where there used to be one.

So that's a little example, but if you want advice:
First stop thinking about NoSQL as if you can learn about NoSQL. As others have said, pretty much anything that isn't Postgres, MySQL, DB2, or Oracle is now called NoSQL. Figure out what you are really looking for (speed, flexibility, distributed capabilities, more complex relationships and hierarchies), then look at what is available in those specific fields. But if I was going to make a survey course, I would:
- Read http://ianvarley.com/UT/MR/Varley_MastersReport_Full_2009-08-07.pdf. It's a little outdated in some sections (it is a whole YEAR old), but is a great primer.
- Download and play with redis. Now. If you are on a system with a credible gcc implementation, you will be writing data into Redis within the next 5 minutes. Now, Redis, has gotten a little more complex over time, but it's still really easy to jump into. It is also one of the more elegant pieces of software I have ever dealt with. Note the existence of data structures (sets, lists, sorted sets, hashes), expiration primatives, and blocking operations. If you want to learn more about it, here is (long) talk from Salvatore, the creator of redis: http://www.ustream.tv/recorded/7855635.
- Learn about mongodb. It's a pretty controversial little beastie, but it's very interesting. There are lots of interesting talks @ http://www.10gen.com/event_mongosf_10apr30. Once you get a good grip on the schema/query design, spend a little time learning about the Ops side of Mongo. The entire way of thinking about deploying mongo is different.
- Read http://arin.me/blog/wtf-is-a-supercolumn-cassandra-data-model. Cassandra is a very cool project and their approach to db design with be studied for years to come, even if it never really mainstreams (a la erlang, lisp, smalltalk). If at the end of this, you want to learn more, I'm sure you can dig up more resources.
- Somewhere along this journey, when you start to wonder how the hell we got down this crazy path, you are going to want to read the following papers:
- CAP theorem - http://dbmsmusings.blogspot.com/2010/04/problems-with-cap-and-yahoos-little.html
- BigTable - http://labs.google.com/papers/bigtable-osdi06.pdf
- Dynamo - http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf

Godspeed. Me-mail if I can help.
posted by yeoldefortran at 3:39 PM on July 10, 2010 [4 favorites]


Sorry, forgot askme doesn't automatically highlight urls... sigh.
posted by yeoldefortran at 3:41 PM on July 10, 2010


Thanks for the answers. The main reason I asked this question is it seems like alternate data storage systems are becoming more popular and at some point they're going to be a tool I need to use rather than choose to use. At that point I'd like to be more familiar with them.

If I was setting up something like a Facebook my instinct would be to use a lot of joins and a lot of related tables, so a big part of my interest is how to set up systems like that that don't use that.

Part of the interest also comes with my recent work in Django. Seems like a lot of people are getting excited about using MongoDB and CouchDB with Django to create apps that can serve a lot of people quickly and efficiently. It's more appealing to me as well because Django has a habit, unless you're very careful, of making many, many database queries.
posted by Deathalicious at 11:47 AM on July 11, 2010


Thanks everyone. My interest at this point is mostly curiosity. I'm not doing any active migration right now but it'd be nice to get a handle on it in case I might need to do it in the future.

Most of the sites I build don't really get the kind of traffic where this is a real consideration yet, but I have used memcached to cache pages etc.

An addon question is why json can actually be a more scalable way to store data. Seems like a database structure would be more optimized.
posted by Deathalicious at 12:52 PM on July 11, 2010


Re: json

Document-oriented (free form) DB's *aren't* more space efficient, and if I gave you that impression that was incorrect. The wins there are:

* no joins
* easy to understand
* the 'null' problem goes back to the app / business logic, where it belongs.

If your data has a standard structure, then having to repeat the keys in every row is space inefficient.

Also, there are lots of kinds of scalable! Programmer time, lookup time, indexing time, storage size, etc. Pick which ones you need, and choose a tech accordingly!
posted by gregglind at 2:48 PM on July 27, 2010


« Older How to make legal information ...   |  I'm looking for a one-week liv... Newer »
This thread is closed to new comments.