How is Facebook doing its queries?
February 4, 2008 9:49 AM   Subscribe

How does Facebook handle or simplify the presumably complicated DB queries involved so that me loading my page doesn't bring it to its knees?

When I pull up my homepage on Facebook, it pulls status updates for a subset of my friends, and recent bits of their activity. It filters out the ones that are related to people who have blocked me (hopefully none!) and then shows it all to me.

This information is all very specific to me, so presumably caching won't help much with it. But it seems like a hell of a lot of DB queries. If that's so -- it may merely be my "SELECT *"-level understanding of SQL making it seem that way -- how do they do it efficiently?

(I'm particularly interested in how the 'blocked user' can work: would it really pull a status update like "Matt Haughey and Jessamyn West are now friends", then pull Jess's block list to see if I'm on it? That's two hits for a one-line entry!)
posted by bonaldi to Computers & Internet (27 answers total) 83 users marked this as a favorite
 
For the blocked friends thing, there's probably a cross-reference table linking you to your friends. Jessamyn is removed from that table when she blocks you.
posted by mpls2 at 9:51 AM on February 4, 2008


I should say, the entry linking you to Jessamyn is removed from that table.
posted by mpls2 at 9:54 AM on February 4, 2008


De-normalized schema. They have everything in there several times so that the db can be optimized for query time.

They use a LOT of hardware. You think Facebook is complex? At work I can pull a query returning thousands of transactions out of Salesforce in less time than it takes to start Excel. And how? Big servers. And lots of little servers.

They use crazy db partitioning techniques to improve data locality.

They have lots of indexes.

But there's no one thing they do. They do a lot of things. It takes a lot of work to make the magic happen.
posted by GuyZero at 9:57 AM on February 4, 2008


This kind of problem is why building a large scale social web service is hard. Remember Friendster and it's many-months public collapse? It's easy to build a social networking system when you have less than 100,000 users.

If I were building Facebook, I'd get the database out of the main profile page path as much as possible. Store important bits of data in RAM with memcached or similar. When you build enough of a system this way it stops being simple caching and becomes an in-memory distributed object store.
posted by Nelson at 10:03 AM on February 4, 2008


mpls2: I think it must be more than that, because entries will come up between Friend X and Not-Friend Y, when presumably Y isn't on the table either.

If I were building Facebook, I'd get the database out of the main profile page path as much as possible. Store important bits of data in RAM with memcached or similar.
Yes, that's what I thought too. But there surely isn't much overlap across their thousands of users, and hence not a huge scope for cache. Unless they're doing it by your Location group.
posted by bonaldi at 10:07 AM on February 4, 2008


Honestly, if you want to know how big systems like that, look into the architecture of eBay, or Amazon, or Google. They deal with levels of data that are incomprehensible.
posted by blue_beetle at 10:10 AM on February 4, 2008


Facebook definitely uses lots of memcached.
posted by grouse at 10:11 AM on February 4, 2008


One of my colleagues told me he learned in a recent meetup that Facebook uses memcached extensively. Now while almost everyone big uses memcached extensively, I imagine they're probably storing a bunch of generated HTML in there. I agree with GuyZero about storing information several times. One thing they can do is every time someone performs an action, determine everyone that would receive knowledge of that action and append the generated HTML for the action to all of them. Then reads are fast, writes are slow. But there will be a hell of a lot more reads than writes.

If you're really interested in this kind of stuff, this article might be good reading, too.
posted by bertrandom at 10:14 AM on February 4, 2008 [1 favorite]


Thanks for those links, blue_beetle, I've been reading some of them while thinking about this. The thing that confuses me about Facebook is the conceptual side of it -- how exactly would the query for blocked be constructed, for instance, even if it would take 30 seconds to run on ordinary hardware?
posted by bonaldi at 10:14 AM on February 4, 2008


bertrandom: that makes a lot of sense to me -- storing up the info as it is created, not as it is sought. Thanks!
posted by bonaldi at 10:15 AM on February 4, 2008


Also, the required reading on this subject is Building Scalable Websites by Cal Henderson, the chief architect at Flickr. Maybe if you're lucky he'll post here, too.
posted by bertrandom at 10:27 AM on February 4, 2008


Ok, if you REALLY want to know how this stuff works, you look at examples that have been posted about other sites.

An excellent thing to read to get a handle on how megasites work with their data storage is:

http://danga.com/words/2007_06_usenix/usenix.pdf

This set of slides details livejournal and comes from the people who created memcached.

Danga has been very public about their setups and you can get lots of information by simple google searches.


Flickr has been vocal about their setup also:
http://radar.oreilly.com/archives/2006/04/database_war_stories_3_flickr.html



Another good read is the Baseline Inside Myspace article:

http://www.baselinemag.com/c/a/Projects-Networks-and-Storage/Inside-MySpacecom/

The article covers alot, but the quick and dirty

# 500,000 Users: A Simple Architecture Stumbles
# 1 Million Users:Vertical Partitioning Solves Scalability Woes
# 3 Million Users: Scale-Out Wins Over Scale-Up
# 9 Million Users: Site Migrates to ASP.NET, Adds Virtual Storage
# 26 Million Users: MySpace Embraces 64-Bit Technology

---

And , now if you've stuck around to this point, I'll give you the hottest link in the bunch:

http://www.highscalability.com

This site culls info from interviews , articles ect and breaks down the database structures and other aspects of the megasites.

Read this site and you will know what it means when someone answers your original question with "Facebook? I imagine they are using a denormalized sharded multimaster slave setup with memcached handling the server selection and some of the more common selects"

And yes, I know saying "Sharded multimaster slave setup" is redundant and I could have just said "sharded setup"

Peter Azuolas
posted by petethered at 11:00 AM on February 4, 2008 [13 favorites]


Pretty much what bertrandom said.

1. You build the data as it is created not as it is requested.
So, your friend makes an update which will appear on your profile page. That update gets replicated locally to a piece of data that belongs to you. Getting that data is now a simple SELECT not involving 100 joins.

2. Caching at every level. If you decide to start hitting reload constantly on your profile page even though nothing is happening, I wont even take you to the database. Many cache solutions (I've used TimesTen not memcached) have an invalidation solution so that when an update does happen, I can take you off the cache.

3. Read slaves. To avoid too much load on one server, especially in a high read/write environment, create farms of machines which all serve Read requests. They get downstream updates from central update servers (Usually one update server unless you're doing something high-end multi-master like Oracle RAC) but these can be packaged up efficiently in tight streams.

4. Do queries on your metadata. Separate out your data and index the hell out of it for fast access. In the case of Flickr, they have separate image servers which dont get requests more complex than "get this specific image" or "here's a new image to add to your catalog" so you can optimize those databases differently.
posted by vacapinta at 11:03 AM on February 4, 2008


Vacapinta's number 1/3 is actually a very excellent point that I should have mentioned also. When you build something like facebook/megasiteX you look at your bottlenecks.

The two major operations are insert/update and select.

To handle the bottleneck of selects you would do two things ( outside of sharding your data ) , you setup a master slavepool ( or in the case of flickr master->slavemaster->slavepool ) for all of your selects. Your code must be written to take advantage of this, but by using the slaves for all your selects, the master database is spared the "millions" of requests.

I code in Perl and instead of accessing the database connection directly, all queries are passed through a control object. The control object handles the connections instead of having to worry about optimizing it in the application itself.

The default connection is to the slave pool and only when an insert/update/delete is done is a connection even made to the (correct) master server. You can optimize this even more by coding it so that the LAST things your program does is talks to the master so you can not waste time in the master connection.

Vacapinta's number 1 is probably better thought of as transfering the work to the application instead of the database. For example, instead of a join, you can do 2 simple index based selects and let the application join the data. Or, if you had a "most recent comments" , you would have a field in your DB that you stored this piece and whenever a new comment is added , this field is used to return the value instead of selecing the last 5 comments.

Peter Azuolas
posted by petethered at 11:56 AM on February 4, 2008


Oh, a good example of pregenned data ( or atleast it was , I haven't looked in a while ) is Blogger.

You must/used to have to do a rebuild of your blog whenever you changed the layout. That implied that all the pages where pregenned so when it came time to show a page the generated one was served instead of building the page dynamically.
posted by petethered at 11:59 AM on February 4, 2008


... store important bits of data in RAM with memcached or similar.
surely isn't much overlap across their thousands of users

There doesn't need to be a lot of overlap. RAM is cheap. Let's make a guess. Say Facebook has 10 million users, each with 100 friends. Then that's 1 billion friend relationships. Let's say it takes 16 bytes to store a friend pair (2 4 byte IDs, another 8 bytes of overhead crap). That comes to 16 gigabytes of memory to store your entire social graph, or under $1000 of RAM.

Of course you need machines to serve the RAM and lots of software to keep data consistent and replication and blah blah blah. It's going to cost a lot more than $1000 to actually build it. But basically, RAM is really cheap and you can keep a lot of a big application in it. When you think that way it's no longer caching, your system is composed of in-memory object stores.
posted by Nelson at 12:16 PM on February 4, 2008


Thanks guys, I'm very grateful for all of that.
posted by bonaldi at 1:37 PM on February 4, 2008


There's no reason to do SQL queries at all, you can store everything in memory with data structures specific to your application. memcached would be a generalized example of this idea. I had a database backed website hosted on a slow machine and I used a custom memory structure to store results, which dropped page loading time from 700ms to so quick it couldn't be measured.
posted by delmoi at 2:01 PM on February 4, 2008


Parallelization of asymmetric database retrieval is cheap. You can literally throw commodity computers into a rack and build as needed. Lots of reads, fairly little writes and then everything becomes a rather technical issue of scaling.
posted by geoff. at 2:18 PM on February 4, 2008


And to add one small point, the system only has to cache active users' data, so that maybe the first page load is slow (perhaps the login page) but after that everything is really fast. So the in-memory caching isn't multiplied by total # of users, but the total # of active users, which at any moment is a lot less.
posted by GuyZero at 2:30 PM on February 4, 2008


I know nothing for sure about how Facebook does what it does, but there are a couple of big hints that the database partitioning it uses are network-based: Also, for people's home page with all of the most recent updates, time is another thing to partition it on. The dataset might be manageably huge if it is restricted to only the most recent 24 hours.
In the case of Flickr, they have separate image servers which dont get requests more complex than "get this specific image" or "here's a new image to add to your catalog" so you can optimize those databases differently.
Facebook is almost certainly doing this as well, as there used to be some sort of URL trick where if you looked for

http://facebook.com/get?user=george&album=01

you would get an error if you did not have permission to view the album, but if you typed

http://facebook.com/get?user=george&album=01&pic=01

you could see picture 01, even if it was private. This was patched in a hurry.
posted by Maxwell_Smart at 7:48 PM on February 4, 2008 [1 favorite]


As a minor aside; the recent updates thing (where you see what other people have done) appears to be generated in a sort of offline batch job. Often I will notice something while directly viewing someone's profile (e.g. Steve has added SomeLameApplication), only for it to show up on my news feed several hours later.
posted by jon4009 at 3:41 AM on February 5, 2008


"You must/used to have to do a rebuild of your blog whenever you changed the layout. That implied that all the pages where pregenned"

Blogger wasn't a hosted service from the beginning; they uploaded the entire site to your own host via FTP. When they introduced BlogSpot, they kept using the old code, so everything was static on that site too.

They switch to dynamic generation for BlogSpot when Google reengineered the whole platform:

http://googlesystem.blogspot.com/2006/08/new-blogger.html

(after all, if you have full control over every little piece of the web server, generating full HTML pages up front isn't necessarily more efficient than building them on the fly from prepared page fragments. you can do lots of stuff "on the way out".)
posted by effbot at 5:06 AM on February 5, 2008


"the recent updates thing (where you see what other people have done) appears to be generated in a sort of offline batch job"

Or just aggressively cached; all the typical facebook user cares about is that he/she gets information quickly, not that the information is 100% up to date. It's more important to display *something* than it is to provide a fully consistent view of the data in their databases:

http://www.mnot.net/blog/2007/12/12/stale

Or in other words, to build big stuff, it's all about figuring out the best way to cheat your way around Brewer's Conjecture for your specific application.
posted by effbot at 5:18 AM on February 5, 2008


Well, for one thing, their schema is simple as fuck. The banking institution I work for is at least an order of magnitude more complex, and most of our stuff is transactional, which means the queries can't be "close to exact". When you swipe your credit card, the credit limit has to be correct that very second. A Facebook query only has to be "basically" correct. It could take several seconds to cascade an update to all their servers, and it wouldn't make any difference to the end user.

Once you start getting into heavier objects, you quickly realize that all those hand-coded queries aren't going to cut it and you switch to ORMs -- object relational models. A popular open-source one is Hibernate. What an ORM lets you do is code with objects instead of queries--the queries are dynamically generated. So, instead of something like this:
select p.* from users u join roles r on u.role_id = r.id join privileges p on r.privilege_id = p.id where u.id=12345
...which will give you a List of Privileges for a particular user, you can do this:
User u = getUserDao().getById(12345);
List privileges = u.getRole().getPrivileges();
This works because you've created mapping files for each object that correlate to a table in the database. When you fetch the object, Hibernate creates the initial query for you, and lazy-loads any elements of the object graph that you don't necessarily need right away. That just means it keeps a placeholder for you until you try and access it, at which point it generates the SQL and makes the request.

The cool part about it is that with a normalized database, you can cache the ever-living shit out of some tables. In the above example, your User table probably gets updated all the time. But how often do new kinds of privileges get added? Hardly ever. So you can cache the results of any query that hits the privilege table in memory. The system basically holds a Map of name->value pairs, like "Privilege with an ID of 3 has a value of 'view credit card details'" or "Privilege ID 5 has a value of 'edit address details'" so whenever anyone joins to the privilege table, it doesn't have to do a database lookup. It already knows and substitutes the value in.

The awesome part about it is that the ORM system is smart enough to know that any updates to a particular table will invalidate the data and require another database fetch. But provided you're not updating things left and right, you can do huge queries (and subqueries) without ever touching the database--at least, after the first "touch" to fill the memory up. Most ORMs will allow caching to the disk, so if your power suddenly goes out and you have to restart one of the app servers, it can repopulate the cache without ever hitting the database.

Anyway, that's one of the fun things about my job so I tend to like to talk about it even when it's not completely relevant. In Facebook's case, they don't need any of the above fanciness 'cause their schema is stupid-simple. And how many rows are we talking about here? 15 or 20 milion rows in the Users table. You can split that up into a bunch of smaller tables easily. User ID 0 through 1,000,000 in one table, User 1,000,001 through 2,000,000 in the next. The schools table might have 100,000 rows. That's not big. A few indexes and an assload of replicated servers and you wouldn't ever feel it. And since you don't have to maintain any kind of transactional integrity, scaling is just a matter of throwing more iron at it. Color me unimpressed. :)

By means of comparison, Sprint's main database has 2.85 trillion rows. AT&T is close behind with almost 2 trillion. And you know a bunch of those tables are transactional. Think about that the next time you're impatiently waiting on hold for customer service because "their computers are running slow today."
posted by Civil_Disobedient at 7:57 PM on February 5, 2008 [9 favorites]


CREATE MATERIALIZED VIEW.

Note: MySQL can't do this, so open-source developers nearly always discuss workarounds. I expect Materialized Views to be the hottest design feature in Ruby on Rails 4.0, circa 2014.
posted by mosch at 7:58 PM on February 21, 2008


And I'm with Civil Disobedient. Facebook's database only seems large if you haven't worked with commercial data.

The solutions are complex not because the problems are hard, but because MySQL is a crappy database.
posted by mosch at 7:59 PM on February 21, 2008


« Older Is there unix-compatable alter...   |  I received a registration e-ma... Newer »
This thread is closed to new comments.