It's not you - my CPU says yes, but my database says no
November 16, 2007 6:54 PM

How many database reads/writes are okay per web page view? Also, what is better, big tables or lots of join tables?

I'm starting out in Web based development after years of C++, and this is the one topic I cannot seem to find a lot of information on.

When I have a database in the back end, I will need to read and write data to this database. Each page-open for each user will need to query the database. What is excessive, and what is good? Is 10 individual queries per page open too much? Is 100 too much? If I am only transferring small amounts of data, does it make a difference?

Let's say for example I have a user name, and I need to get some information about him whenever he logs in - say for example, his article view count. This changes every time he opens the page, and needs to be persisted in the database. Do I read and write to the database for every page open?

This is assuming no caching.

The second part of the question - is it better to have lots of tables storing small amounts of data which are linked to each other using relationship ids, or is it better to use single tables with lots of columns? Which is more efficient?

And finally, in this example, what is better practice:

If I have 10 items in a database, and I need to associate them with 10 other values from the same database, is it better to query the 1st item from the db, then query the matching item, and associate them? Or would it be better to query all 10 of the first items at once, then all 10 of the second at once, and use a double loop to associate them? The first is heavy on the database; the second is heavy on the CPU. Which is more scalable at large scale?
posted by markovich to Technology (13 answers total) 5 users marked this as a favorite
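For concreteness, the last part of the question can be sketched in code. This is a minimal, hypothetical example using SQLite with invented table names: option 1 is the query-per-item loop, option 2 is two batch queries associated in application code, and option 3 is the single JOIN that several answers in this thread end up recommending.

```python
import sqlite3

# Hypothetical tables: 10 articles, each with a matching stats row.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE stats (article_id INTEGER, views INTEGER);
""")
db.executemany("INSERT INTO articles VALUES (?, ?)",
               [(i, f"article {i}") for i in range(10)])
db.executemany("INSERT INTO stats VALUES (?, ?)",
               [(i, i * 100) for i in range(10)])

# Option 1: one query per item, then one per match -- 20 round trips.
pairs_loop = []
for (aid, title) in db.execute("SELECT id, title FROM articles"):
    views = db.execute("SELECT views FROM stats WHERE article_id = ?",
                       (aid,)).fetchone()[0]
    pairs_loop.append((title, views))

# Option 2: two batch queries, then associate in application code.
articles = dict(db.execute("SELECT id, title FROM articles"))
stats = dict(db.execute("SELECT article_id, views FROM stats"))
pairs_batch = [(articles[aid], stats[aid]) for aid in articles]

# Option 3: let the database do the association with a single JOIN.
pairs_join = list(db.execute("""
    SELECT a.title, s.views FROM articles a
    JOIN stats s ON s.article_id = a.id
"""))

assert sorted(pairs_loop) == sorted(pairs_batch) == sorted(pairs_join)
```

All three produce the same pairs; they differ in round trips and in where the matching work happens, which is exactly the trade-off the answers below debate.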
The problem with your question is that it's akin to "How long is a piece of string?"

It's all dependent. It depends on the server, on the kind of pages being created, on the expected traffic patterns, on the required performance, and on a lot of other things like that.

No one can possibly answer your question.
posted by Steven C. Den Beste at 7:23 PM on November 16, 2007

1) For me, more than 20 queries or so per page is the limit. Unless the job I'm doing requires more, in which case I'd do it and monitor performance. That might mean looking into caching options or replication later. I'm not going to let a rule of thumb dictate what features I support.

2) Normalize your data. If this means a lot of small tables and joins, then go for it. But don't create a bunch of tables just because you think one is getting too large! Unless you have years of DBA experience, your idea of a large table is probably pretty ridiculous.
posted by sbutler at 7:38 PM on November 16, 2007

The question about number of database queries is not a very good question, because the answer to it will be anecdotal and you won't learn anything truly useful. With that caveat:

On my blog, a typical page view takes 80 database queries to put together with no caching (and still gets under 1 second pageload times with moderate traffic). With judicious application of memcached for things that don't need to hit the database every single time, that's down to 5 queries per pageload. I've seen up to 200 queries per pageload for forum software, and I'm sure you could go higher. I would call 300 the upper bound of reasonable. If you need more than 300 queries to put together a single page for a person to read, you're probably giving them way too much information at once.

As a web programmer, I've never asked the question you're asking. It seems like putting the cart before the horse. Instead, I built what I wanted to build, then figured out ways to make the load lighter so I could serve more pageviews and users with the same hardware.

As for your final question: clearly, you should let the database do as much of the heavy lifting as possible. You're writing in PHP, Python, Ruby, or some other web language that's interpreted and slow. The database and its query analyzer are written in tight, fast, close-to-the-machine C. Which one do you trust to perform well? I know I trust the database more. If the workload I'm throwing at the database is too much for it, time to tune the database. If that's not enough, start caching. Do you really need every single thing to be pulled live from the database and cached? I doubt it.

Write it naively first, then figure out what's slow, tune, cache, and optimize. Trying to outsmart the database engine's programmers is a waste of time.
posted by evariste at 7:45 PM on November 16, 2007
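The memcached approach evariste describes is usually called cache-aside: check the cache first and only hit the database on a miss. A minimal sketch, with a plain dict standing in for memcached and an invented `users` table:

```python
import sqlite3

# A dict stands in for memcached here; a real client exposes the
# same get/set shape.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (name TEXT PRIMARY KEY, views INTEGER)")
db.execute("INSERT INTO users VALUES ('markovich', 42)")

cache = {}
db_hits = 0

def view_count(name):
    """Return a user's view count, hitting the database only on a miss."""
    global db_hits
    if name in cache:                      # cache hit: no query at all
        return cache[name]
    db_hits += 1
    row = db.execute("SELECT views FROM users WHERE name = ?",
                     (name,)).fetchone()
    cache[name] = row[0]                   # populate for the next pageview
    return row[0]

assert view_count("markovich") == 42   # first call queries the database
assert view_count("markovich") == 42   # second call is served from cache
assert db_hits == 1
```

Two pageviews, one query: that ratio only improves with traffic, which is how 80 queries per pageload becomes 5.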

One thing I've found in my experience is that sometimes a lot of small queries are better than one big query.

The most pronounced example of this I came across was a monster query I wrote to get all the information from the database to display the index page for a forum. This query had a UNION, about a dozen JOINs, and a lot of fields, and while it worked when the database was small, it quickly bogged down. I broke it up into three smaller queries and the speedup was incredible. It also had the side effect of making things clearer, because it's really difficult to diagnose problems when you have a query like that.

As far as what's better, one big table or lots of small foreign-keyed tables, the answer depends on what you mean by better. From a design perspective, the normalized design, where things are split up into separate tables, wins, but you end up having to do more queries or more joins to get at all the data you might need for something. The upside is that it's easier to update some data, because you only need to update a single record to have it changed for everything that uses it. The monolithic table has a lot of duplication, so updates take longer and the database itself is less space-efficient. It's easier to get data from, though, and it's easier to enter/change stuff by hand.
posted by Godbert at 7:46 PM on November 16, 2007
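Godbert's update-versus-read trade-off is easy to demonstrate. A hypothetical sketch in SQLite (the schema is invented): renaming an author touches one row in the normalized design but every duplicated row in the flat one, while reading costs the normalized design a JOIN.

```python
import sqlite3

db = sqlite3.connect(":memory:")

# Normalized: the author's name lives in exactly one row.
db.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE posts (id INTEGER PRIMARY KEY, author_id INTEGER, body TEXT);
""")
db.execute("INSERT INTO authors VALUES (1, 'evariste')")
db.executemany("INSERT INTO posts VALUES (?, 1, ?)",
               [(i, f"post {i}") for i in range(100)])

# Denormalized: the name is copied into every post row.
db.execute("CREATE TABLE flat_posts (id INTEGER PRIMARY KEY, "
           "author_name TEXT, body TEXT)")
db.executemany("INSERT INTO flat_posts VALUES (?, 'evariste', ?)",
               [(i, f"post {i}") for i in range(100)])

# Renaming the author: one row versus a hundred.
normalized = db.execute("UPDATE authors SET name = 'galois' WHERE id = 1")
flat = db.execute("UPDATE flat_posts SET author_name = 'galois' "
                  "WHERE author_name = 'evariste'")
assert normalized.rowcount == 1
assert flat.rowcount == 100

# Reading, though, costs a JOIN in the normalized design.
row = db.execute("""SELECT a.name, p.body FROM posts p
                    JOIN authors a ON a.id = p.author_id
                    WHERE p.id = 5""").fetchone()
assert row == ('galois', 'post 5')
```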

Oh yeah - you asked whether big tables are better than lots of joins. The answer is "it depends". Lots of joins can kill your performance, so a lot of people practice what's called "selective denormalization", where you deliberately violate Third Normal Form in order to squeeze more performance out. You don't need to design this in from the start. Instead, you need to figure out when it's hurting you and then start selectively denormalizing.

Most people should design their database schema as if the world were perfect and joins were free. Chances are, for most workloads, sticking with a properly normalized database is the right call. If you're operating on the scale of Amazon or eBay, I'll listen respectfully to why you need to denormalize.
posted by evariste at 7:51 PM on November 16, 2007 [1 favorite]

Do you really need every single thing to be pulled live from the database and cached? I doubt it.
I boned that sentence. Should proofread more. Here's what I meant to say:
Do you really need every single thing to be pulled live from the database? I doubt it.
posted by evariste at 7:54 PM on November 16, 2007

By the way, your database probably gives you lots of tools that help you figure out what's slow. For instance, in MySQL, you can turn on the slow query log, then run a script to figure out which of your queries you need to pay the most attention to (because they wasted the most time, and happened the most frequently). Then you can ask the mysql monitor to EXPLAIN a query, and figure out which indexes the query analyzer thinks it should use for that query, or whether it will use an index at all. If that looks wrong, then maybe you need to put new indexes on your table. Benchmark, tweak, benchmark, tweak, until you get to a state where you're very happy with the performance you're getting. You should also check out mysqlreport for more tips on ways to optimize MySQL.
posted by evariste at 8:03 PM on November 16, 2007 [3 favorites]
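evariste is describing MySQL's slow query log and EXPLAIN, but the diagnose-then-index loop is the same everywhere. A small sketch of it using SQLite's EXPLAIN QUERY PLAN (the table is invented; MySQL's EXPLAIN output looks different but answers the same question):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE hits (page TEXT, ts INTEGER)")
db.executemany("INSERT INTO hits VALUES (?, ?)",
               [(f"/page/{i % 50}", i) for i in range(1000)])

def plan(sql):
    """Return the planner's description of how it will run `sql`."""
    return " ".join(r[-1] for r in db.execute("EXPLAIN QUERY PLAN " + sql))

# Without an index, the planner has to scan the whole table...
before = plan("SELECT * FROM hits WHERE page = '/page/7'")
assert "SCAN" in before

# ...after adding one, it switches to an index search.
db.execute("CREATE INDEX idx_hits_page ON hits (page)")
after = plan("SELECT * FROM hits WHERE page = '/page/7'")
assert "USING INDEX" in after
```

Reading the plan before and after each tweak is the "benchmark, tweak, benchmark, tweak" loop in miniature.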

The guys who write databases are probably smarter about writing databases than you are, and build in all sorts of tricks and optimizations to make them fast. Trust them, and go with a well-designed, normalized data model unless your performance requirements are insane - and if they are, you shouldn't be tackling this as your first web dev project.

Also, spend a lot of time investigating and understanding caching. Tuning the DB to use more RAM is usually fairly simple, adding indexes is not complex compared to coding in C++, and (depending on your language, platform, and application) the vast majority of database reads can be skipped and pulled from an intelligent caching layer in your application.
posted by jenkinsEar at 8:23 PM on November 16, 2007

You're prematurely optimizing like a madman. Write your app, write it to do what you need it to do. If it turns out you're not getting the performance you (think) you need, then you can look into making it fast. In general though, you can cache such a huge percentage of most normal hits that the number of queries is a bad metric to look at.
posted by Skorgu at 8:50 PM on November 16, 2007

Premature optimization is bad; having a feel for how the major underpinnings of your application work is good.

I can't find my copy to check for sure, but as I recall, Building Scalable Websites has good background on the things you are worrying about, when to worry about them, and approaches for dealing with them at the appropriate time.
posted by Good Brain at 10:07 PM on November 16, 2007 [1 favorite]

1. The number of queries is a poor concept to optimize around. Efficiency and caching are much more important.

2. Yes, you should read and write to the DB for this. If you are using a well-indexed table, these queries will be fast and efficient.

3. Databases are made for creating large, indexed data sets and getting data in and out of them quickly. I have tables with millions and millions of rows, and getting data in and out is mighty quick because they are well indexed and my queries are optimized.

And yes: denormalize only when it becomes radically apparent that it's the only cost-effective solution (by this I mean, when you start drooling at ridiculous solid state storage devices and think: this could solve all my problems!).

4. If you're really talking about large-scale scalability, you need to consider the size of the objects you're dealing with, the cost of instantiating the application, how to cluster it across multiple hosts, and a whole host of things that are probably more important than the number of queries per page.

For instance, it might be mighty quick to pull a whole table out of the db, and much slower to do a more specific query that returns the 10 rows you need. However, when there are thousands of requests per second, the fast full-table query will max your memory and kill you, while the slow, 10-row query will chug along just fine, taking more time per request but with much more throughput, because your application doesn't have to deal with thousands of rows, thousands of times every second, eating up all your RAM.

Your question is about where to move the complexity: into the application or the DB. I'd choose the DB, but that might not make sense for every application, or for queries that won't be cached by the DB.

What it comes down to is repeating what many other people have said upthread:

trust the DB, it was built by people far smarter than you to solve problems much harder than yours.

Build the application you need now, and optimize it when you have to because then you'll know what needs to be optimized.

Also, I must second the book "Building Scalable Websites". It's fantastic and one of the major reasons my servers aren't smoldering piles of lead right now.
posted by Freen at 12:22 AM on November 17, 2007
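Freen's full-table-versus-10-rows point can be sketched directly. A hypothetical example in SQLite: both approaches end up serving the same page, but one holds 10,000 rows per request in application memory while the other holds 10.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, body TEXT)")
db.executemany("INSERT INTO articles VALUES (?, ?)",
               [(i, "x" * 100) for i in range(10000)])

# "Fast" but memory-hungry: every request drags all 10,000 rows into RAM.
everything = db.execute("SELECT * FROM articles ORDER BY id").fetchall()
page = everything[:10]

# Slower per request on paper, but each request holds only 10 rows,
# so thousands of concurrent requests stay cheap.
page_direct = db.execute(
    "SELECT * FROM articles ORDER BY id LIMIT 10").fetchall()

assert page == page_direct
```

Per-request latency and aggregate throughput are different metrics; under concurrency, the narrow query wins.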

"... Let's say for example I have a user name, and I need to get some information about him whenever he logs in - say for example, his article view count. ..."

That's the kind of thing you'd commonly store in a cookie, and just play back to the user anytime you could read his cookie. You'd probably store it in your database too, as a procedural fall-through in case he was logging in from another machine which didn't have your cookies, but anytime you found valid cookies, you'd not need to hit your database, period.

Anything you can do to manage state in conjunction with the user's machine spreads the load from your server, to your clients. Client side scripting, if you can use it, can be a Very Good Thing.
posted by paulsc at 3:04 AM on November 17, 2007
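One caveat on paulsc's suggestion: client-side state is only safe if the client can't forge it, so the cookie value should be signed. A minimal sketch using Python's `hmac`; the secret and cookie layout here are hypothetical.

```python
import hashlib
import hmac

SECRET = b"server-side-secret"   # hypothetical; keep this on the server

def make_cookie(view_count):
    """Pack a view count with an HMAC so tampering is detectable."""
    payload = str(view_count).encode()
    sig = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return payload.decode() + "|" + sig

def read_cookie(cookie):
    """Return the count if the signature checks out, else None
    (fall through to the database)."""
    payload, _, sig = cookie.partition("|")
    expected = hmac.new(SECRET, payload.encode(),
                        hashlib.sha256).hexdigest()
    if hmac.compare_digest(sig, expected):
        return int(payload)
    return None      # invalid/forged cookie: hit the database instead

cookie = make_cookie(42)
assert read_cookie(cookie) == 42             # valid: no query needed
assert read_cookie("9999|deadbeef") is None  # tampered: fall back to DB
```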

Most people should design their database schema as if the world were perfect and joins were free.

Just calling this out because I agree 100%. As suggested above, you're going about this backwards and prematurely optimizing for a problem you don't yet have.

I don't know whether I would store an article view count in a cookie or not (it seems like more trouble than it's worth, depending on how often it changes), but those are questions to ask once you run into performance problems, not before. If you do hit performance problems, there's a whole host of things you can do before you start flattening tables (caching on the web server, improving table indices, looking for bottleneck queries, etc.). And if the normalization of the tables becomes confusing or hard to follow, you can always throw the relevant portions of the tables into views.
posted by yerfatma at 6:31 AM on November 17, 2007 [1 favorite]
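yerfatma's last point is worth a sketch: a view lets the database package up a confusing join once, so application code reads it like a simple table. A hypothetical example in SQLite with invented table names:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE article_views (user_id INTEGER, article_id INTEGER);
    INSERT INTO users VALUES (1, 'markovich');
    INSERT INTO article_views VALUES (1, 10), (1, 11), (1, 12);

    -- The view hides the join + aggregate inside the database.
    CREATE VIEW user_view_counts AS
        SELECT u.name, COUNT(*) AS views
        FROM users u JOIN article_views av ON av.user_id = u.id
        GROUP BY u.id;
""")

# Application code now reads a simple "table".
row = db.execute("SELECT views FROM user_view_counts "
                 "WHERE name = 'markovich'").fetchone()
assert row == (3,)
```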

This thread is closed to new comments.