Help me price a server
March 12, 2008 12:47 PM   Subscribe

Help me price out a server. I have a 10GB dataset from a company that I want to run SQL queries against. Right now on my MacBook they're taking ~12 hours to complete. Faster hardware is definitely needed!

The dataset is about 10GB. I'm using MySQL right now with a couple indexes per table (where it makes sense). Not much network access is needed, just enough for a couple people to connect and run the queries/view the results. My department has asked me to price out a server, and I feel comfortable asking in the $2000-$4000 range.

RAM? HD? Processor? Our department mostly deals with Dell.

For those that care, I'm doing some data modeling for a large company, but don't have access to their servers directly. They just send me the dump files.
posted by sbutler to Computers & Internet (15 answers total) 4 users marked this as a favorite
 
If you're doing data modeling, do you really need SQL? Assuming you're touching nearly every row in the db, depending on your skills, it might make more sense to work with the data dump directly rather than pull it into a db. When doing data mining, it's often many orders of magnitude faster to work with flat files on disk than to work with the same data in a database. As for that analysis that's taking 12 hrs on your MacBook: I wouldn't be surprised if you could do it in 30 minutes in Perl (really).

But to answer your hardware question, with the db route, RAM is the most important thing here.

I'd recommend starting with a motherboard that can hold sufficient RAM and building the system around it from there. In this scenario, I'd start with an ASUS motherboard that can take 8 (or maybe 16) GB of RAM, and get a white-box computer with said mobo (don't bother with Dell). Max out the RAM and put in whatever hard drive floats your boat (get two and dedicate one to swap if you want to get fancy).

It sounds like you don't need a server, so much as something you and a few co-workers can run your analysis on. If so, a desktop machine can easily serve your needs - no need to go rackmount.
posted by zippy at 1:11 PM on March 12, 2008


Seconding zippy. 10GB isn't really a large dataset. If you get a 64-bit processor with >10 GB of RAM, you can hold the entire file in memory and then basically do what you want. Depending on the data you have, something like IDL might be more appropriate than a database. I've seen IDL used on some pretty massive mixed-type datasets successfully before.
posted by devilsbrigade at 1:30 PM on March 12, 2008


Don't discount a vendor like Dell for enterprise use by any means. The quality of the components and their service for replacements is extremely important. The time you save by having them build it, rather than fiddling with building it all yourself, goes without saying as well.

What are the exact specs of your Macbook? Knowing that would help spec what you need.
posted by joshgray at 1:33 PM on March 12, 2008


Also, if you go the SQL route, look at your queries very carefully. The 'EXPLAIN' command is your friend -- unnecessary joins/etc could really be hammering you.
posted by devilsbrigade at 1:33 PM on March 12, 2008


Looks like $3724 will get you

PowerEdge 2900 III
dual Quad Core Intel® Xeon® X5450
12GB 667MHz (12x1GB), Dual Ranked DIMMs
Windows Server® 2003 R2, Standard x64 Edition with SP2, includes 5 CALs
2 x 250GB 7.2K RPM Serial ATA 3Gbps 3.5-in HotPlug Hard Drive, in RAID 1
posted by joshgray at 1:37 PM on March 12, 2008


Agreed on tuning both your SQL database and the queries themselves - you could get a ton of gains just right there. (I just like playing with hardware, personally :)
posted by joshgray at 1:40 PM on March 12, 2008


joshgray beat me to pricing a system. We priced essentially the same box, actually, though I had 16GB RAM, no Windows, and 4x147GB 15K RPM SCSI in RAID 1+0.

Unless there's a pressing reason to get a headless rack server, getting a high performance tower will probably be easier and cheaper for you in the long run. Having direct terminal access is highly underrated imo.
posted by devilsbrigade at 1:49 PM on March 12, 2008


I think putting that $3000 toward someone with more database experience will get your query time down much further than throwing a little more processing power at the job.

How many rows are we talking about here? Across how many tables? Are you just using the data dump as is, or are you renormalising it? Guaranteed you'll be able to cut off a large portion of that query execution time if you put more time into what you're doing with the data.
posted by Static Vagabond at 2:03 PM on March 12, 2008


I’m just going to add another voice to suggest trying to further tune your queries. If your data is fragmented, you may be suffering very bad cache usage at every stage from the filesystem and RAM to the RDBMS. PostgreSQL has an index clustering feature that keeps similar rows physically local. MySQL doesn’t have this feature for indexes other than the primary, that I’m aware of, but you can simulate it by inserting the records in order. I have seen a 10-fold increase in MySQL query performance after rebuilding a table with clustered records. See here for more technical analysis of data fragmentation: http://www.targeted.org/articles/databases/fragmentation.html
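
For reference, the PostgreSQL feature I mean is its CLUSTER command. A rough sketch, with a hypothetical index and table name:

cluster orders_type_idx on orders;  -- PostgreSQL: physically reorders "orders" to match that index

MySQL has no direct equivalent for secondary indexes, hence the ordered-insert trick described above.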
posted by breaks the guidelines? at 2:13 PM on March 12, 2008


Response by poster: widgets table: 550,000
orders table: 3,000,000 - 20,000,000; average 10,000,000

There are 8 order tables, one for each location. Most queries involve selecting specific orders (based on type) from an order table and then joining the widgets table.

This is part of a senior project involving an outside company. Unfortunately, all the CS people our group knows are busy with projects of their own.
posted by sbutler at 2:14 PM on March 12, 2008


Is this maybe an application for the Amazon Elastic Compute Cloud? It sounds like a good match for retro timeslice computing resources.

Also, you know about OLAP, right? Besides the commercial apps listed in the Wikipedia article there's Mondrian, an open source app.
posted by XMLicious at 2:21 PM on March 12, 2008


The table I was talking about comprises over 20 million rows, and weighs around 3 GB. A typical query executes in less than 10 seconds. It’s very likely that the records in your orders table were inserted in chronological order, and are therefore ordered that way on the disk. If your major selection criterion is a type column in the orders table, rebuild the table by inserting the records in the order of that column, and you will probably see a big performance boost, because the records having the same type will be stored contiguously on the disk, which means that fewer head seeks need to be done and fewer pages need to be read into memory.
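
A rough sketch of that rebuild in MySQL (the table and column names here are guesses based on this thread, so adapt them to your schema):

create table orders_clustered like orders;
insert into orders_clustered select * from orders order by type, order_id;
rename table orders to orders_old, orders_clustered to orders;

The copy ends up with all the rows of a given type stored next to each other on disk, which is what makes the selective queries cheap.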
posted by breaks the guidelines? at 2:32 PM on March 12, 2008


Honestly, you could probably put all that data into a copy of SQL Server 2005 (if you want to try it out, SQL2005 Express is free, but limited to a 4GB DB) and run queries against it before upgrading to better hardware. There's no reason why any reasonably competent DB guy couldn't make your queries and database work with average desktop hardware.
posted by blue_beetle at 2:33 PM on March 12, 2008


Seconding EC2. $0.40/hour gets you a very zippy dual-processor AMD system with 7GB of RAM. $0.80/hour gets you 15GB and 4 CPUs.

It's a real machine. I would recommend connecting over SSH, and you can run whatever runs on x86_64.
posted by yellowbkpk at 6:33 PM on March 12, 2008


Oh dear lord. You don't need a new server, you need some decent indices.

As devilsbrigade wrote "...look at your queries very carefully. The 'EXPLAIN' command is your friend -- unnecessary joins/etc could really be hammering you."

sbutler writes "Most queries involve selecting specific orders (based on type) from an order table and then joining the widgets table."

You need a composite index on (type, order_id, widget_id). This means running the following DDL:
create index order_type_id_widget_id on `order`(type, order_id, widget_id);

You'll have to do this for each of your eight order tables.

Possibly (you haven't shown us your DDL, so we can't know for sure) you'll need to add a few other indices.

How to figure out what indices you need: as devilsbrigade advised, for each query, do an explain. This just means typing your select, then prepending "explain ".
Select statement: select foo from bar where baz = 3;
Explain: explain select foo from bar where baz = 3;
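
For your particular query shape it would look something like this (the table and column names are my guesses from this thread, so substitute the real ones):

explain select w.* from `order` o join widgets w on w.widget_id = o.widget_id where o.type = 'some_type';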

This gives you a 10-column result set that shows what MySQL has to do to get your result.

In the explain result, every time you see that an index is not being used even though you're providing a where clause (that is, the "key" column of the explain result is NULL), add the appropriate index. Whenever the explain plan does not show an index in its "key" column or "Using index" in its "Extra" column, you will gain by adding one. (If "Extra" shows "Using where" you're halfway there; add an index on the columns in the where clause if you find the query is still too slow.)

Again, the index you most likely need (on each of the eight tables) is probably the one I showed above (modified by using the actual column names, of course).

You have only 20 million rows, at most. MySQL doesn't really begin degrading in performance before 30 million rows (unless your rows are really long; and if that's the problem, you can fix it with a combination of indices and table/column optimizations).

Again, you don't need a new server, you just need appropriate indices and possibly some normalizations of your data. Email me if you have more questions.
posted by orthogonality at 11:57 PM on March 12, 2008 [1 favorite]

