Load Testing the Hard Disk on a Win2k Server.
March 28, 2007 7:22 AM   Subscribe

Are there any utilities for windows that continuously read/write to a hard disk?

I've got a server which seems to be having problems reading & writing to disk. It freezes randomly. Anyway, I'm looking for a utility that I can run in multiple Remote Desktop sessions to constantly read and write to disk. Has anyone got suggestions for software that'll do this kind of load testing?
posted by seanyboy to Computers & Internet (15 answers total)
 
You could probably write a batch file that copies a few files in an infinite loop.
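Along those lines, here's a minimal sketch of the same idea in Python instead of batch (assuming Python is installed on the box; the scratch filename and block size are made up -- point SCRATCH at the suspect drive):

```python
import os

SCRATCH = "stress.tmp"          # hypothetical path on the suspect drive
BLOCK = b"x" * (1024 * 1024)    # 1 MB written per pass

def hammer(passes):
    # Repeatedly write, flush to disk, and read back a scratch file.
    for _ in range(passes):
        with open(SCRATCH, "wb") as f:
            f.write(BLOCK)
            f.flush()
            os.fsync(f.fileno())  # force the data out of the OS cache
        with open(SCRATCH, "rb") as f:
            assert f.read() == BLOCK  # read it back and verify
    os.remove(SCRATCH)

if __name__ == "__main__":
    hammer(50)  # raise this (or wrap in `while True`) for a real soak test
```

Run one copy per Remote Desktop session, each with its own scratch path, to get the concurrent load you're after.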
posted by rfs at 7:32 AM on March 28, 2007


Prime95 is often used by overclockers for stress-testing CPUs; not sure it will help in this case, though.
posted by complience at 7:34 AM on March 28, 2007


You could run a filesystem benchmark tool like Iozone.
posted by jaimev at 7:39 AM on March 28, 2007


You say it freezes randomly, which I take to mean it does so even without extreme stress. If that's the case, load testing probably won't find your problem. That said, Mercury's LoadRunner is a good choice, and they have a 10-day free trial. Is this a home-based server or for a company/business?
posted by sluglicker at 8:39 AM on March 28, 2007


Take a look here and see what strikes you. I don't have time to check them all, but if you really want to diagnose your drive, you'll need something to check its SMART status.

If you just want to put the drive through the wringer, look for something that does "butterfly seeks" -- a test that reads the very first sector of the drive, then the very last sector, then the second sector, then the second to last sector and so on.
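If you can't find a ready-made tool, the access pattern itself is simple enough to sketch. This is hypothetical Python, not one of those tools: a real butterfly test reads the raw device, while this version just seeks back and forth over an ordinary file:

```python
import os

def butterfly_offsets(num_sectors):
    """Yield sector indices in butterfly order: first, last, second,
    second-to-last, ... forcing maximal head travel on each seek."""
    lo, hi = 0, num_sectors - 1
    while lo < hi:
        yield lo
        yield hi
        lo += 1
        hi -= 1
    if lo == hi:
        yield lo  # odd sector count leaves one middle sector

def butterfly_read(path, sector=512):
    # Read a file one sector at a time in butterfly order.
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        for idx in butterfly_offsets(size // sector):
            f.seek(idx * sector)
            f.read(sector)
```

On a big file spanning much of the disk, this defeats most caching and keeps the heads sweeping across the platter.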
posted by boo_radley at 9:06 AM on March 28, 2007


Seanyboy, I'd like to suggest a different approach.

While load testing has its place for sizing/capacity planning, diagnosing a bottleneck might be better served by taking perfmon traces of the offending system as the problem occurs and reviewing the data. Instead of merely simulating one theory of the problem, you can observe the actual problem and take corrective action on that basis.

On W2K, you can take a comprehensive performance trace locally or over the network. If you believe that network I/O is contributing to the bottleneck, you'd obviously want to take the trace locally, so as not to add overhead that skews the data. Similarly, you'd want to save the binary logs to some drive other than the suspect one, so as not to add overhead there (since that's where you believe the problem resides).

Run "perfmon", right click a blank area of the pane on the right, and choose "New Log Settings".

Set a descriptive name for the log, and then on the general tab choose "add objects". Add all counters from the following objects:

-Memory
-Physical Disk
-Processor
-Process
-Thread
-Network interface

and take the trace for a one- or two-hour run. If it's a one-hour run, use a one-second data interval; if a two-hour run, use two-second intervals (either way that's about 3,600 samples, which keeps the log manageable). Click on the log files tab, specify a location on something other than your data drives, and save it as a binary file.

Start the trace by clicking the log under "name" on the right side of the screen and selecting start. Let it run for the period you've decided on (not more than two hours, and while you think the problem is likely to occur), then use the same routine to stop it.

Once you have the traces, research the problem by viewing individual counters to determine where the bottleneck is occurring. If others will review the traces, it's also a good idea to save the system information file along with them: run "msinfo32" and save the NFO file. Also include the number and type of disks, as well as the RAID level if you have one.

I'll try to help with this if you need it. E-mail is (my screenname) at gmail dot com, and we can make arrangements to review the traces. (They're pretty sizable; zipping the .blg files together with the NFO compresses them nicely, and FTP works best for sharing them.)
posted by edverb at 9:07 AM on March 28, 2007


Oh, off the top of my head, Speedfan will give you SMART data and HDD temp (may or may not be useful here).
posted by boo_radley at 9:08 AM on March 28, 2007


SiSoft's Sandra has a burn-in mode and can run benchmarks on your disk. It's free for home use; not sure if it will install on 2k Server. As someone wrote above, you really want SMART data after the initial burn-in.
posted by damn dirty ape at 9:08 AM on March 28, 2007


Also, troubleshooting random freezing should focus on memory (run a memory tester), device drivers, the power supply, the video card, etc., not just the disk.
posted by damn dirty ape at 9:10 AM on March 28, 2007


Response by poster: Thanks people.
perfmon definitely shows excessive disk usage just before and just after a freeze which is why I'm concentrating on that aspect. While the machine freezes, perfmon doesn't record anything. I just get big wholes in the log.

edverb: I may hold you to that... :)
posted by seanyboy at 10:07 AM on March 28, 2007


Response by poster: wholes = holes. Duh!

Speedfan won't load.

Running multiple Iozone sessions doesn't seem to be causing any freezing. Which is strange. I'll run a memory test on it, see if that makes a difference.
posted by seanyboy at 10:09 AM on March 28, 2007


Seanyboy, I'd like to ask for some clarifying points.

- What is the main purpose of this server?
- When you say "disk usage", what is happening specifically? Reads? Writes? Large files read sequentially, or random DB lookups, etc? Is the disk queue filling up? Is the CPU thrashing? Excessive paging?
- What type of storage is being used? Direct attached? Is there a RAID?
posted by edverb at 11:45 AM on March 28, 2007


IANASA. But, I asked the systems admin here at work about your perfmon story (the "not recording anything" bit). Two things came out of that: (1) this is extraordinarily weird, and probably very bad. (2) if you're lucky, there's a low level process (like a service) dying in a horrible fashion, resulting in that sort of gap in perfmon. There was some conjecture about the kernel having to switch execution mode that could cause this to happen. The money quote: "If that were mine, I'd back up data and get it serviced". Good luck, man.
posted by boo_radley at 2:57 PM on March 28, 2007


Response by poster: - What is the main purpose of this server?
Terminal server for a database application. The database is direct-access, i.e. not client/server.

- When you say "disk usage", what is happening specifically? Reads? Writes? Large files read sequentially, or random DB lookups, etc? Is the disk queue filling up? Is the CPU thrashing? Excessive paging?
Lots of reads and quite a few writes across a large number of files, plus a large number of sessions (up to 60).
I estimate that up to 30,000 files may be open at peak times, but we've seen the same behaviour with only 1,000 or so files open.
File Access is Random DB access.
Don't know if the Disk Queue is filling up.
The CPU isn't thrashing.
There's no paging at all. (We have about 8GB of RAM in the machine.)

- What type of storage is being used? Direct attached? Is there a RAID?
Onboard RAID onto SATA. We're using an intel motherboard with an LSI Logic chipset.

re: Performance monitor. We're also getting this beauty in the event log. Note, these messages do not match the timeouts.
The timeout waiting for the performance data collection function (function name) to finish has expired. There may be a problem with that extensible counter or the service from which it is collecting data.

The annoying thing is that we got this setup to work perfectly on a dramatically less powerful machine.
posted by seanyboy at 3:40 PM on March 28, 2007


A few things...generally speaking, terminal servers aren't well suited to direct-attached storage -- you'd rather place the processing burden of 60 sessions on the terminal server hardware, and the disk/database demands on a separate machine. But like you said, if the same config was cruising on a much less powerful machine, maybe that's not an issue.

What RAID level is being used? How many disks? Any chance you could add disks to the array? Rule of thumb for better performance: the more IOPS you need, the more spindles you want.

Three...you can check the disk queue length in Perfmon via Physical Disk: Avg. Disk Queue Length. A rule of thumb here is that if the average is 2-3x the number of disks, that's a bottleneck -- requests are queuing up faster than the disks can service them. (I don't know what DB you're using, though...there are caveats depending on the DB; for instance, SQL 2005 hides this from perfmon, and there's a SQL query to get the true physical disk queue length.)
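To make that rule of thumb concrete, here's a trivial check you could run against values read off the trace (the 2-per-spindle threshold is just the heuristic above, not a hard limit):

```python
def queue_looks_saturated(avg_queue_length, num_disks, per_disk=2):
    """Rule-of-thumb check: a sustained Avg. Disk Queue Length above
    ~2-3 outstanding I/Os per spindle suggests a disk bottleneck."""
    return avg_queue_length > num_disks * per_disk

# e.g. a 4-disk array sustaining an average queue of 11 is over the line
```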

You may also want to check the stripe size against the average bytes per transfer. You'd want the stripe size to be larger than that figure; otherwise there's a performance hit as the disk performs multiple I/Os to satisfy a single request (for example, 64 KB writes against 8 KB stripes).
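The penalty is easy to quantify: each transfer gets split into roughly one physical I/O per stripe it touches. Just arithmetic, sketched out:

```python
import math

def ios_per_request(transfer_bytes, stripe_bytes):
    # A transfer larger than the stripe is split across stripes,
    # costing one I/O per stripe touched (ignoring alignment effects).
    return math.ceil(transfer_bytes / stripe_bytes)

# The example above: 64 KB writes against an 8 KB stripe turn
# every single logical write into 8 physical I/Os.
```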

Anything else of value in the event logs?

As for getting an accurate perfmon snapshot of the problem, you can increase the timeout value for data collection in perfmon by changing a value in the registry; see the entry under "collect timeout" here. Might help, might not.

Just throwing this stuff out there hoping something clicks for you...it's hard to say without more info.
posted by edverb at 4:25 PM on March 28, 2007

