My Kingdom for a DBA
January 31, 2007 5:18 PM

Databasefilter: I'm a self-taught 'DBA' (I use quotes so as not to offend those who have this knowledge through a paid education). I'm having an efficiency problem with a few of my tables, and I thought maybe one or two of you could help.

I've got a table that looks like this in SQL 2000:

UnitValue (
dts smalldatetime,
fund_id int,
price decimal(10,4)
)


Here's the problem: For the last 10 years or so, we've been tracking anywhere up to 150 fund values on a daily basis. The table size is now unwieldy, and indexes don't seem to help me.

A thought presented itself to me today, but I'm having some trouble figuring out how I'd use it. I thought, perhaps, that I'd structure the table in this manner:

UnitValue (
dts smalldatetime,
fundAname decimal(10,4),
fundBname decimal(10,4),
....
)

...where each column name relates to a particular fund. Then, I'd create indexes for each column relating to a date.

I'm thinking this would greatly reduce the size of my table, and the indexes would do a great deal in reducing query time.

My question is: How do I query against this table? Take, for example, the fact that participants in specific plans that I'm managing could have 1 to X number of funds, so to acquire their total balance for any given day, I have to join against the UnitValue table to find their particular mix of fund values for that day.

I could use "dynamic SQL", I suppose (by building the query programmatically in a large string variable) and then evaluate the SQL code after I've built it up, but wouldn't that offset any performance gains I'm getting from structuring the table in this manner? I was under the assumption that a stored procedure couldn't optimize queries built in such a manner. Besides, coding that way is hell on my inner debugger.

Is there a way to join against the SQL database tables in a generic manner to figure out to which columns I need to pay attention for any given participant?
posted by thanotopsis to Computers & Internet (20 answers total)
 
Your idea is clever, but ultimately bad. Because each row in your giant matrix table will be really, really big, your query will likely take longer, and building the SQL, as you note, will be an error-prone pain, especially when you add a new fund.

However, there's a similar technique that may help: partitioning the table. Basically the idea is to take your one large table, and split it into several tables. One might hold only dates for 2000, the next for 2001, etc. I wouldn't do this manually unless you have no other choice, but your database may have the ability to transparently do it for you.
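For what it's worth, the SQL Server 2000 way to get "transparent" partitioning is a partitioned view: one member table per year, each with a CHECK constraint on the date column, UNION ALLed together so the optimizer can skip the years a query can't touch. A rough sketch of the idea (table names and date ranges here are made up, not a drop-in script):

create table UnitValue_2000 (
dts smalldatetime not null check (dts >= '20000101' and dts < '20010101'),
fund_id int not null,
price decimal(10,4) not null,
primary key (dts, fund_id)
)
-- ...and so on, one table per year, each with its own date-range check

create view UnitValue_All as
select dts, fund_id, price from UnitValue_2000
union all
select dts, fund_id, price from UnitValue_2001
-- ...one branch per member table

Queries then go against UnitValue_All with a date predicate, and only the member tables whose CHECK constraints match get read.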

But before you do that, let's examine the indices you're actually using, and the queries you're actually making. I suspect that we could come up with better indices and speed up your queries significantly.

Please post the queries you're making, the showplan for them, and the indices on those tables for further assistance.
posted by orthogonality at 5:36 PM on January 31, 2007


I would strongly recommend against giving each fund its own column. I think you'd quickly find that administration would be awful, and I'm pretty sure it wouldn't help you, since you'd be making your table so wide.

With that being said, you should figure out where your performance bottleneck is. Have you checked the fragmentation of the indexes (using DBCC SHOWCONTIG and looking at scan density), or tried updating the statistics? If your estimated row count in QA is vastly different from what you get for results, you might just have statistics that are out of date (you can check when they were last updated with DBCC SHOW_STATISTICS).
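For reference, both DBCC commands take the table name plus an index (or statistics) name; something like this, where IX_UnitValue_dts stands in for one of your real index names:

dbcc showcontig ('UnitValue', 'IX_UnitValue_dts')      -- fragmentation for that one index
dbcc show_statistics ('UnitValue', 'IX_UnitValue_dts') -- last update date, density, histogram
update statistics UnitValue                            -- refresh stats on all of the table's indexes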


Also, are you sure all of the queries have correct covering indexes? You could try using the Index Tuning Wizard for some tips.

Assuming none of those suggestions help, I would use Performance Monitor to check counters like Avg. Disk Queue Length, and some of the SQL Server specific ones (I won't go into more detail here; you can google around and find more detailed articles about which ones to watch).

and on preview, orthogonality's comment about partitioning is a very good one, assuming the size of the table is the ultimate drawback.
posted by ilovebicuspids at 5:42 PM on January 31, 2007


You really need to normalize.

table Fund
* fund_id (int)
* fund_name (varchar)

table Unit
* id (int)
* price (money)
* date (datetime)

table FundUnit
* id (int)
* fund_id (int)
* unit_id (int)

To find the total balance for a fund for any given day:
select sum(u.price) from Unit u join FundUnit fu on fu.unit_id = u.id join Fund f on fu.fund_id = f.fund_id where f.fund_id = [id of the fund you're interested in] and u.date = [the day you're interested in]
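Roughly, as DDL, if that helps (types and lengths are guesses):

create table Fund (
fund_id int not null primary key,
fund_name varchar(100) not null
)

create table Unit (
id int not null primary key,
price money not null,
date datetime not null
)

create table FundUnit (
id int not null primary key,
fund_id int not null references Fund (fund_id),
unit_id int not null references Unit (id)
)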
posted by Civil_Disobedient at 5:47 PM on January 31, 2007 [1 favorite]


Response by poster: To answer orthogonality: Well, all of my data is primarily based off of a table called Eventlog. Eventlog's purpose is to track Buy and Sell actions based on these funds.

So, for example, say you buy 200.00 of Fund A last Friday, when it was worth 10.00. You want to know how much money you have today, when the fund is worth 12.00. You'd first take 200 / 10 (the price on the day you made the purchase), which gives you 20 shares, and multiply by 12 to get 240.00. You're finding out, initially, how many "shares" you purchased, and multiplying that amount by the value on the current day.

So, right now, I'm precalculating the Units (or Shares, as you might call them), so the Eventlog table shows an amount in dollars, a fund_id, and an amount of units.

However, the complication happens when you consider the actions that could be taken from when you initially made the investment to "today". For example: You could have transferred money out of that fund, you could have been paid on a benefit schedule, or you could have invested even more money.

So, the resulting query looks something like
select sum(case when action = 'Buy' then amount else 0-amount end / unitvalue.price * unitvaluetoday.price)
where the transaction date in the eventlog is less than or equal to 'today'.
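Spelled out a bit more, with made-up table and column names standing in for my real ones, the shape is roughly:

declare @today smalldatetime, @account_id int   -- placeholders
select @today = '20070131', @account_id = 12345

select sum(case when e.action = 'Buy' then e.amount else 0 - e.amount end
/ unitvalue.price * unitvaluetoday.price)
from Eventlog e
join UnitValue unitvalue
on unitvalue.fund_id = e.fund_id and unitvalue.dts = e.trans_date
join UnitValue unitvaluetoday
on unitvaluetoday.fund_id = e.fund_id and unitvaluetoday.dts = @today
where e.account_id = @account_id
and e.trans_date <= @today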

(I can't give you the exact code, unfortunately, as that would be breaking any number of legalities. I don't think I'm giving anything away about business practices here, as I don't believe I'm unique in my approach).

I have, through Enterprise Manager, declared indexes on those fields that are important to me in the WHERE clause of my queries; that is to say, dates, fund IDs, account IDs, and the like. The water's clearly been over my head for years now on the type and function of particular indexes, but I've poked and prodded at it until I've determined that I've squeezed everything I can out of what I've done so far.
posted by thanotopsis at 5:52 PM on January 31, 2007


Response by poster: Civil_Disobedient: You really need to normalize.

Well, you've lost me, there: How does that help the situation? The Units table is just as big, isn't it? Sure, it's less wide by one column, but it has just as many rows, and now I have an additional table with one row for each Unit row. Joining those two monster tables together would seem like a less efficient way of doing things than just having them in one table.

[/ignorance]
posted by thanotopsis at 5:55 PM on January 31, 2007


Response by poster: ilovebicuspids: With that being said, you should figure out where your performance bottleneck is. Have you checked the fragmentation of the indexes (using DBCC SHOWCONTIG and looking at scan density), or tried updating the statistics? If your estimated row count in QA is vastly different from what you get for results, you might just have statistics that are out of date (you can check when they were last updated with DBCC SHOW_STATISTICS).

I'm not sure how to interpret the results of this, and DBCC SHOW_STATISTICS tells me I have the wrong number of parameters.

Here's what SHOWCONTIG returns:

Table: 'UnitValue' (1433680801); index ID: 1, database ID: 7
TABLE level scan performed.
- Pages Scanned................................: 1116
- Extents Scanned..............................: 143
- Extent Switches..............................: 299
- Avg. Pages per Extent........................: 7.8
- Scan Density [Best Count:Actual Count].......: 46.67% [140:300]
- Logical Scan Fragmentation ..................: 10.48%
- Extent Scan Fragmentation ...................: 36.36%
- Avg. Bytes Free per Page.....................: 637.6
- Avg. Page Density (full).....................: 92.12%

posted by thanotopsis at 6:03 PM on January 31, 2007


thanotopsis writes "So, the resulting query looks something like
"select sum(case when action = 'Buy' then amount else 0-amount end / unitvalue.price * unitvaluetoday.price)
"where the transaction date in the eventlog is less than or equal to 'today'."


Well, I can't profile a query that "looks something like", but it looks like at least three tables are joined. You need to run "explain" or showplan on the query, and see where indices are not being used. Then you need to add the right indices, and run explain again to make sure they're being used.
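In Query Analyzer, for example (this is a generic query against your UnitValue table, substitute your real one):

set showplan_text on
go
select sum(price) from UnitValue where fund_id = 42 and dts >= '20060101'
go
set showplan_text off
go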

Civil_Disobedient is right about normalizing, but I think from what you've mentioned that your tables likely are in normal form.


thanotopsis writes "(I can't give you the exact code, unfortunately, as that would be breaking any number of legalities. I don't think I'm giving anything away about business practices here, as I don't believe I'm unique in my approach)."

If you can't get this working better yourself, consider hiring a consultant under an NDA. I'm available for that (and so are other people); you'd not be the first mefite to hire me to consult on SQL optimization under an NDA.
posted by orthogonality at 6:06 PM on January 31, 2007


Well, you've lost me, there: How does that help the situation?

Sorry, I was referring to the fundAName idea; you're right that it wouldn't reduce the number of total rows in your transaction table. It doesn't seem like there's any way around that, however, unless you wanted to make 150 separate tables, one for each fund, that contained rows for the dates and amounts, or, as orthogonality mentioned, split the dates into separate chunks. How flexible do you need to be with the queries? Is the situation you described (summing amounts of particular funds over time periods) pretty much the only use for the table?
posted by Civil_Disobedient at 6:27 PM on January 31, 2007


Response by poster: Civil_Disobedient: Is the situation you described (summing amounts of particular funds over time periods) pretty-much the only use for the table?

The only other time the table is used is when I'm reporting dry historical data -- i.e. a plain select across the table for a particular fund ID to show changing values over time.
posted by thanotopsis at 6:35 PM on January 31, 2007


Running DBCC SHOWCONTIG with only the one parameter will scan the whole table instead of a specific index. Are you sure your queries are even using your index?

Ideally, your scan density should be as close to 100% as possible, so the 46.67% indicates that you have external fragmentation in your table. This is a good explanation of how fragmentation can hurt performance. There are a lot of different things that can impact performance, and without going into too much detail, you should at least try defragmenting your indexes (using DBCC DBREINDEX or DBCC INDEXDEFRAG) and updating your statistics (UPDATE STATISTICS), then compare before and after query times. Just a warning: these commands will hit your database hard, so I wouldn't run them when users are running queries; I schedule mine to run overnight.
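The commands themselves look roughly like this (the index name is a placeholder for one of yours):

dbcc dbreindex ('UnitValue')                          -- rebuilds every index on the table (takes locks)
dbcc indexdefrag (0, 'UnitValue', 'IX_UnitValue_dts') -- online defrag of one index in the current database
update statistics UnitValue with fullscan             -- rebuilds statistics from the whole table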

The SQL Server help file (known as Books Online, which I never quite understood) is actually extremely helpful for examples and descriptions; you can read up on some of the commands to get a basic understanding of how they may help you.
posted by ilovebicuspids at 6:51 PM on January 31, 2007


I'm no DB expert, but you may get different performance based on whether your key/primary key is (DATE,FUND_ID) or (FUND_ID,DATE).

If you're primary keyed on (fund_id, date), it could mean that, when you query a bunch of funds on a particular date, you're jumping around the table a lot, which would give you bad cache behavior/performance.

However, while primary keying on (date, fund_id) would improve that case, it would make querying all the values for a particular fund slower (since each fund's rows would be scattered throughout the table). For that kind of query, (fund_id, date) would be better.

Anyway, if you haven't already, you might try (safely!) changing the primary key to each of those to see if it helps you out.
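Something like this, with made-up index names (and definitely try it on a copy of the table first):

-- clustered on (fund_id, dts): fast for one fund across many dates
create unique clustered index IX_UnitValue_fund_dts on UnitValue (fund_id, dts)

-- or clustered on (dts, fund_id): fast for many funds on one date
-- (a table can only have one clustered index, so drop the other one first)
-- create unique clustered index IX_UnitValue_dts_fund on UnitValue (dts, fund_id)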
posted by blenderfish at 8:22 PM on January 31, 2007


Play around with multi-column keys if you haven't already, I guess, is what I'm saying.
(i.e. "ALTER TABLE UnitValue ADD UNIQUE KEY (dts,fund_id,price)" or ..."(fund_id,dts,price" or both.)
posted by blenderfish at 8:30 PM on January 31, 2007


I don't feel there's enough information available here to really provide a decent answer, but I'm going to give in to temptation and offer you some advice.

I have a sneaky feeling that putting a sub-query at the heart of your existing query instead of a table/view may provide significant benefits (if you prefer a temporary table that's fine too).

So instead of :

select sum(case when action = 'Buy' then amount else 0-amount end / unitvalue.price * unitvaluetoday.price)
where the transaction date in the eventlog is less than or equal to 'today'.

... try (the equivalent of) ...

select sum(case when action = 'Buy' then amount else 0-amount end / unitvalue.price * unitvaluetoday.price)
from (select * from the eventlog where the transaction date is less than or equal to 'today').

... as I say, I don't feel you have provided enough specifics to be sure, but I think this is worth trying. I have seen cases where the optimizer makes some really bad decisions, and putting a sub-query at the heart of things to limit the number of rows that any other processing is applied to can often lead it on the road to righteousness!
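To make that concrete with placeholder table and column names (since we don't have the real schema, and @today stands for whatever date parameter you already pass in), the only change is that the date filter moves into a derived table that the rest of the query works from:

select sum(case when e.action = 'Buy' then e.amount else 0 - e.amount end
/ unitvalue.price * unitvaluetoday.price)
from (select fund_id, action, amount, trans_date
from Eventlog
where trans_date <= @today) e
join UnitValue unitvalue on unitvalue.fund_id = e.fund_id and unitvalue.dts = e.trans_date
join UnitValue unitvaluetoday on unitvaluetoday.fund_id = e.fund_id and unitvaluetoday.dts = @today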

All that stuff about fragmentation etc. is excellent and should be done. The stuff about keys is much harder to call; it depends on what your write:read ratios are in general use. Trying a sub-query, on the other hand, will take five minutes to write and won't mess with your existing data.

BTW, just how many rows do you have?
posted by southof40 at 9:10 PM on January 31, 2007


The only other time the table is used is when I'm reporting dry historical data -- i.e. a plain select across the table for a particular fund ID to show changing values over time.

If that's the case, I'd go with splitting the table by financial year, provided you're not constantly UNIONing separate years (which will likely offset any performance gains). How many rows are we talking about here, anyway?

BTW - I'd also highly suggest taking Orthogonality up on his offer of advice off-forum, if need be. He's exceptionally competent.
posted by Civil_Disobedient at 1:23 AM on February 1, 2007


Response by poster: How many rows are we talking about here, anyway?

About half a million. It's 150+ funds tracked for every business day for the last 10 years or so.
posted by thanotopsis at 4:14 AM on February 1, 2007


How to index depends on your usage.

You might try a clustered index, which is a PHYSICAL re-ordering of the data, on date/fund_id. Or you might try a clustered index on fund_id/date. Don't put your $$ (the price column) in the clustered index.

Sometimes putting every single field that you query into a regular (non-clustered) index will help.

And sometimes SQL Server is just dumb and won't use the index unless you specifically tell it to with an index hint.
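For example (the index name is made up):

-- a covering index: every column the query touches
create index IX_UnitValue_cover on UnitValue (fund_id, dts, price)

-- and forcing it with a table hint if the optimizer ignores it
select price
from UnitValue with (index (IX_UnitValue_cover))
where fund_id = 42 and dts = '20070130'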
posted by jaded at 10:37 AM on February 1, 2007


I'm confused. You have half a million rows at approximately 20 bytes per row, and that's an unwieldy table size? I'm intrigued as to why reducing table size is important at this level. Most people ask this question when they're hitting multi-gigabyte level, not 10 megabyte or so.
posted by wackybrit at 10:39 AM on February 1, 2007


Response by poster: wackybrit: I'm confused. You have half a million rows at approximately 20 bytes per row, and that's an unwieldy table size? I'm intrigued as to why reducing table size is important at this level. Most people ask this question when they're hitting multi-gigabyte level, not 10 megabyte or so.

Hey, if I knew what I was doing, I wouldn't be here, right? :)
posted by thanotopsis at 11:14 AM on February 1, 2007


Seriously, if you have a 10Mb table with performance problems, something is terribly wrong. (Unless by 'performance problems' you mean you want to query it like a thousand times a second.)

Since your schema looks sane (the first one), I really, really have to go with improper indexing.

What are your current indices/primary key?
posted by blenderfish at 1:00 PM on February 1, 2007


If you're doing this with historical data only, you can think about creating tables with hourly, daily, weekly and/or monthly averages instead of individual transactions, and schedule regular queries on off-hours to update them.

Then you just run reports off the most appropriate table: for example, if you want a 6-year line chart, you're not going to be concerned about each transaction, and daily or weekly averages will do just fine.

You'd still keep the original table as the canonical record, of course, but you'd rarely if ever run a query directly against it.
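A minimal sketch of the monthly-average version, with invented names (the refresh would run as an overnight job):

create table UnitValueMonthly (
month_start smalldatetime not null,
fund_id int not null,
avg_price decimal(10,4) not null,
primary key (month_start, fund_id)
)

-- overnight refresh
truncate table UnitValueMonthly
insert into UnitValueMonthly (month_start, fund_id, avg_price)
select dateadd(month, datediff(month, 0, dts), 0), fund_id, avg(price)
from UnitValue
group by dateadd(month, datediff(month, 0, dts), 0), fund_id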
posted by empath at 6:48 PM on February 1, 2007

