There's got to be a faster way to update
March 27, 2009 1:02 PM
Please help me simplify a complicated MySQL update
I have a giant database with data nested in 3 levels: year (00-07), time (1-15), and location (1-125000).
Let's say I have a field called z. I am trying to add a field called cumulative z (by year) to each record.
My current approach to inserting cumulative z values is as follows:
For the sake of tractability, I have split the data into separate tables for each year.
Using PHP, I have 2 nested loops (time and location).
Within the inner loop, the following happens:
If time is 1, then cumulative z = z for that location.
If time > 1, then cumulative z = z + cumulative z for time t-1. This step is the bottleneck: I have to run another query to find out what cumulative z was for the same location at t-1.
The query takes a very long time for each location. I estimate that it will take several days to do this for a single year. There must be a faster way. Can any SQL gurus offer suggestions to make this update faster?
1. How giant?
2. Define your table structures? Are they indexed properly? Scratch that ... are they indexed?
3. Include your existing code and query?
posted by devbrain at 1:20 PM on March 27, 2009
I'm not really getting exactly what it is you're trying to do. Let me see if I'm understanding it right.
You have a table with
Year,
Location,
Time,
Z,
CumZ
At each location, loop through time summing the values of Z so far, and putting that into the CumZ field.
For example:
2000, 1, 1, 1, NULL
2000, 1, 2, 3, NULL
2000, 1, 3, 5, NULL
2000, 2, 1, 2, NULL
2000, 2, 2, 4, NULL
2000, 2, 3, 6, NULL
is the original data. After update it would look like:
2000, 1, 1, 1, 1
2000, 1, 2, 3, 4
2000, 1, 3, 5, 9
2000, 2, 1, 2, 2
2000, 2, 2, 4, 6
2000, 2, 3, 6, 12
Have I got the right idea?
posted by MagicEightBall at 1:21 PM on March 27, 2009
Response by poster: 1. How giant?
Each year has 125000 x 15 = 1875000 records.
2. Define your table structures? Are they indexed properly? Scratch that ... are they indexed?
nope. I don't even know how.
3. Include your existing code and query?
Sure (this is the php code for year 2000. I will run similar scripts for the other years).
for ($sid = 2; $sid <= 15; $sid++)
{
    // Grab every location's row for this sample (time step).
    $result = mysql_query("SELECT * FROM f2000 WHERE SAMPLE_ID=$sid", $db);
    while ($row = mysql_fetch_array($result))
    {
        // Look up the cumulative value for the same location at the previous time step.
        $pid = $sid - 1;
        $presult = mysql_fetch_array(mysql_query("SELECT cCUMTEMP FROM f2000 WHERE SAMPLE_ID=$pid AND ID={$row['ID']}", $db));
        // Add this row's TEMP to the previous cumulative and write it back.
        $cCT = $presult[0] + $row['TEMP'];
        mysql_query("UPDATE f2000 SET cCUMTEMP=$cCT WHERE ID={$row['ID']}", $db);
    }
}
posted by special-k at 1:31 PM on March 27, 2009
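(One small SQL aside: the loop above starts at SAMPLE_ID 2 and reads the previous step's cCUMTEMP, so the base case from the question -- cumulative z = z when time is 1 -- has to be seeded somewhere first. Assuming the same f2000 columns as in the PHP, that's a one-liner:)
-- Seed the first time step: the cumulative equals the raw value.
UPDATE f2000 SET cCUMTEMP = TEMP WHERE SAMPLE_ID = 1;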
Response by poster: MagicEightBall: Yep, you got it.
posted by special-k at 1:32 PM on March 27, 2009
Best answer: For a table named "Table"
Update Table a
set CumZ = (select sum(z) from Table b where b.year = a.year AND b.location = a.location AND b.time <= a.time)
I think that might work. My brain is slightly hazy on Friday afternoons.
posted by MagicEightBall at 1:35 PM on March 27, 2009
Forgot to mention, that's one query that will update every row. You don't have to do any looping at all.
posted by MagicEightBall at 1:41 PM on March 27, 2009
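(For anyone landing here later: written out in full, with the comparison spelled <=, the statement looks like the sketch below; the table name yourtable is just a placeholder for whichever per-year table it runs against. One hedge: depending on the MySQL version, a subquery that reads from the very table being updated may be refused with a "can't specify target table for update" error; if that happens, the derived-table form sketched after spatula's answer further down avoids it.)
-- A sketch, assuming a table yourtable with columns (year, location, time, z, CumZ).
UPDATE yourtable a
SET a.CumZ = (
    SELECT SUM(b.z)
    FROM yourtable b
    WHERE b.year = a.year
      AND b.location = a.location
      AND b.time <= a.time
);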
"Without an index, MySQL must begin with the first row and then read through the entire table to find the relevant rows. The larger the table, the more this costs. If the table has an index for the columns in question, MySQL can quickly determine the position to seek to in the middle of the data file without having to look at all the data. If a table has 1,000 rows, this is at least 100 times faster than reading sequentially."
First and foremost, you need to index your tables. An RDBMS should be able to handle well over 100 million rows without issues when everything is set up well; you just need to make sure you aren't inadvertently causing worst-case algorithmic performance somewhere.
For a better long term solution, read the first couple chapters of An Introduction to Database Systems by C.J. Date (it's a college textbook, so the previous edition will be significantly cheaper). To really take advantage of the flexibility and efficiency of relational databases, you need to understand the relational model and the somewhat idiosyncratic ways it expects your tables to be structured (such as "normalization").
Also: tutorials that assume you're working with MySQL are sometimes very poorly informed about relational databases.
posted by silentbicycle at 4:28 PM on March 27, 2009
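(For reference, "indexing" here boils down to a one-line ALTER TABLE. The statements below use the f2000 table and the ID/SAMPLE_ID columns from the PHP above; the index names are invented. A composite index on (ID, SAMPLE_ID) lets per-location, per-time lookups seek directly instead of scanning all 1.8 million rows.)
-- Composite index for "this location at time <= t" lookups.
ALTER TABLE f2000 ADD INDEX idx_loc_time (ID, SAMPLE_ID);
-- If the per-sample scans in the PHP loop stick around, an index on SAMPLE_ID alone helps those too.
ALTER TABLE f2000 ADD INDEX idx_time (SAMPLE_ID);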
Oops - the second link was supposed to point here, missed that on preview.
The Definitive Guide to SQLite by Mike Owens has a good, short (though less academically rigorous, which could be a good or bad thing) overview of the relational database concepts as well.
But, index your tables! Then see if you need to worry about optimizing further.
posted by silentbicycle at 4:40 PM on March 27, 2009
update a
set a.zCum = cumSum
from
YOURTABLE a inner join
(select sum(b.z) as cumSum, a.iYear, a.iTime, a.iLoc from YOURTABLE a inner join YOURTABLE b
ON a.iYear = b.iYear and b.iTime <> group by a.iYear, a.iTime, a.iLoc ) b
on a.iYear = b.iYear and a.iTime = b.iTime and a.iLoc = b.iLoc>
posted by spatula at 5:00 PM on March 27, 2009
Argh, forgot about the html filter, some of that join was chopped off. What is the best way to disable HTML on comments?
posted by spatula at 5:06 PM on March 27, 2009
Best answer: Figured it out, sorry. <<< WHEE >>>
OK so first in order for this to work you need to add a surrogate key to your table so we can compare TIME and LOC at the same time:
Assuming your table looks something like this:
create table YOURTABLE (
iYear int
, iTime int
, iLoc int
, z int
, cumZ int
)
alter table YOURTABLE add iOrder int
update YOURTABLE set iOrder = (iTime * 1000000 + iLoc )
so if you have
06, 1, 1, 5
06, 1, 2, 7
06, 1, 3, 9
06, 2, 1, 2
06, 2, 2, 4
06, 2, 3, 7
your order field would look like:
06, 1, 1, 5, 1000001
06, 1, 2, 7, 1000002
06, 1, 3, 9, 1000003
06, 2, 1, 2, 2000001
06, 2, 2, 4, 2000002
06, 2, 3, 7, 2000003
now run this bad boy:
update a
set a.cumZ = cumSum
from
YOURTABLE a inner join
(select sum(b.z) as cumSum, a.iYear, a.iOrder from YOURTABLE a inner join YOURTABLE b
ON a.iYear = b.iYear and b.iOrder <= a.iOrder
group by a.iYear, a.iOrder ) b
on a.iYear = b.iYear and a.iOrder = b.iOrder
posted by spatula at 5:28 PM on March 27, 2009
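(A couple of MySQL-specific hedges on the query above, for anyone adapting it: the UPDATE ... FROM form is SQL Server flavored, and MySQL wants its multi-table UPDATE syntax instead; also, if the goal is a running total per location rather than across the whole year, the inner self-join needs to match on location as well, as in MagicEightBall's version. A rough sketch combining the two, using the YOURTABLE columns declared above; the index name is made up.)
-- MySQL multi-table UPDATE against a derived table. The derived table is
-- materialized, which also sidesteps the "can't specify target table for
-- update" restriction on self-referencing subqueries.
UPDATE YOURTABLE a
JOIN (
    SELECT b.iYear, b.iLoc, b.iTime, SUM(c.z) AS cumSum
    FROM YOURTABLE b
    JOIN YOURTABLE c
      ON c.iYear = b.iYear
     AND c.iLoc  = b.iLoc
     AND c.iTime <= b.iTime
    GROUP BY b.iYear, b.iLoc, b.iTime
) t ON t.iYear = a.iYear AND t.iLoc = a.iLoc AND t.iTime = a.iTime
SET a.cumZ = t.cumSum;
-- An index covering the join columns keeps the self-join tractable.
ALTER TABLE YOURTABLE ADD INDEX idx_year_loc_time (iYear, iLoc, iTime);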
To get your query to show properly, replace < with &lt; and > with &gt;. (But don't cut and paste those escape sequences from this comment, just type what you see.)
posted by XMLicious at 5:30 PM on March 27, 2009
No, no, no. No, no, no, no.
If you want to get the cumulative value of a column, you do a group by and a sum, e.g., given the table
create table yearmonthdayvalue (year int, month int, day int, value int ) ;
with data like,
2000, 1, 1, 100
2000, 1, 2, 222,
...
2008, 12, 30, 333
2008, 12, 31, 555
if you want the cumulative by year, you do this select: select year, sum(value) as cumulative from yearmonthdayvalue group by year;
and you get this result:
year cumulative
2000 322
...
2008 888
if you wanted the cumulative per month, you'd do this: select year, month, sum(value) as cumulative from yearmonthdayvalue group by year, month;
Now some notes: this will be MUCH faster and better than pulling out the data and adding it yourself, because:
1) you don't pay the cost of transporting all 1875000 records out of the db and down the wire to your client, which then has to decode those values and iterate through them
2) the RDBMS was written with the "sum" function in mind, and has implemented it with some highly optimized C or assembly code to do it in place and do it fast
3) it will do the right thing with null values
4) a "sum" function mandated by the ANSI SQL standard is more likely to be tested and accurate than anything you (or I) roll on our own.
Then create a view named cumulative_sum from your query, so you can use it again and again.
But most important: storing the value in a "cumulative" column is all sorts of fail because the moment any of your records change, are added or deleted, YOUR TEDIOUSLY CALCULATED CUMULATIVE SUM WILL BE INCORRECT.
If, after adding appropriate indices, it still takes "too long" to calculate the sum using a group by, at least get the first four benefits by calculating in the database and then adding the calculated value as a dangerous but "necessary" denormalization.
posted by orthogonality at 5:57 PM on March 27, 2009 [2 favorites]
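(For concreteness, the view suggested above would look something like this, using the yearmonthdayvalue example table; the view name is the one proposed in the comment.)
CREATE VIEW cumulative_sum AS
SELECT year, SUM(value) AS cumulative
FROM yearmonthdayvalue
GROUP BY year;
-- It can then be queried like any table:
SELECT * FROM cumulative_sum WHERE year = 2008;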
Let's discuss indices, and why they make the database able to find a sum much faster than your client code can.
Again, given a table with a tuple (year, month, day, value) (and in real life we'd just use a tuple (date, value), but this is for illustrative purposes only), let's discuss the obvious and "naive" algorithm client code would use to implement a group by.
We'll assume that the client code is smart enough to do a select year, month, day, value from yearmonthdayvalue order by year, month, day;, thus making the database, not the client, do the sorting. Then we'd start at the first row, and do something like this
Vector years;
Vector cumForYear;
// currentRow already points at the first row here
int curYear = currentRow.year;
int cum = currentRow.value;
while( currentRow = getNextRow() ) {
    if( curYear != currentRow.year ) {
        // the year changed: record the finished year's total and start a new one
        years.push(curYear);
        cumForYear.push(cum);
        cum = currentRow.value;
        curYear = currentRow.year;
    } else {
        cum += currentRow.value;
    }
}
// record the final year's total
years.push(curYear);
cumForYear.push(cum);
In other words, we'd do a straight-forward iteration over the values, after making the database sort them for us and send them to us.
Now contrast how an appropriately indexed database would do it. The rows would be indexed. They'd probably be in an n-ary tree, but as any n-ary tree can be trivially converted to a binary tree, we'll consider the index a binary tree. In which case, from the root node, approximately half the records are down the left subtree, and half are down the right subtree. And retrieving records in index order (which, in this case, is date order) is just a matter of doing an inorder traversal of the tree, adding to our output the year and the cumulative when a node has a year different than the previous year we saw (this is similar to the naive strategy).
But it gets better: if we have the server running on, say, a quad-core machine, it's entirely possible that when we get to depth two of the index tree (where there are a total of four child nodes), each core can independently and simultaneously sum up a different subtree.
Even better, when the cores finished, the core that went down the left-most subtree has the earliest years summed up, the rightmost the latest years, etc. All that needs to be done is to look at the last year of any one core's list, and the first year of the next core's list, and add the cumulatives together if the cores overlapped a particular year.
Or in other words, we got the answer for (almost) the same cost as getting the sort -- and remember, the naive sum-in-the-client strategy asked for the rows in sorted order. That is, the database has to do almost the same traversal for summing and grouping as for sorting, only adding the cost of addition, addition in C or assembly -- and without the cost of pushing lots of data down the wire for the client to crack and convert from the database's C library format to the client language variable type (not insignificant, especially for weakly-typed scripting languages where variables are likely to be pointers to C structs or C unions).
posted by orthogonality at 6:34 PM on March 27, 2009 [1 favorite]
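(A practical footnote for anyone following along: MySQL's EXPLAIN shows whether a query is actually using an index or scanning every row. The example below reuses the yearmonthdayvalue table from the comment above.)
-- Shows whether MySQL plans an index lookup or a full table scan.
EXPLAIN SELECT year, SUM(value) AS cumulative
FROM yearmonthdayvalue
GROUP BY year;
-- A non-NULL 'key' column means an index is being used; type 'ALL' means
-- the whole table is being read row by row.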
Orthogonality knows what he/she is talking about.
posted by silentbicycle at 6:51 PM on March 27, 2009
Response by poster: But most important: storing the value in a "cumulative" column is all sorts of fail because the moment any of your records change, are added or deleted, YOUR TEDIOUSLY CALCULATED CUMULATIVE SUM WILL BE INCORRECT.
Good point ortho. But let me elaborate my needs a bit more.
The data will not change, so this is a one-time deal. The reason behind creating this column is that I need to do statistics on these data. I will also be filtering these data based on other fields, so computing the cumulative value on the fly in my statistical software would be much, much slower. It's just easier to have this value as part of each record.
Once I have this column, I will write each year's data to a comma-separated file and then read it into R. I have tried this and it works beautifully.
I ran MagicEightBall's statement and it's been going for over 12 hours now (on a Core Duo machine with maxed-out RAM).
posted by special-k at 6:55 AM on March 28, 2009
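(If useful, the CSV step can also be done inside MySQL itself with SELECT ... INTO OUTFILE. The file path below is a placeholder, the server process needs write permission to it, and the output has no header row.)
SELECT *
FROM f2000
INTO OUTFILE '/tmp/f2000.csv'
FIELDS TERMINATED BY ',' ENCLOSED BY '"'
LINES TERMINATED BY '\n';
-- which can then be read in R with read.csv('/tmp/f2000.csv', header = FALSE)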
Response by poster: PS: I'll work on indexing the tables this am.
posted by special-k at 6:58 AM on March 28, 2009
Response by poster: Once I created a composite column (spatula's suggestion) and ran MagicEightBall's code, it worked beautifully.
posted by special-k at 12:19 PM on March 30, 2009
Response by poster: On orthogonality's suggestion, I am posting some info on the speed difference.
Before creating that composite field and indexing, the operation took over 45 hours and had not even worked through half a table.
After the composite/index, the operation took 2 minutes and 33 seconds per table!
w00t!
posted by special-k at 12:49 PM on April 1, 2009 [1 favorite]
This thread is closed to new comments.