mysql query help please
March 13, 2008 10:43 AM   Subscribe

MySQL question: Given a table with a datetime column and several additional columns which may or may not be null, is there an easy way to select the most recent previous non-null value of each column given a specific date?

Example:

Given the following table:


| 10/10 | ___3 | NULL | NULL |
| 10/09 | NULL | NULL | NULL |
| 10/08 | NULL | ___2 | NULL |
| 10/07 | NULL | ___5 | NULL |
| 10/06 | ___4 | NULL | ___7 |


given the input "10/08" I would like to retrieve the following row:
| 10/08 | ___4 | ___2 | ___7 |

given the input "10/10" I would like to retrieve the following row:
| 10/08 | ___3 | ___2 | ___7 |

That is, looking backwards from the given date, the most recent non-null value from each column. Is this doable? I hacked something together with nested subqueries, but it is super slow.
posted by Nothing to Computers & Internet (10 answers total)
 
Could you add a TRIGGER that on INSERT on your example table, populates a second table with the date and most recent non-NULL value for each column? Something like:

column_name
non_null_value
entry_date


With the example above, you'd end up with:

columnA | 3 | 10/10
columnB | 2 | 10/08
columnB | 5 | 10/07
columnA | 4 | 10/06
columnC | 7 | 10/06


For input 10/08, to get back "4" for the first column, you just do a SELECT on this table, where entry_date <> to get the first result. Repeat for the remaining columns. If the table is indexed the searches should run quickly.
posted by Blazecock Pileon at 11:18 AM on March 13, 2008


Let me try that again:

You could try a SELECT on the second table, where entry_date <= 10/08 AND column_name LIKE 'columnA' LIMIT 1 to get back the value "4" for input "10/08".
posted by Blazecock Pileon at 11:23 AM on March 13, 2008


Response by poster: That's a good idea, but there are a few things that make it less than ideal in this specific situation. The main one is that the columns are of different types, and I do not know in advance all the columns / column types.

This seems like a relatively easy thing to do. Just scan backwards down the index checking for non null values in the specified columns. But I can't find a way to do it.
posted by Nothing at 11:27 AM on March 13, 2008


Join the table to itself so you have 3 copies and then select the most recent from each. For a table called "NULL_VALUES", with columns A, B and C it'd be like:

SELECT n1.A, n2.B, n3.C FROM NULL_VALUES n1
LEFT JOIN NULL_VALUES n2 ON ([primary key join goes here])
LEFT JOIN NULL_VALUES n3 ON ([primary key join goes here])
WHERE n1.A IS NOT NULL AND n1.date_column <> WHERE n2.B IS NOT NULL AND n2.date_column <> WHERE n3.C IS NOT NULL AND n3.date_column <> ORDER BY n1.date_column DESC

I think that will work. COALESCE is another handy pal when you want to avoid null values. Apologies if the code above isn't quite right. You might have to play with the ORDER BY to get the proper values, but I'm not sure.
posted by yerfatma at 11:46 AM on March 13, 2008


Argh, it ate my dates from the WHERE clause. Hopefully you get the idea.
posted by yerfatma at 11:47 AM on March 13, 2008


Well, this works in SQL Server, but I'm assuming you're doing something similar when you say "nested subqueries"?

select id,
(select top 1 column1 from temp t1 where t1.id <= t0.id and t1.column1 is not null order by t1.id desc) as Column1,
(select top 1 column2 from temp t2 where t2.id <= t0.id and t2.column2 is not null order by t2.id desc) as Column2,
(select top 1 column3 from temp t3 where t3.id <= t0.id and t3.column3 is not null order by t3.id desc) as Column3
from temp t0
where id = '10/10'
posted by blue_beetle at 11:51 AM on March 13, 2008


Response by poster: Damn it, I have been thinking about this too long, and I completely mis-explained the problem.

I do not need to get only one row. I need all rows that are <>
given the input "10/08" I would like to retrieve the following result (using the example tble above):
| 10/08 | ___4 | ___2 | ___7 |
| 10/07 | ___4 | ___5 | ___7 |
| 10/06 | ___4 | ___5 | ___7 |

given the input "10/10" I would like to retrieve the following result:
| 10/10 | ___3 | ___2 | ___7 |
| 10/09 | ___4 | ___2 | ___7 |
| 10/08 | ___4 | ___2 | ___7 |
| 10/07 | ___4 | ___5 | ___7 |
| 10/06 | ___4 | ___5 | ___7 |

Sorry.
posted by Nothing at 12:19 PM on March 13, 2008


I'm not following your latest examples. If you want to "select the most recent previous non-null value of each column given a specific date", wouldn't the result of "10/10" be:

| 10/10 | ___3 | ___2 | ___7 |
| 10/09 | ___3 | ___2 | ___7 |
| 10/08 | ___3 | ___2 | ___7 |
| 10/07 | ___3 | ___5 | ___7 |
| 10/06 | ___4 | ___5 | ___7 |

And the 10/06 middle column wouldn't actually be 5 but whatever was the previous non-null value before the 5 on 10/07.
posted by junesix at 2:16 PM on March 13, 2008


Response by poster: I think I am going to run a few smaller queries and combine the result in program code. Thanks everyone for your comments. I do understand the problem a lot better now.

Orthagonality: It is somewhere between your two cases. The table models state, but state includes which measurements were taken. The various measurements are reported at different, often irregular intervals. The primary purpose of the table is to determine which measurements were taken at a particular time, and what their values are. From what I have been told, this was the fastest design that was tested when the database was set up. Of course, no one wanted to know the last known value of every measurement for a given date when it was set up.
posted by Nothing at 4:04 AM on March 14, 2008


Very hard question. More on that below.

The base table:

CREATE TABLE `foo` (
`id` int(11) NOT NULL primary key auto_increment,
`d` date default NULL,
`c1` int(11) default NULL,
`c2` int(11) default NULL,
`c3` int(11) default NULL
);

The data set:
id  d                       c1      c2      c3
1   2001-01-01 00:00:00.0   1       1       1
2   2001-01-02 00:00:00.0   2       <null>  <null>
3   2001-01-03 00:00:00.0   <null>  3       <null>
4   2001-01-04 00:00:00.0   <null>  <null>  4
5   2001-01-05 00:00:00.0   <null>  <null>  <null>
6   2001-01-06 00:00:00.0   <null>  <null>  6
7   2001-01-07 00:00:00.0   <null>  <null>  7
8   2001-01-08 00:00:00.0   <null>  8       <null>
9   2001-01-09 00:00:00.0   <null>  <null>  9
10  2001-01-10 00:00:00.0   10      <null>  <null>

A view that self-joins foo on dates the same or pior to the date in foo (foo.d):

create or replace view foo_diff as
select a.id, a.d, b.id as bid, b.d as bd, a.d - b.d as ddiff, b.c1, b.c2, b.c3
from foo a left outer join foo b on (a.d >= b.d);


The select statement that does what you want.
Note that it won't return any dates for which there exists no qualifying non-null c1 value.
To fix this you can add an outer foo to the join.

SELECT a.id, a.d, a.c1, b.c2, c.c3
FROM foo_diff a join foo_diff b on (a.id = b.id and b.c2 is not null) join foo_diff c on (a.id = c.id and c.c3 is not null)
where a.ddiff <= ALL (select ddiff from foo_diff sa where sa.id = a.id and sa.c1 is not null )
and b.ddiff <= ALL (select ddiff from foo_diff sb where sb.id = b.id and sb.c2 is not null )
and c.ddiff <= ALL (select ddiff from foo_diff sc where sc.id = c.id and sc.c3 is not null )
and a.c1 is not null
--and b.c2 is not null
--and c.c3 is not null
order by a.id;

Note also the null predicates for b and c are commented out, as I put them in the joins.


Okay, this query won't perform real well, the query plan shows:
id  select_type             table   type    key     rows    extra
1   PRIMARY                 a       ALL     PRIMARY 10      Using temporary; Using filesort
1   PRIMARY                 b       ALL     <null>  10      Using where
1   PRIMARY                 a       eq_ref  PRIMARY  1
1   PRIMARY                 b       ALL     <null>  10      Using where
1   PRIMARY                 a       eq_ref  PRIMARY  1
1   PRIMARY                 b       ALL     <null>  10      Using where
4   DEPENDENT SUBQUERY      a       eq_ref  PRIMARY 4       Using where
4   DEPENDENT SUBQUERY      b       ALL     10              Using where
3   DEPENDENT SUBQUERY      a       eq_ref  PRIMARY 4       Using where
3   DEPENDENT SUBQUERY      b       ALL     10              Using where
2   DEPENDENT SUBQUERY      a       eq_ref  PRIMARY 4       Using where
2   DEPENDENT SUBQUERY      b       ALL     10              Using where

An alternative version seems to perform even less well, but might do better depending on your dataset. Since I don't know your data set, and you're not paying me, I'm not going to try quantifying which query is "better."

Alternative view:

create view foo_diff2 as select a.id, a.d, b.c1, c.c2, d.c3, a.d - b.d as bdiff, a.d - c.d as cdiff, a.d - d.d as ddiff
from foo a left outer join foo b on (a.d >= b.d and b.c1 is not null)
left outer join foo c on (a.d >= c.d and c.c2 is not null)
left outer join foo d on (a.d >= d.d and d.c3 is not null)
order by a.id;


Alternatve query:

SELECT * FROM foo_diff2 a
where a.bdiff <= ALL (select bdiff from foo_diff2 b where b.id = a.id)
and a.cdiff <= ALL (select cdiff from foo_diff2 c where c.id = a.id)
and a.ddiff <= ALL (select ddiff from foo_diff2 d where d.id = a.id);


Alternative plan:
id  select_type             table   type    key     rows    extra
1   PRIMARY                 a       ALL     <null>  10      Using temporary; Using filesort
1   PRIMARY                 b       ALL     <null>  10      Using where
1   PRIMARY                 c       ALL     <null>  10      Using where
1   PRIMARY                 d       ALL     <null>  10      Using where
4   DEPENDENT SUBQUERY      a       eq_ref  PRIMARY 1       Using where; Using temporary; Using filesort
4   DEPENDENT SUBQUERY      b       ALL     <null>  10
4   DEPENDENT SUBQUERY      c       ALL     <null>  10
4   DEPENDENT SUBQUERY      d       ALL     <null>  10      Using where
3   DEPENDENT SUBQUERY      a       eq_ref  PRIMARY 1       Using where; Using temporary; Using filesort
3   DEPENDENT SUBQUERY      b       ALL     <null>  10
3   DEPENDENT SUBQUERY      c       ALL     <null>  10      Using where
3   DEPENDENT SUBQUERY      d       ALL     <null>  10      Using where
2   DEPENDENT SUBQUERY      a       eq_ref  PRIMARY 1       Using where; Using temporary; Using filesort
2   DEPENDENT SUBQUERY      b       ALL     <null>  10      Using where
2   DEPENDENT SUBQUERY      c       ALL     <null>  10
2   DEPENDENT SUBQUERY      d       ALL     <null>  10      Using where
Notice all the filesorts. Filesorts are expensive, mostly because you're writing a file to disk. This suggests that the alternatve is less effeciant, not to mention that we have more steps in this plan.


PLEASE NOTE:

I have not tested either query against all corner cases. I have not seen your data or data model. Use at your own risk, and do your own testng to assure it works with your data and data model.


Ok, analysis time. The reason this query is so hard is that your data model is bad. It's not clear what you're trying to model here, but there are a couple of possibilities:

The mostly likely is 1), you're modeling state over time. That is, what you want to model is actually the result of the query. In that case, you should just model ALL the state at each point in time, by storing the value of each column at each date. Then you don't need to scan backward to the most recent prior non-null value.

Less likely, 2), you're trying to model when state changed. In that case you should just model that, with a tuple (entity_id, date, attribute_id, new_state_value). This is akin to a sixth normal form temporal database. (I briefly touch on this topic in an earlier askMefi answer.)

In any case, when a query is very hard, that's generally an indication that your data model isn't really capturing what you need to model.

In general, null should be used to mean "we don't know this attribute's value"; instead you're using it as a sentinel value to mean, "look elsewhere for this attribute's known value". That's a clever (too clever) misuse of database null.

(It's not necessarily a misuse of pointer null in a C-like language, though if I were doing I'd suggest using a special value other than pointer null (or a distinct type in an OO langauge) to do this.) I suspect that a C-like language programmer's misunderstandings of the difference between database null and pointer null influenced the design of your table.)
posted by orthogonality at 9:01 PM on March 14, 2008


« Older I need to win so my friends won't laugh at me   |   Disservice to Dog? Newer »
This thread is closed to new comments.