Join 3,556 readers in helping fund MetaFilter (Hide)

Tags:

How do I get the record with the most current date by a key field?
November 21, 2006 2:31 PM   Subscribe

SQL question: How do I get the record with the most current date for each person's ID? Better explanation inside.

Here's an example set of data:

ID TestDate Score1 Status1 Score2 Status2
11111 5/10/05 325 N R
11111 9/1/06 330 N R
22222 2/8/05 300 N 325 N
22222 5/10/05 360 P 365 P

What I want to return on a report is only the records that have the most current TestDate for each ID:

ID TestDate Score1 Status1 Score2 Status2
11111 9/1/06 330 N R
22222 5/10/05 360 P 365 P


I'm trying to do something like this:

select
ID, max(TestDate), Status1, Score1, Status2, Score2
from myTable
group by ID, Status1, Score1, Status2, Score2


But it's still giving me the same result set as if I didn't have the MAX or GROUP BY in there. My knowledge of the use of GROUP BY and HAVING is limited.

Help a n00b out? I'm using SQL Server 2005, BTW.
posted by DakotaPaul to Computers & Internet (14 answers total) 2 users marked this as a favorite
 
You really need a sub-query to get this - right now, you're picking all the scores and statuses. If they happen to have the same score and status on two tests, you'll get only the most recent of them, but otherwise, you get one line per score/status/ID.

Something like this will be closer to what you need:

Select ID, testDate, Status1, Score1, Status2, Score2
from myTable a
where testDate = (select max(testDate) from myTable b
where a.ID = b.ID)

The sub-query picks the maximum test date for the ID, and then the main query picks the line from the list that has that particular date on it.

Depending on how your database is set up, this may prove vastly inefficient, which only really matters if its a fairly large dataset. In that case, it might be better to temporarily material a table of ID, max(testdate) pairs and join the two tables together.
posted by jacquilynne at 2:40 PM on November 21, 2006 [1 favorite]


Wouldn't it be simpler to do:
SELECT 
   ID testDate, Status1, Score1, Status2, Score2
FROM
   myTable
ORDER BY testDate LIMIT 1

posted by IshmaelGraves at 2:54 PM on November 21, 2006


IshmaelGraves writes "Wouldn't it be simpler to do:"

No, for several reasons. Try it and see what happens.

jacquilynne's answer is correct.
posted by orthogonality at 3:06 PM on November 21, 2006


Classical SQL for that would probably be

SELECT s2.* FROM myTable s1, myTable s2 WHERE s1.ID=s2.ID GROUP BY s1.ID, s2.testDate HAVING s2.testDate=MAX(s1.testDate);

Which is the same as the sub query. What actually performs better will depend on context and not matter on databases maintained by people who have to ask SQL questions on AskMe.
posted by themel at 3:09 PM on November 21, 2006


jacquilynn's response may work. IshamelGraves' will only return the single row with the earliest test date.

Something I've done in a quick hack is feed all unique IDs into an array (something like 'SELECT ID FROM myTable group by ID'), then whack the database repeatedly with queries, getting the latest response for each ID (something like 'SELECT * FROM myTable WHERE ID='foo[i]' ORDER BY testDate DESC LIMIT 1'). This works and is horrible and inefficient. Don't do it.
posted by ardgedee at 3:19 PM on November 21, 2006


ardgedee writes "Something I've done in a quick hack is feed all unique IDs into an array (something like 'SELECT ID FROM myTable group by ID'), then whack the database repeatedly with queries, getting the latest response for each ID (something like 'SELECT * FROM myTable WHERE ID='foo[i]' ORDER BY testDate DESC LIMIT 1'). This works and is horrible and inefficient. Don't do it."


That's exactly what a subquery does, except you're pulling the data to the client and then forcing the server to compile the same query (with a different id each time) over and over again.
posted by orthogonality at 3:22 PM on November 21, 2006


IANADBA but it seems to me that by joining to the subquery instead of using it in the WHERE clause you'd let the DB run the subquery only once:

select * from myTable join ( select ID, max(testDate) as testDate from myTable group by ID ) as myTableAlias using (ID, testDate)
 
orthogonality, am I mistaken? Or is that all that jacquilynne meant by "temporarily material"?
posted by nicwolff at 3:55 PM on November 21, 2006


nicwolff writes "orthogonality, am I mistaken? "

That's a QoI question, dependent on the query optimizer and also on the indices on the table. I suspect jacquilynne meant "make a temp table", but temporarily materialize could also mean doing what you suggest or using a view. Again, how the database implements a join, a subquery or a temp table is QoI.

What you're doing is essentially using an unamazed temporary view; Sybase supports this (although it requires a more explicit syntax than you show). I honestly don't know id any databases would support the exact syntax you use, which gives me something to do when I'm more motivated, so thank you.

Note also the problem -- in all the answers -- if one user has two tests with the same date, and that happens to be the max( date ).
posted by orthogonality at 4:09 PM on November 21, 2006


In the databases I have the most experience with, I'd have actually materialized the table of ID / max date pairs - created a temporary saved data table and then deleted it after. But that's an implementation specific thing - we got heavily dinged by our support staff if our queries ran over the governor time.

Dumping a quick result to a temporary table and then joining it to that table was space inefficient, and not even particularly time efficient - but it divided the work into two separate operations which would usually come in under the governor (assuming the ID we were saving in the temp table was in the index on the main table).

Doing it as a sub-query was the fastest code to write and execute, so I'd do that on small table sets where I was pretty sure I'd bring it in under the governor. The sort of quick, true temporary tables that automatically vanish with the query would probably be the most efficient solution in many implementations - but as they ended up counting as part of the same operation for the governor, they did me no good for hiding from the angry sysadmins, so I never really learned how to use them particularly well.

I don't get the impression though, that most of this is likely to be a concern with this particular Ask, so I just opted for the fastest code that I knew I could write well.
posted by jacquilynne at 4:34 PM on November 21, 2006


I know my query works on PostgreSQL and is a lot faster:

# explain select * from myTable a join ( select ID, max(testDate) as testDate from myTable group by ID ) b using (ID, testDate);

QUERY PLAN
-------------------------------------------------------------------------------------
Merge Join (cost=159.86..173.66 rows=8 width=16)
Merge Cond: (("outer".testdate = "inner".testdate) AND ("outer".id = "inner".id))
-> Sort (cost=113.27..117.34 rows=1630 width=16)
Sort Key: a.testdate, a.id
-> Seq Scan on mytable a (cost=0.00..26.30 rows=1630 width=16)
-> Sort (cost=46.59..47.09 rows=200 width=12)
Sort Key: b.testdate, b.id
-> HashAggregate (cost=34.45..36.95 rows=200 width=12)
-> Seq Scan on mytable (cost=0.00..26.30 rows=1630 width=12)
(9 rows)

# explain select * from myTable a where testDate = ( select max(testDate) as testDate from myTable b where a.ID = b.ID );

QUERY PLAN
------------------------------------------------------------------------
Seq Scan on mytable a (cost=0.00..49594.60 rows=8 width=16)
Filter: (testdate = (subplan))
SubPlan
-> Aggregate (cost=30.40..30.41 rows=1 width=8)
-> Seq Scan on mytable b (cost=0.00..30.38 rows=8 width=8)
Filter: ($0 = id)
(6 rows)

posted by nicwolff at 5:57 PM on November 21, 2006 [1 favorite]


Thank you all for the help and suggestions. I quickly tried jacquilynne's query and it looks like it does what I want. I'll try nicwolff's when I get back to work next week. I feel pretty silly -- the query was a lot simpler than I thought.
posted by DakotaPaul at 6:41 PM on November 21, 2006


nicwolff writes "I know my query works on PostgreSQL and is a lot faster:"


Good work, actually runing it and showing results. Thanks.
posted by orthogonality at 4:36 AM on November 22, 2006


I think the following would be the most efficient in SQL 2005. Uses a derived table instead of a sub query. Certainly more efficient than the one currently marked as best answer...

Select a.ID, b.testDate, a.Status1, a.Score1, a.Status2, a.Score2
from myTable a
inner join (select ID, max(testDate) as testDate
from myTable
group by ID) b
ON a.ID = b.ID AND a.testDate = b.testDate
posted by chill at 5:51 AM on November 22, 2006


Although I've found that if you are dealing with lots of records in SQL 2005 (hundreds or thousands) dumping the derived data into a temp table first and joining to that tends to be quicker.
posted by chill at 5:53 AM on November 22, 2006


« Older How can I store our company st...   |  LosAngeles-talk: what's that s... Newer »
This thread is closed to new comments.