Finding rows that don't exist in an SQL Server 2005 table
October 7, 2009 12:24 PM Subscribe
I can't seem to build a query in SQL Server 2005 that returns rows that are not in a table.
I have a table called
Table
In this example, the attendance for student 111, period 4, is missing for 2009-10-02 and 2009-10-03. How can I build a query to get that result? Ideally, the query would return:
I've tried a
I have a table called
sch
that includes student IDs and the periods for which the students are scheduled:
id period
111 1
111 2
111 3
111 4
111 5
Table
att
includes student IDs and the dates and periods they attended:
id date period
111 2009-10-01 1
111 2009-10-01 2
111 2009-10-01 3
111 2009-10-01 4
111 2009-10-01 5
111 2009-10-02 1
111 2009-10-02 2
111 2009-10-02 3
111 2009-10-02 5
111 2009-10-03 1
111 2009-10-03 2
111 2009-10-03 3
111 2009-10-03 5
In this example, the attendance for student 111, period 4, is missing for 2009-10-02 and 2009-10-03. How can I build a query to get that result? Ideally, the query would return:
id date period
111 2009-10-02 4
111 2009-10-03 4
I've tried a
left join
on the tables between the IDs and periods but it's not getting me what I need. Any help is appreciated.You're probably going to need a helper table for this that lists the dates that were eligible for attendance. Let's say you had one with a column called "date", you could do
select
sch.id, helper.date, sch.period
from
sch
cross join datehelper
where
not exists (select 1 from att where att.id = sch.id and att.period=sch.period and att.date = datehelper.date)
If you're just starting out, though, you might want to look into a different schema that made the existence of a class more of a first class concept.
posted by Bezbozhnik at 12:38 PM on October 7, 2009
select
sch.id, helper.date, sch.period
from
sch
cross join datehelper
where
not exists (select 1 from att where att.id = sch.id and att.period=sch.period and att.date = datehelper.date)
If you're just starting out, though, you might want to look into a different schema that made the existence of a class more of a first class concept.
posted by Bezbozhnik at 12:38 PM on October 7, 2009
select q.id,q.date,q.period from (select distinct sch.*, date from sch, att) q left join att on q.id=att.id and q.period=att.period and q.date=att.date
where att.period is null
will give you the results you want, but won't necessarily work in general, since it assumes that any day that any student was in attendance, every student is expected to be there for all their assigned periods. So something like a Saturday class just for certain students will show all the other students as being absent that day. Or in the case where every student is absent for a day (unlikely but technically possible) that query won't show any of them as being absent.
What Bezbozhnik said above would make sense if you can get a helper table for dates that should be looked at.
posted by reptile at 12:50 PM on October 7, 2009
where att.period is null
will give you the results you want, but won't necessarily work in general, since it assumes that any day that any student was in attendance, every student is expected to be there for all their assigned periods. So something like a Saturday class just for certain students will show all the other students as being absent that day. Or in the case where every student is absent for a day (unlikely but technically possible) that query won't show any of them as being absent.
What Bezbozhnik said above would make sense if you can get a helper table for dates that should be looked at.
posted by reptile at 12:50 PM on October 7, 2009
Response by poster: I'm trying reptile's suggestion first, and I think am getting what I want but the query is taking a couple minutes to run. Is this query
essentially a cross join? The
posted by DakotaPaul at 2:41 PM on October 7, 2009
select distinct sch.*, date from sch, att
essentially a cross join? The
sch
table has ~900 rows and att
has ~45,000. I think that's what's taking so long. Fortunately, I'll rarely need to run this.posted by DakotaPaul at 2:41 PM on October 7, 2009
break that query down a little further:
posted by jockc at 3:06 PM on October 7, 2009
select a.id, a.[date], a.period from (select sch.id, dates.[date], sch.period from (select distinct [date] from att) dates, sch) a left join att on a.id = att.id and a.[date] = att.[date] and a.period = att.period where att.period is null
posted by jockc at 3:06 PM on October 7, 2009
Seems to me that the difficulty is caused by having the same name for two different things. If the "period" column of sch was called "period_scheduled" and the "period" column of att was called "period_attended" then you could just do a cartesian product of both tables and select rows where period_attended is null and period_scheduled isn't. Or am I missing something vital?
posted by flabdablet at 5:10 PM on October 7, 2009
posted by flabdablet at 5:10 PM on October 7, 2009
Seems to me that the difficulty is caused by having the same name for two different things. If the "period" column of sch was called "period_scheduled" and the "period" column of att was called "period_attended" then you could just do a cartesian product of both tables and select rows where period_attended
That shouldn't affect anything, you can refer to the rows as sch.period and att.period.
posted by delmoi at 5:43 PM on October 7, 2009
That shouldn't affect anything, you can refer to the rows as sch.period and att.period.
posted by delmoi at 5:43 PM on October 7, 2009
You're quite right; I'm an idiot.
posted by flabdablet at 5:58 PM on October 7, 2009
posted by flabdablet at 5:58 PM on October 7, 2009
When I first saw this question, my immediate reaction was: wtf? How can anybody expect a database to return results that aren't in it? and it now seems to me that the instinct behind that reaction is basically sound.
The trouble is that you're trying to run a query for absences against a database that's only been built to record attendances. There simply isn't enough information in the database to let you do that reliably. An absence is not simply the inverse of an attendance - it's the lack of an expected attendance, and at the moment, you don't have a way to specify those expectations completely.
Assuming that most students will attend most of their scheduled periods, your data would probably get much smaller if you recorded only absences, and kept a separate roll table with date and period columns to let you back-construct attendance data: if the roll was marked for a given period on a given date, then all students scheduled to attend that period and not marked absent have attended.
posted by flabdablet at 6:53 PM on October 7, 2009
The trouble is that you're trying to run a query for absences against a database that's only been built to record attendances. There simply isn't enough information in the database to let you do that reliably. An absence is not simply the inverse of an attendance - it's the lack of an expected attendance, and at the moment, you don't have a way to specify those expectations completely.
Assuming that most students will attend most of their scheduled periods, your data would probably get much smaller if you recorded only absences, and kept a separate roll table with date and period columns to let you back-construct attendance data: if the roll was marked for a given period on a given date, then all students scheduled to attend that period and not marked absent have attended.
posted by flabdablet at 6:53 PM on October 7, 2009
Response by poster: Thanks for all the comments, everyone. Unfortunately, I'm stuck with this schema and method of recording attendance. Admin staff at the school sites want to know what teachers are not recording attendance at all, so I'm left to look for data that doesn't exist.
posted by DakotaPaul at 7:02 PM on October 7, 2009
posted by DakotaPaul at 7:02 PM on October 7, 2009
Yes, the comma syntax is a cross join. Apologies for the messy query, I just dashed it off quickly. It's probably going to run slowly just because the database schema doesn't easily lend itself to doing what you're trying to do, unfortunately.
posted by reptile at 7:04 PM on October 7, 2009
posted by reptile at 7:04 PM on October 7, 2009
Period 14,287 isn't missing because it doesn't occur anywhere in the sch table.
posted by flabdablet at 8:43 PM on October 7, 2009
posted by flabdablet at 8:43 PM on October 7, 2009
I just tried out jockc's solution using sqlite, and it appears to work as intended, and it should run fast as well: it's based on a cross join between the distinct dates from att and all of sch, which (assuming close to full attendance) will be about the same number of rows as att itself rather than a 45000 x 900 row monster.
posted by flabdablet at 10:52 PM on October 7, 2009
posted by flabdablet at 10:52 PM on October 7, 2009
The following uses a Common Table Expression with a slightly different approach than jockc. It generates a CTE with the expected attendance and compares that to the actual attendance and pulls out the missing periods.
;WITH PerfectAttendance (StudentID, Period, [Date]) AS(posted by SyntacticSugar at 8:37 AM on October 9, 2009
SELECT
sch.StudentID
, sch.Period
, att.[Date]
FROM sch
inner join att
on sch.StudentID=att.Studentid
group by
sch.studentid
, att.[Date]
,sch.period
)
SELECT
sch.StudentID
, sch.Period AS MissingPeriod
, sch.DayOfwork AS DayInQuestion
, att.[Date]
FROM PerfectAttendance sch
left join att
on sch.StudentID=att.Studentid and sch.[Date]=att.[Date] and sch.Period=att.Period
where att.[Date] IS NULL
Arse, 'sch.DayOfWork' in the final select should be 'sch.[Date]'. I changed the column names when I was playing around with solutions to avoid escaping Date all the time. Oh, and the final 'att.[Date]' is redundant (since it's NULL.)
So
So
SELECTposted by SyntacticSugar at 8:43 AM on October 9, 2009
sch.StudentID
, sch.Period AS MissingPeriod
, sch.[Date] AS DayInQuestion
--, att.[Date]
FROM PerfectAttendance sch
left join att
on (sch.StudentID=att.Studentid and sch.[Date] =att.[Date] and sch.Period=att.Period)
where att.[Date] IS NULL
Can't seem to stop poking at this.
Anyway, forgot that the EXCEPT operator was available in SQL 2005, so here's another solution that should be a little more efficient.
Create the CTE as before then:
Anyway, forgot that the EXCEPT operator was available in SQL 2005, so here's another solution that should be a little more efficient.
Create the CTE as before then:
SELECTposted by SyntacticSugar at 6:33 AM on October 19, 2009
pa.StudentID
, pa.Period AS MissingPeriod
, pa.[Date] AS DayInQuestion
FROM PerfectAttendance pa
EXCEPT
SELECT
att.StudentID
,att.Period
,att.[Date]
FROM
att
This thread is closed to new comments.
posted by mbrubeck at 12:36 PM on October 7, 2009 [1 favorite]