Finding rows that don't exist in an SQL Server 2005 table
October 7, 2009 12:24 PM   Subscribe

I can't seem to build a query in SQL Server 2005 that returns rows that are not in a table.

I have a table called sch that includes student IDs and the periods for which the students are scheduled:

id    period
111   1
111   2
111   3
111   4
111   5

Table att includes student IDs and the dates and periods they attended:

id    date         period
111   2009-10-01   1
111   2009-10-01   2
111   2009-10-01   3
111   2009-10-01   4
111   2009-10-01   5
111   2009-10-02   1
111   2009-10-02   2
111   2009-10-02   3
111   2009-10-02   5
111   2009-10-03   1
111   2009-10-03   2
111   2009-10-03   3
111   2009-10-03   5

In this example, the attendance for student 111, period 4, is missing for 2009-10-02 and 2009-10-03. How can I build a query to get that result? Ideally, the query would return:

id    date         period
111   2009-10-02   4
111   2009-10-03   4

I've tried a left join on the tables between the IDs and periods but it's not getting me what I need. Any help is appreciated.
posted by DakotaPaul to Computers & Internet (17 answers total) 1 user marked this as a favorite
 
You could use the technique from this StackOverflow answer to generate a "table" that contained every date in a given range. Then join that table to "att" and "sch" and select where att.id is null.
posted by mbrubeck at 12:36 PM on October 7, 2009 [1 favorite]


You're probably going to need a helper table for this that lists the dates that were eligible for attendance. Let's say you had one with a column called "date", you could do

select
sch.id, helper.date, sch.period
from
sch
cross join datehelper
where
not exists (select 1 from att where att.id = sch.id and att.period=sch.period and att.date = datehelper.date)

If you're just starting out, though, you might want to look into a different schema that made the existence of a class more of a first class concept.
posted by Bezbozhnik at 12:38 PM on October 7, 2009


select q.id,q.date,q.period from (select distinct sch.*, date from sch, att) q left join att on q.id=att.id and q.period=att.period and q.date=att.date
where att.period is null


will give you the results you want, but won't necessarily work in general, since it assumes that any day that any student was in attendance, every student is expected to be there for all their assigned periods. So something like a Saturday class just for certain students will show all the other students as being absent that day. Or in the case where every student is absent for a day (unlikely but technically possible) that query won't show any of them as being absent.

What Bezbozhnik said above would make sense if you can get a helper table for dates that should be looked at.
posted by reptile at 12:50 PM on October 7, 2009


I'm trying reptile's suggestion first, and I think am getting what I want but the query is taking a couple minutes to run. Is this query
select distinct sch.*, date from sch, att
essentially a cross join? The sch table has ~900 rows and att has ~45,000. I think that's what's taking so long. Fortunately, I'll rarely need to run this.
posted by DakotaPaul at 2:41 PM on October 7, 2009


break that query down a little further:
select a.id, a.[date], a.period from 
(select sch.id, dates.[date], sch.period from (select distinct [date] from att) dates, sch) a
left join att on a.id = att.id and a.[date] = att.[date] and a.period = att.period
where att.period is null


posted by jockc at 3:06 PM on October 7, 2009


Seems to me that the difficulty is caused by having the same name for two different things. If the "period" column of sch was called "period_scheduled" and the "period" column of att was called "period_attended" then you could just do a cartesian product of both tables and select rows where period_attended is null and period_scheduled isn't. Or am I missing something vital?
posted by flabdablet at 5:10 PM on October 7, 2009


Seems to me that the difficulty is caused by having the same name for two different things. If the "period" column of sch was called "period_scheduled" and the "period" column of att was called "period_attended" then you could just do a cartesian product of both tables and select rows where period_attended

That shouldn't affect anything, you can refer to the rows as sch.period and att.period.
posted by delmoi at 5:43 PM on October 7, 2009


You're quite right; I'm an idiot.
posted by flabdablet at 5:58 PM on October 7, 2009


When I first saw this question, my immediate reaction was: wtf? How can anybody expect a database to return results that aren't in it? and it now seems to me that the instinct behind that reaction is basically sound.

The trouble is that you're trying to run a query for absences against a database that's only been built to record attendances. There simply isn't enough information in the database to let you do that reliably. An absence is not simply the inverse of an attendance - it's the lack of an expected attendance, and at the moment, you don't have a way to specify those expectations completely.

Assuming that most students will attend most of their scheduled periods, your data would probably get much smaller if you recorded only absences, and kept a separate roll table with date and period columns to let you back-construct attendance data: if the roll was marked for a given period on a given date, then all students scheduled to attend that period and not marked absent have attended.
posted by flabdablet at 6:53 PM on October 7, 2009


Thanks for all the comments, everyone. Unfortunately, I'm stuck with this schema and method of recording attendance. Admin staff at the school sites want to know what teachers are not recording attendance at all, so I'm left to look for data that doesn't exist.
posted by DakotaPaul at 7:02 PM on October 7, 2009


Yes, the comma syntax is a cross join. Apologies for the messy query, I just dashed it off quickly. It's probably going to run slowly just because the database schema doesn't easily lend itself to doing what you're trying to do, unfortunately.
posted by reptile at 7:04 PM on October 7, 2009


You need to extend your schema with additional tables that would describe precisely why you know the following statement is true:

"In this example, the attendance for student 111, period 4, is missing for 2009-10-02 and 2009-10-03."

How do you know that? What logic are you performing in your head to know that period 4 is missing? Why isn't period 14,287 also missing?
posted by odinsdream at 7:20 PM on October 7, 2009


Period 14,287 isn't missing because it doesn't occur anywhere in the sch table.
posted by flabdablet at 8:43 PM on October 7, 2009


I just tried out jockc's solution using sqlite, and it appears to work as intended, and it should run fast as well: it's based on a cross join between the distinct dates from att and all of sch, which (assuming close to full attendance) will be about the same number of rows as att itself rather than a 45000 x 900 row monster.
posted by flabdablet at 10:52 PM on October 7, 2009


The following uses a Common Table Expression with a slightly different approach than jockc. It generates a CTE with the expected attendance and compares that to the actual attendance and pulls out the missing periods.
;WITH PerfectAttendance (StudentID, Period, [Date]) AS(
SELECT
sch.StudentID
, sch.Period
, att.[Date]
FROM sch
inner join att
on sch.StudentID=att.Studentid
group by
sch.studentid
, att.[Date]
,sch.period
)


SELECT
sch.StudentID
, sch.Period AS MissingPeriod
, sch.DayOfwork AS DayInQuestion
, att.[Date]
FROM PerfectAttendance sch
left join att
on sch.StudentID=att.Studentid and sch.[Date]=att.[Date] and sch.Period=att.Period
where att.[Date] IS NULL
posted by SyntacticSugar at 8:37 AM on October 9, 2009


Arse, 'sch.DayOfWork' in the final select should be 'sch.[Date]'. I changed the column names when I was playing around with solutions to avoid escaping Date all the time. Oh, and the final 'att.[Date]' is redundant (since it's NULL.)
So
SELECT
sch.StudentID
, sch.Period AS MissingPeriod
, sch.[Date] AS DayInQuestion
--, att.[Date]
FROM PerfectAttendance sch
left join att
on (sch.StudentID=att.Studentid and sch.[Date] =att.[Date] and sch.Period=att.Period)
where att.[Date] IS NULL
posted by SyntacticSugar at 8:43 AM on October 9, 2009


Can't seem to stop poking at this.
Anyway, forgot that the EXCEPT operator was available in SQL 2005, so here's another solution that should be a little more efficient.
Create the CTE as before then:
SELECT
pa.StudentID
, pa.Period AS MissingPeriod
, pa.[Date] AS DayInQuestion
FROM PerfectAttendance pa
EXCEPT
SELECT
att.StudentID
,att.Period
,att.[Date]
FROM
att
posted by SyntacticSugar at 6:33 AM on October 19, 2009


« Older Planning for Las Vagas   |   What jobs are outhere for Bachelor of Arts and... Newer »
This thread is closed to new comments.