SQL Query Question
December 9, 2005 9:44 AM   Subscribe

SQL_Filter: I have two tables. They both share a common ID field, but one of the tables as multiple rows of that ID (it's a history table that contains revisions). I want only 1 row to be returned when I join the tables...

Active Table has IDs like 1,2,3,4, etc... Archive Table has 1,1,1,2,3,3,3,3,4,4, etc... Unfortunately, archive table does not have revision_id or something that would make my job easy.
So, when I join Active Table to Archive Table, I don't want 10 rows returned (because there are 10 entries in the archvie table) I want only 4 rows returned (because there are only 4 rows in the active table) I don't care which row in the archive table is joined. Is there a quick and dirty way to do this in SQL? Thanks (FYI, I'm doing this in Access 2003)
posted by apark to Technology (21 answers total)
 
it's non-standard, but postgresql at least has select distinct on (you have to order by those (and other?) columns, and it simply takes the first). it's dirty, but it works.

so to reduce your Archive table to single elements you do something like
select distinct on (ID) * from Archive order by (ID, x, y);
(may have syntax a bit wrong, but you get the idea) - then you can construct the join with that in a subquery.

but i don't know if access has this.
posted by andrew cooke at 10:00 AM on December 9, 2005


First, create a relationship linking the ID field in Archive to Active. Create a new query with the, Active and Archive tables. Add the "Total:" row to the design view by clicking on the summation button "∑"

Choose "Group By" for all fields (this is essentially a DISTINCT if you're familiar with that from SQL.) For those fields in Archive that aren't distinct, choose "First" from the Total: options. This will choose the first value from the joined table.
posted by Loser at 10:00 AM on December 9, 2005


You do care what row in the archive table is joined; you probably want the latest one. Is there a date in the archive table?

Your question is simple:

SELECT *
FROM active LEFT JOIN archive ON (active.id=archive.id)
GROUP BY archive.id;

The GROUP BY means something like "give me only one row for each group of rows with the same id". BUT, it is not specified which of the archive rows you'll get. So you might get the latest, you might not, with the above query. This is not what you want, I assure you. You'll want to add some extra criteria to the query to ensure you get the right archive row (if you need data from that table at all, you need the right data).
posted by jellicle at 10:02 AM on December 9, 2005


Is there a unique identifier (primary key) for the Archive Table you could sub-order by?

SELECT active_table.*, archive_table.field_name FROM active_table, LEFT JOIN archive_table ON (active_table.id = archive_table.old_id) WHERE where_clause ORDER BY active_table.sort_field, archive_table.id DESC

For my database, we have a customers table, and a customers change log. Any time a customers record is changed, the entire record is copied to the change log, and the new record replaces the single active record for that customer in the customers table. To parallel my example query, here's what I'd use:

SELECT customers . * , customer_change_log.changed_by
FROM customers
LEFT JOIN customer_change_log ON ( customers.customer_id = customer_change_log.customer_id )
WHERE customers.customer_id = 20282
ORDER BY customer_change_log.change_id DESC
LIMIT 0 , 1

(Incidentally, testing this query, I forgot the WHERE clause, and the server nearly tanked, trying to run a massive JOIN query on 656,000 records and 66,000 change log entries.)

My customer id is 20282, so that was my WHERE clause.

This was done in MySQL, but worked just like you wanted. Use the archived table's primary key descending, assuming it's auto incremented. If you have a timestamp field that's set on update or insert, that could work too.
posted by disillusioned at 10:02 AM on December 9, 2005


disillusioned's query above only works if you only want one record. The LIMIT statement effectively cuts off the extra rows from the query, but that's not the right way to do it in the general case (where you might want multiple rows from the active table, but only one row from the archive table for each one in the active table).
posted by jellicle at 10:12 AM on December 9, 2005


You want a Left Join, as jellicle says. You can do this in access with the graphical interface by adding the tables to a new query.

If the relationship between the tables is already established, a line will appear between the two tables. If not, establish the relationship by clicking on the id field in the active table and dragging it over to the id field in the archive table.

Then, right click on the black relationship line, and highlight the "Join Properties" line. You'll have a dialog box that gives you 3 choices: 1) where you include only rows where the joined fields are equal in both tables, 2) where you include all records from one of the tables (the left one) and only those that match in the right one, and 3) where you inlclude all records from the other table (the right one) and only those that match in the left one.

If I understand your question right, you should be able to figure out which of those will give you what you want. My understanding is that, if you choose the left join, say, it will return whichever record in the archive table happens to occur first. If the records aren't sorted, that could be any record, so you may want to specify a sort on a date field, for instance so you'll get the most recent field from the archive table.
posted by jasper411 at 10:39 AM on December 9, 2005


Here's an example:
test=# \d askme1                           Table "public.askme1" Column |  Type   |                       Modifiers                        --------+---------+-------------------------------------------------------- id     | integer | not null default nextval('public.askme1_id_seq'::text) name   | text    | Indexes:    "askme1_pkey" PRIMARY KEY, btree (id)test=# \d askme2                             Table "public.askme2"   Column    |  Type   |                       Modifiers                        -------------+---------+-------------------------------------------------------- id          | integer | not null default nextval('public.askme2_id_seq'::text) askme1_id   | integer |  description | text    | Indexes:    "askme2_pkey" PRIMARY KEY, btree (id)test=# select * from askme1; id | name  ----+-------  1 | steve  2 | chris  3 | nate(3 rows)test=# select * from askme2; id | askme1_id | description ----+-----------+-------------  1 |         1 | test 1  2 |         1 | test 2  3 |         1 | test 3  4 |         1 | test 4  5 |         2 | test 4  6 |         2 | test 3  7 |         2 | test 4  8 |         2 | test 1  9 |         3 | test 1 10 |         3 | test 2 11 |         3 | test 3 12 |         3 | test 4(12 rows)
Now that we have all that out of the way, here's what you want to do:
test=# SELECT * FROM askme1 AS atest-# JOIN askme2 AS b ON (a.id = b.askme1_id)test-# WHERE b.id =test-# (SELECT c.id FROM askme2 AS ctest(# WHERE c.askme1_id = b.askme1_idtest(# ORDER BY c.id LIMIT 1); id | name  | id | askme1_id | description ----+-------+----+-----------+-------------  1 | steve |  1 |         1 | test 1  2 | chris |  5 |         2 | test 4  3 | nate  |  9 |         3 | test 1(3 rows)
As you can see though, this depends on you having 1) a unique way to identify the rows in askme2, and b) a unique way to sort these rows. Without those conditions, I don't think it's possible to solve your problem in a single SQL statement.
posted by sbutler at 11:01 AM on December 9, 2005


Changing it from LIMIT 0,1 to LIMIT 0,4 should still work, for my query... The ORDER BY is based on the ID being matched from the active table.
posted by disillusioned at 11:42 AM on December 9, 2005


andrew cooke writes "it's non-standard, but postgresql at least has select distinct"


The DISTINCT quantifier (and its converse, the ALL quantifier) isn't non-standard; are part of SQL-92 (and ANSI SQL), but SQL-92 at 6.5 (set function specification), states that under Intermediate Level compliance, :
If a <general set function> specifies DISTINCT, then the <value expression> shall be a <column reference>.
But I think Andrew Cooke is correct that "SELECT DISTINCT ON" is non-standard, as it allows arbitrary expressions, which goes beyond the Intermediate Level compliance requirements quoted above. Sybase T-SQL allows DISTINCT to be applied to a column list.

And Access adds the non-standard keyword DISTINCTROW to specify that the whole row is distinct.

Jellicle's left outer join makes sure that Active table rows that have no corresponding Archive table rows are returned, but it does nothing to limit join to a single archive row. Strictly, the Group By clause should only reference columns named in the select list; Sybase T-SQL relaxes this restriction, I don't know if Access does.

So what we need is the left outer join, to get rows not represented in archive, and DISTINCTROW to get only one row per id. I don't have a copy of Access to test this on, so you'll have to try tis out yourself:
select DISTINCTROW * from Active a LEFT OUTER JOIN Archive b ON( a.id=b.id )
posted by orthogonality at 12:31 PM on December 9, 2005


orthogonality: but won't each row be distinct because (presumably) the rows in his archive table are different? That is, if Archive looks like my askme2 table then most of the rows are distinct.
posted by sbutler at 12:53 PM on December 9, 2005


sbutler writes "orthogonality: but won't each row be distinct because (presumably) the rows in his archive table are different?"

If so, yes. My understanding from the question was the Archive rows were not distinct. If they are, the solution is to select the right archive row (for each id). But the questioner, apark, hasn't told us what the Archive table looks like.

Ok, so presumably the Archive is (conceptually, you probably can't actually do this in Access) populated by a pre-update trigger on Active, to preserve that row's Active pre-update state. (insert into archive old.*) And so, presumably apark wants the most recently added archive row for that id.

In which case, apark is screwed, unless archive contains a synthetic key (and that's chancy unless you can demonstrate synthetic keys are always given out monotonically) or a timestamp.

Because there is no point at all in returning some arbitrary Archive row, because nobody will know what that data means and much worse they'll assume it was the last pre-update state, which it won't be, which will screw things up royally. ("Huh, Active.salary is $100000 and Archive.salary is $20000. Was his last raise +400%? No, $20000 was his salary twenty years ago, he's had 19 raises and ten promotions since.")

If this is the case, apark needs to do a table redesign post-haste, and tell his boss to tell whomever has fiduciary responsibility that the data needs to be reconciled by hand.
posted by orthogonality at 1:37 PM on December 9, 2005


I agree with orthogonality, apark. Unless your question is formatted improperly, your biggest problem is atrocious table design.
posted by sbutler at 1:52 PM on December 9, 2005


Yeah. The number one rule in databases is, if you lie to the database, it'll lie right back.

You told the database that order was unimportant for archive. That was a "lie". Now when you ask you database for the archive record, it "lies" and gives you multiple records.

There's a really great but obscure database design book (it's not about SQL design, but a "semantic" database) that basically approaches database design as a Socratic search for the truth. and that's how you have to do your design if you don't want it to bite your ass later. You have to figure out, for your situation, "what is truth" and then how precisely you need to model that truth.


Example: sbutler lives somewhere. What is the truth about where he lives? Can we successfully approximate where he lives with a street address? Or do we need a latitude-longitude-altitude tuple?

The answer is, what do we need to do with that information? Do we have to mail sbulter a letter, or find the precinct where he votes, or do we have to route a cell phone call to his home, or do we have to beam him up to the Starship Enterprise? It may be true sbutler lives at 123 Main Street, but that "truth" only suffices to mail the letter -- for the other operations it's either insufficiently precise or we lack data to join to an address.

And it certainly won't suffice for a background check, as where he lives is (implicitly) where he lives now. And it may or may not work for taxing sbulter, as it may or may not be his legal residence. And even if we know it is his legal residence, that still won't suffice for finding where he votes, because we don't know if he's been a resident long enough to qualify to vote.

Ok, say we figure all that out. We add sbutler to our database along with a sufficiently truthful and precise address for sbultler. So now we have one more address in out database than we did before, right?

Wrong. sbutler just moved in with his girlfriend. We have the same number of addresses as we had before, because the girlfriend and her address were already in the database. And for some things, we want to consider people and addresses (two people, one shared address) but for other things, we want counts of households. So (in this case) the address can't be attribute data, it's got to be referential, or we have the same problem apark has today.


Database design is about modeling truth, and it's difficult, because you have to figure out what models truth, for each different requirement. At the same time, you can't model so precisely you capture all aspects of the truth -- that world make the database as "big" as the world being modelled. So you have to find the right truth.
posted by orthogonality at 2:33 PM on December 9, 2005


Holy cow, I've never seen such a simple question over complicated as much. For what it's worth, here's the SQL (from Access) for the query design I mentioned in my original post:

SELECT Active.id, Active.name, First(Archive.name) AS FirstOfname, First(Archive.date) AS FirstOfdate
FROM Active INNER JOIN Archive ON Active.id = Archive.id
GROUP BY Active.id, Active.name;

Access also has a Last() function in case you want, I dunno, the last record from Archive which matches an id in Active...
posted by Loser at 3:30 PM on December 9, 2005


Loser writes "Holy cow, I've never seen such a simple question over complicated as much. For what it's worth, here's the SQL (from Access) for the query design I mentioned in my original post.... First(Archive.date)"

Oh bloody hell. If there's s date in the Archive, he just does a having date = min(date) group by archive.id, archive.date

The problem with "First" and "Last" is that their results are entirely arbitrary unless there is a date or a synthetic key or a (clustered) index. Arbitrary, as in, you might well get one result one time, and another result the next time the query is run.

What are you? A Loser? ;)
posted by orthogonality at 6:55 PM on December 9, 2005


Does Access have the ROWID pseudo-column? If so, you can do something like:

SELECT active.*, archive.*
WHERE archive.pkey = active.pkey
AND archive.rowid = (
SELECT MAX(rowid) FROM archive2
WHERE pkey = active.pkey)
posted by SPrintF at 7:35 PM on December 9, 2005


If and only if rowid is monotonic.
posted by orthogonality at 7:46 PM on December 9, 2005






orthogonality writes "There's a really great but obscure database design book (it's not about SQL design, but a 'semantic' database) that basically approaches database design as a Socratic search for the truth."

Ooh, can you share the title, please? (Sounded like Kent's Data and Reality... except for the 'obscure' description... so now I'm burning with curiosity.)
posted by Pigpen at 8:45 PM on December 9, 2005


If and only if rowid is monotonic.

This is only a requirement if your goal is to retrieve the last "archive" row created. If any "archive" row will do (which appears to be the case here), then the only requirement is that ROWID be unique for a value of the primary key.
posted by SPrintF at 11:05 PM on December 9, 2005


Pigpen writes "Ooh, can you share the title, please?"

This was many years ago, before I ever did anything with databases professionally. I found it on a remainders table and just read it for fun.

Ok, I think this is it: J. Patrick Thompson, Data With Semantics: Data Models and Data Management.


SPrintF writes "If any 'archive' row will do"

Yes, but semantically, what's the meaning that any row of archive will do? That for any active key, the projection (columns) of archive we want should be the same. If they are not, what are we getting?
posted by orthogonality at 6:38 AM on December 10, 2005


orthogonality writes "J. Patrick Thompson, Data With Semantics: Data Models and Data Management.


Thanks! I just bought a copy for $1 off ebay. Not to be totally greasy ;-), but I think you write great data modelling/database-related posts, so I'm interested in any book that might have brought to bear upon said posts.
posted by Pigpen at 5:16 PM on December 10, 2005


« Older Walking Wounded   |   DMV help in New York? Newer »
This thread is closed to new comments.