Help with SQL unions needed
May 10, 2006 12:06 AM   Subscribe

Question for the SQL ninjas in the house. Say you're grabbing data from a bunch of similar tables in a big union. Now say you want the last ten items added anywhere to the database, from any one of those tables. What's the most efficient way to do that?

I've got a working query to pull up similar items from four tables that have similar items. That works great, but I'm concerned when there will be queries pulling hundreds of items from the table, so I'd like to grab say, just the last two items. But how best to do that?

Plain english view of the current query (at the end, it is ordered by date descending to get the most recent stuff at the top):

select stuff from table 1
union
select stuff from table 2
union
...etc

And what I really want in plain english is this:
select top 20 (
select stuff from table 1
union
select stuff from table 2
union
...etc
)

Here's what I've tried and it doesn't work so well:
select top 20 stuff from table 1
union
select top 20 stuff from table 2
union
...etc

It's a horrible waste and the order is all wrong in the output. I'm using Microsoft SQL Server 2000, and nothing I've found online points to working with TOP n records in a union setting.

(this is for MeFi by the way and solving it will help get a major new feature out the door sooner)
posted by mathowie to Computers & Internet (22 answers total)
 
I've posted your question to my SQL guys but they all appear to have gone to bed already. Seriously, what kind of serious geek goes to bed this early. I'll try and collect their answers for you in the morning. Well, noon.
posted by Dunwitty at 12:34 AM on May 10, 2006


Call me stupid, but can't you put your union in a subselect? Even SQL Server gives you subselect, yes?

select [top 20 foo] from
(
select blah ...
union
select blah ...
union
select blah ...
) order by [something]

Dunno about waste, but logically the dbms cannot "know" the top 20 of the whole set without looking at the whole set. If it presents you with the top 2, it still has to order all the rows to identify those two, even if only 2 rows are returned to you.

Or in other words, what do you mean "it's a horrible waste"?
posted by i_am_joe's_spleen at 1:22 AM on May 10, 2006


Select into a temporary table and then select the top 20 from the temp table? That's how I'd do it in sybase (and SQL Server is basically a mnodified version of sybase).
posted by hardcode at 1:48 AM on May 10, 2006


in oracle, it'd be something like this:
select * from (
  select * from a where rownum < 10 order by datebr>
  union
  select * from b where rownum < 10 order by datebr>
) where rownum < 10 order by datebr>

posted by beerbajay at 2:05 AM on May 10, 2006


woah that got nicely mangled. i think you get the idea.
posted by beerbajay at 2:05 AM on May 10, 2006


If this were my project, and this kind of query was likely to happen often, I'd spread the load over all the operations that added data to my tables instead of doing it all on retrieval.

So I'd have a global counter called something like updates-counter, and I'd add a column for it to every table. Every time I changed anything in any row in any table, I'd copy in the present value of updates-counter and auto-increment it.

Then whenever I wanted the last N items added to my database I'd get the present value of updates-counter, subtract N from it, and select on >= that for every table in my union before unioning them.

I'm not fluent in SQL so it would take me a while to figure out the syntax I needed, but I think the basic approach would work.
posted by flabdablet at 2:08 AM on May 10, 2006


beerbajay: are you sure?

I thought the way to get the top N rows was:

select * from (select * from a order by datebr) where rownum
since otherwise all you will be doing is ordering the first 10 rows that the SQL query happens to return...

posted by vacapinta at 2:13 AM on May 10, 2006


Do these items have auto-incrementing IDs?

I'm seeing something like, pseudocode:
get the highest ID number in table 1
get the ten items descending from highest ID to (highest ID - 10)

posted by AmbroseChapel at 2:25 AM on May 10, 2006


mathowie posted "Now say you want the last ten items added anywhere to the database, from any one of those tables."

Your question implies there is a single ordering (e.g., a timestamp column in each of the four tables) over all four tables. If so, there's your order-by clause.

Add the order-by to your second attempt.

How your database engine interprets which query or sub-query "top" "limit" or "order by" attach to, depends on the individual database (and version) and how it parses SQL. You'll have to experiment with this. Sybase 11.x and prior would use the "order by" attached to the first clause in the union, and ignore others; MS=SQL Server descends from Sybase, so it wouldn't surprise me to see the same thing. You'll probably have to disambiguates this with subqueries, as i_am_joe's_spleen and beerbajay suggest.

----

Regardless of how you do this, it's going to be wasteful. It's wasteful because the database design doesn't match the reality it is supposed to be modeling, another example of "if you lie to the database it will lie to (or act inefficiently for) you".

Specifically, your query wants to treat the four tables as containing "the same" homogeneous thing -- one table of one entity type --, but the database design requires the database to treat it as four heterogenous things, modelled by four different tables.

Either all the entities should be in one table (if they really are the same thing), or you should be using a sub-typing relationship, in which part of the entity is represented in one homogeneous table (which also establishes a common synthetic key), and the differing attributes are represented in sub-typing tables joined on the base table key. Then you could do your whole query off the common base table, easily coded and efficiently as well.

----

You may find that a usenet newsgroup specific to SQL or to your preferred SQL dialect gives you faster or better answers.
posted by orthogonality at 2:27 AM on May 10, 2006


Create a view with your union statement inside, then select from that view.
posted by bloo at 2:43 AM on May 10, 2006


bloo writes "Create a view with your union statement inside, then select from that view."


Unfortunately, MS SQL Server 2000 only offers partial support for F081 (SQL-99 Feature F081, Union and Except in views).
posted by orthogonality at 2:51 AM on May 10, 2006


You seek something that cannot be done. All modern databases implement the union as parallel queries that run independant of each other. The only way to apply ordering and filtering is on each individual query (your third example) or after the results are concatenated (your first example).

bloo is correct that creating a view with an associated index will work if your database supports it.

The most efficient way to go about it is to retrieve the top 20 from each table in the union and post-process that list down to your real top 20. Sorry.
posted by Lame_username at 4:10 AM on May 10, 2006


vacapinta wrote: "otherwise all you will be doing is ordering the first 10 rows that the SQL query happens to return..."

You are correct. The rownum filter would be applied prior to the sort operation.
posted by Lame_username at 4:13 AM on May 10, 2006


Unfortunately, MS SQL Server 2000 only offers partial support for F081 (SQL-99 Feature F081, Union and Except in views).

Du you happen to know what exactly "partial support" in this case means? I'm using SQL Server 2000 right now and I never had a problem with union querys in views.
posted by bloo at 5:12 AM on May 10, 2006


bloo writes "Du you happen to know what exactly 'partial support' in this case means? "

No. And that's the rub -- there's no simple way of knowing what might cause a view with unions to break down. (Though given your evidence, I'll guess the problem is with the except keyword -- but again, the point is, we don't know, so we can't code it confidently.)

But the larger point is, matt apparently wants to show users' last 20 posts in all four subdomains. With the right db design, adding a new subdomain should be as easy as adding a row to a table of subdomains; with the right code, adding that row would cause the link to the new subdomain to appear in every page header, right after the Metatalk link. Instead it requires creating a (duplicate) table and copying (duplicative) code. With one table for posts rather than four, finding the last 20 posts is trivial; with four tables, it's a pain and it requires rewriting existing code to add a new subdomain. (What happens when matt adds Music.metafilter.com? He must rewrite this "last 20" query and a whole bunch of others.) Good, thorough, abstracted design saves lots of "clever" coding and maintenance.
posted by orthogonality at 6:00 AM on May 10, 2006


Oh, I just now read the "this is for MeFi" part.

What you said about the abstracted design is usually absolutely true.
We don't know the details of the database design, though. Maybe the tables are all different and there is a good reason for that, but just know he wants to have all the latest posts from all the tables and nobody had thought of that earlier.
posted by bloo at 6:09 AM on May 10, 2006


bloo writes "Maybe the tables are all different"


Well, they're not all different, because matt can do a union of some subset of the columns in the four tables. And they're similar entities simply because matt believes they can be sensibly union'd .

In all likelihood, they're either the same or they have a sub-typing relationship. (Postgresql adds a keyword and some syntactic sugar to make sub-typing easier, but sub-typing can be done in any SQL-92 complaint database using triggers and relational integrity constraints.)
posted by orthogonality at 6:19 AM on May 10, 2006


Yes, you are of course correct, it could have been avoided with a proper database design.

I doubt that matt is willing to change it now, though :)
posted by bloo at 6:22 AM on May 10, 2006


Response by poster: Yeah, it's true that I really should have made one master posts table and one master comments table ages ago, but instead I have a table for each subsite. I don't know what I was thinking way back when.

Anyway, to further explain, it's for the favorites page. Here's mine, which works great pulling a bunch of posts I marked as favorites to test it out.

The problem is I want pagination for it so you can see maybe 10-20 threads per page. As a test, I set it to three on the pagination but the order of returned results still seems messed up and I won't get the most recent at the top (I have 7 mefi threads marked and 1 metatalk thread marked, so grabbing the top three from each should work but the ordering is whack even though I'm ordering by an added-to-favorites date desc.)
posted by mathowie at 7:04 AM on May 10, 2006


Response by poster: Hmm, the second example won't work because any subselect that has an order by must have a TOP n added to the query. I'm trying the third query, but I still can't quite get the right results from it.
posted by mathowie at 7:11 AM on May 10, 2006


Response by poster: I solved the problem finally. I couldn't get my third query to work and filter correctly so I ended up ditching all the subqueries and instead using CF on the SQL output to filter based on pagination of the recordset. I usually do that in SQL, but apparently sql server is kind of weird about sorts in big union queries.

Thanks all for the tips here, I kept running into weird sql syntax errors and didn't understand why ms sql only supported some of the basic union syntax
posted by mathowie at 8:28 AM on May 10, 2006


mathowie writes "Yeah, it's true that I really should have made one master posts table and one master comments table ages ago, but instead I have a table for each subsite. I don't know what I was thinking way back when."

It's relatively simple to merge all four into one table: your queries don't have to change at all if you write four views to replace the old tables, and your inserts/updates just require an additional parameter, the id of the subdomain. If you don't do this explicitly, you'll be doing it implicitly every time you run your query (the tables will be union'd in memory each time, which will be performance intensive).


mathowie writes "I usually do that in SQL, but apparently sql server is kind of weird about sorts in big union queries."

It's not weird: before it can do the sort, it has to retrieve (into memory, essentially a temp table) all the records from all four tables, then apply the sort (or, yes, retrieve one and do insertion sort against it for each of the next three, but that seems unlikely). If by weird you meant it took a long time, that's why. If by weird you meant the sort order was incorrect, that's because your syntax applied the sort only to one of the selects, and not to the union'd set.

On MS SQL Server 2005, using a CTE sorts the result correctly, but seems more awkward than writing a view:

with foo( n, t ) as (
select number, 1 from address
union all
select first_name, 2 from person
union all
select last_name, 4 from person
union all
select name, 3 from street
)
select top 10 n, t from foo
order by 1 desc
posted by orthogonality at 12:31 PM on May 10, 2006


« Older Eight more years of unemployability   |   surround sound for my Mac Mini? Newer »
This thread is closed to new comments.