SQL JOINs are driving me crazy
September 1, 2010 7:55 AM   Subscribe

Help me dig myself out of an SQL hole involving crazy multiple joins.

As part of the process of replacing some old code that used an incredibly slow nested select, I've ended up with a query that looks like this:

SELECT r3.id AS r3_id, r2.id AS r2_id, r1.id AS r1_id
FROM
table_r r3
LEFT JOIN (
  table_r r2
  INNER JOIN (
    table_r r1
    INNER JOIN table_d d ON r1.r_id = d.r_id
  ) ON r2.r_id = r1.parent_id
) ON r3.r_id = r2.r_id
WHERE d.d_id = 3


So in the innermost join, I'm looking for the records in table_r (copy r1) which have a relationship with a subset of records from table_d.

In the next join out, I'm looking for records in a second copy of table_r (r2) whose main index (r_id) matches the parent index (parent_id) of the records from the previous join.

Then I'm trying to do a LEFT JOIN with a third copy of table_r (r3), simply matching r_id with the r_id of the previous join. The idea of this outermost join is to get ALL of the records from table_r, but to then do the equivalent of a 'NOT IN' select by using a further condition (not yet in my query) to determine which records in r3 have NULLs for r2_id.

The problem is that the LEFT JOIN is not giving me the whole of table_r. It's giving me the same subset of records that I get without the final join - in other words, the same thing as an INNER JOIN. So whereas I'm expecting 1208 records, I get 508.

I know I must be doing something screwy, but this is starting to go beyond my SQL comfort zone, and I'm getting confused.

So please help, O clear-headed gods of complicated multiple-self-joins...
posted by le morte de bea arthur to Computers & Internet (14 answers total)
 
You may need to specify LEFT OUTER JOIN, but you don't specify which RDBMS you are using so I can't be sure. Example data would be helpful too.
posted by mkb at 7:58 AM on September 1, 2010


Consider cross-posting at StackOverflow
posted by schmod at 7:59 AM on September 1, 2010


Response by poster: Oops. Sorry, I'm using MySQL 5.1
posted by le morte de bea arthur at 8:02 AM on September 1, 2010


You have a field from d in the WHERE clause - won't the extra records have NULL there due to the LEFT JOIN? I'm not sure I am parsing your query entirely, but i think you need to add OR ISNULL(d.d_id) in your final WHERE.
posted by Dr Dracator at 8:20 AM on September 1, 2010


I've completely lost track of what you're actually trying to accomplish here. It sounds like you're (ultimately) trying to find rows in table_r who don't have "children" with matching r_id in table_d. If that's the case, you might be able to just do:
SELECT r.r_id
FROM table_r r LEFT OUTER JOIN table_d d
	ON r.parent_id = d.r_id
	AND d.d_id = 3
WHERE d.d_id IS NULL

posted by uncleozzy at 8:27 AM on September 1, 2010


Please tell us what you're actually trying tro accomplish, rather than showing us what isn't working.
posted by orthogonality at 8:31 AM on September 1, 2010


Response by poster: You have a field from d in the WHERE clause - won't the extra records have NULL there due to the LEFT JOIN?

That's my aim. Because then I want to discard all but the extra records using another condition (I want the equivalent of a 'NOT IN' select).

I'll try to simplify what I'm after as much as I can.

What I want is:

(whole of table_r) MINUS (records in table_r which are child records of other records in table_r which are linked to a set of records in table_d)
posted by le morte de bea arthur at 8:35 AM on September 1, 2010


Best answer: The problem is that the LEFT JOIN is not giving me the whole of table_r. It's giving me the same subset of records that I get without the final join - in other words, the same thing as an INNER JOIN. So whereas I'm expecting 1208 records, I get 508.

Well, naturally. In the left outer, yes, you're getting all records in r2, regardless of whether they join r2. But then you're asking for the subset of that where d.d_id = 3. That filters out all records where d doesn't equal three, in particular all rows where d is null, either because the actual d value is null, or because there's no join from r3 to r2. So you might as well replace the outer-most left join with an inner join.

Indeed, if table_r.r_id is unique, the entire outer-most join (the left join) is pointless, as you're joining (at best) each row of table_r to itself.

Perhaps you want to do this: filter on d in r2, and then do a full join to r1. That would give you all rows of table_r, with non-null duplicate columns where d is not null.

Unfortunately, you can't do a full join in MySQL.

So what you might do is take your r2, and union it with select *, null from table r a where not exists ( select b.id from (that inner r2 query) b where b.id = a.id). That's just, "give me all r2s, and then any table_r that isn't in r2." You can't just let the union distinct it for you, since you're pulling in d, which is why you have the not exists.

That means running the inner query twice, because MySQL also lacks CTEs. I'd make the inner query a view, to simplify and modularize the code, and to make testing easier.
posted by orthogonality at 8:47 AM on September 1, 2010


Best answer: What I want is: (whole of table_r) MINUS (records in table_r which are child records of other records in table_r which are linked to a set of records in table_d)

Ok, write the bit asfter the minus as a view

create view linked_children as
select r2.* from
table_r r2
INNER JOIN (
table_r r1
INNER JOIN table_d d ON r1.r_id = d.r_id
) ON r2.r_id = r1.parent_id;

(That is, children whose parents join d)

then

select * from table_r a
where not exists ( select * from linked_children b where b.id = a.id);

Note that this eliminates the children whose parents link to d, not the parents who link to d themselves.
posted by orthogonality at 8:55 AM on September 1, 2010


I don't know if I'm reading your objective right, but this seems to be correct:

SELECT * FROM table_r a WHERE a.id NOT IN
(SELECT b.id FROM table_r b INNER JOIN table_d ON b.id = d.id)

posted by drewski at 10:50 AM on September 1, 2010


(drewski: except for some weird corner cases in Oracle, not in risks being less performant than not exists; but your solution also misses the whole bit about child records.)
posted by orthogonality at 11:34 AM on September 1, 2010


(drewski: except for some weird corner cases in Oracle, not in risks being less performant than not exists; but your solution also misses the whole bit about child records.)


Ah, my bad, I read it as that all child records had related records in table_d, thus any ID that existed in both table_r and table_d were child records.
posted by drewski at 11:48 AM on September 1, 2010


Response by poster: Thanks all, especially orthogonality who has got me pretty close to where I need to be. Here's an outline of the solution so far:

First, create a view that gives me the IDs of records from table_r that have already been 'used' - i.e. their child records are linked to table_d:

CREATE OR REPLACE VIEW used_r AS
SELECT r1.r_id,
  d.d_id
FROM table_r r1
   INNER JOIN ( table_r r2
    INNER JOIN table_d d
      ON r2.r_id = d.r_id)
    ON r1.r_id = r2.parent_id
WHERE (a few conditions);

Then I do a LEFT JOIN (and an INNER JOIN, but that's less important):

SELECT (a load of stuff)
FROM (table_r r
  INNER JOIN table_g g
    ON r.g_id = g.g_id)
  LEFT JOIN used_r u
    ON r.r_id = u.r_id
WHERE u.r_id IS NULL
  AND (some conditions)
ORDER BY (some fields from table_g and table_r)

That gets me exactly the set of results I was looking for (records from table_r that aren't in the result from the view used_r), and successfully replaces the ugly nested select that was my starting-point.

The only issue left is that the final ORDER BY clause is killing the query. Ordering by two fields from table_g is fine - the query takes about 0.5s, which is acceptable. But adding three fields from table_r to the ORDER BY adds on a crazy 20 seconds.

So far I've tried ordering on just a single field from table_r, but it seems that the mere act of wanting to order by anything from table_r is instant death to the query.

Somewhere down the line there'll be a solution, although I may end up just pulling the results into an array in PHP and sorting that way...
posted by le morte de bea arthur at 4:49 AM on September 3, 2010


Response by poster: And the eventual solution turned out to be using a temporary table rather than a view.
So:

CREATE TEMPORARY TABLE used_table(u_id INT NOT NULL , PRIMARY KEY (u_id));

INSERT INTO used_table
SELECT r1.r_id,
  d.d_id
FROM table_r r1
  INNER JOIN ( table_r r2
    INNER JOIN table_d d
      ON r2.r_id = d.r_id)
    ON r1.r_id = r2.parent_id
WHERE (a few conditions);


then replace 'LEFT JOIN used_r u' with 'LEFT JOIN used_table u' in the final query.

As if by magic, a query that was taking 21 seconds now takes a perfectly lovely 0.02 seconds.

(Just in case, by some miraculous cosmic quirk, anyone ever needs this information).
posted by le morte de bea arthur at 5:51 AM on September 3, 2010 [1 favorite]


« Older crash course in the high-end life   |   What trees are native to the Jersey City region of... Newer »
This thread is closed to new comments.