How can I select this in SQL?
December 17, 2010 6:27 PM Subscribe
SQLfilter: Help me solve a problem that's driving me crazy. I'm working with someone else's data, and I've got a seemingly insurmountable select problem that no amount of googling has helped me with so far.
I've got a table of several hundred thousand rows that looks like this:
My problem is this: I need to retrieve all the records for each PHRASE_ID, but only if the AUTHOR_NAME with the highest TOTAL is a specific value. I can sort the TOTAL field with ORDER BY, so there's no problem there, but I've been wracking my brains trying to figure out how to do the rest of it.
I've pulled the data into a C# program working with a SQLite add-in, but so far the method I'm trying just runs and runs and hangs up my machine. (Which probably isn't terribly surprising; it's looking for over 13,000 phrases, and my algorithm's most likely a horrible mess.)
Can anyone give me a hand?
I've got a table of several hundred thousand rows that looks like this:
PHRASE_ID PHRASE AUTHOR_ID AUTHOR_NAME TOTAL
14201 as true as 41 ShakeQuarto 8
14201 as true as 43 Sharpham 2
14201 as true as 44 Shirley 4
14201 as true as 45 ThomasGoffe 2
14201 as true as 48 Wilkins 1
14201 as true as 49 Wilson 1
14202 we heare of 2 Apocrypha 1
14202 we heare of 7 Chettle 1
14202 we heare of 13 Fletcher 1
14202 we heare of 22 Jonson 2
14202 we heare of 25 Kyd 1
14202 we heare of 40 ShakeFolio 1
14202 we heare of 41 ShakeQuarto 1
My problem is this: I need to retrieve all the records for each PHRASE_ID, but only if the AUTHOR_NAME with the highest TOTAL is a specific value. I can sort the TOTAL field with ORDER BY, so there's no problem there, but I've been wracking my brains trying to figure out how to do the rest of it.
I've pulled the data into a C# program working with a SQLite add-in, but so far the method I'm trying just runs and runs and hangs up my machine. (Which probably isn't terribly surprising; it's looking for over 13,000 phrases, and my algorithm's most likely a horrible mess.)
Can anyone give me a hand?
I am not sure I follow exactly what you mean by "only if the AUTHOR_NAME with the highest TOTAL is a specific value". Can you rephrase that somehow?
Anyway, I think the final query would be something along these lines
posted by jclovebrew at 6:43 PM on December 17, 2010
Anyway, I think the final query would be something along these lines
select phrase.* from phrases phrase inner join ( select phrase_id, max(total) from phrases group by phrase_id) sub on sub.phrase_id = phrase.phrase_id where sub.total = [specific value]
posted by jclovebrew at 6:43 PM on December 17, 2010
Clarify: only if the AUTHOR_NAME (associated with the highest total) is a specific string, or only if the highest TOTAL is a specific number?
posted by Orinda at 6:45 PM on December 17, 2010
posted by Orinda at 6:45 PM on December 17, 2010
This is probably not optimal, but I think it may work. You may also have to proofread any syntax errors I may have made... I didn't go as far as re-creating your table and actually running this...
posted by jclovebrew at 6:54 PM on December 17, 2010
select phrase.* from phrases phrase inner join ( select phrase_id, max(total) from phrases group by phrase_id) sub on sub.phrase_id = phrase.phrase_id where sub.total = [specific value] union select phrase.* from phrases phrase inner join ( select phrase_id, max(total) total from phrases group by phrase_id) sub on sub.phrase_id = phrase.phrase_id and sub.total = phrase.total where phrase.author = '[name of author]'
posted by jclovebrew at 6:54 PM on December 17, 2010
Response by poster: Sorry, I should have been clearer--this is what comes of posting immediately after banging my head against the problem for several hours and then posting at 2 AM. For each PHRASE_ID, I only want the data if the author with the highest TOTAL is one specific person--in this case, Jonson.
For instance, in the sample rows above, I wouldn't want to retrieve any of the rows for PHRASE_ID 14201, since Jonson doesn't appear at all.
Conversely, I would want all the rows for PHRASE_ID 14202, since Jonson appears and has the highest TOTAL.
I hope that's a little clearer. It made sense in my head, but then, it's past 3 AM here.
posted by Mr. Bad Example at 7:02 PM on December 17, 2010
For instance, in the sample rows above, I wouldn't want to retrieve any of the rows for PHRASE_ID 14201, since Jonson doesn't appear at all.
Conversely, I would want all the rows for PHRASE_ID 14202, since Jonson appears and has the highest TOTAL.
I hope that's a little clearer. It made sense in my head, but then, it's past 3 AM here.
posted by Mr. Bad Example at 7:02 PM on December 17, 2010
Best answer: So the only input is the author name?
This seems to work.
posted by jclovebrew at 7:10 PM on December 17, 2010 [1 favorite]
This seems to work.
select allrecords.* from phrase phrase inner join ( select phrase_id, max(total) total from phrase group by phrase_id) sub on sub.phrase_id = phrase.phrase_id and sub.total = phrase.total inner join phrase allrecords on allrecords.phrase_id = phrase.phrase_id where phrase.author_name = ['author name goes here']
posted by jclovebrew at 7:10 PM on December 17, 2010 [1 favorite]
I'm still confused. ShakeQuarto and Shirley have higher TOTALs than Jonson, and Sharpham and ThomasGoffe have the same TOTALs as Jonson.
posted by Phredward at 7:40 PM on December 17, 2010
posted by Phredward at 7:40 PM on December 17, 2010
Phredward: it's per phrase_id.
The question the OP is trying to answer is something like "Show me all the phrases that [some specific author] has the highest frequency of".
ShakeQuarto and Shirley have higher TOTALs than Jonson, but that's for the phrase "as true as". Jonson doesn't have that phrase, so we don't include anything from that group.
If the author was ShakeQuarto, then we'd get all the 14201 rows.
If the author was ThomasGoffe, then we'd get no rows, since ThomasGoffe doesn't have the highest total for any of them.
posted by jclovebrew at 7:50 PM on December 17, 2010
The question the OP is trying to answer is something like "Show me all the phrases that [some specific author] has the highest frequency of".
ShakeQuarto and Shirley have higher TOTALs than Jonson, but that's for the phrase "as true as". Jonson doesn't have that phrase, so we don't include anything from that group.
If the author was ShakeQuarto, then we'd get all the 14201 rows.
If the author was ThomasGoffe, then we'd get no rows, since ThomasGoffe doesn't have the highest total for any of them.
posted by jclovebrew at 7:50 PM on December 17, 2010
Best answer:
Seems to work in MSSQL. Get a table with the max for each phrase (P2), then inner join against it using the phrase_id and max to get all phrase rows which are a max, then filter for your author.
posted by markr at 1:21 AM on December 18, 2010
SELECT P1.* FROM phrases P1
INNER JOIN
(SELECT phrase_id, MAX(total) as max FROM phrases GROUP BY phrase_id) P2
ON P1.phrase_id = P2.phrase_id AND p1.total = P2.max
WHERE P1.author_name = 'ShakeQuarto'
Seems to work in MSSQL. Get a table with the max for each phrase (P2), then inner join against it using the phrase_id and max to get all phrase rows which are a max, then filter for your author.
posted by markr at 1:21 AM on December 18, 2010
Response by poster: Thanks, everybody. I used the SQL above to grab all the "Jonsonian" phrases, then selected all the rows for those particular ones. So far it seems like I'm getting what I was looking for. I appreciate you keeping me from tearing out what little hair I have left. =)
posted by Mr. Bad Example at 6:37 AM on December 18, 2010
posted by Mr. Bad Example at 6:37 AM on December 18, 2010
« Older Where can I find ridiculously edgy haircut photos... | Help me find a funny video clip from the... Newer »
This thread is closed to new comments.
posted by Orinda at 6:39 PM on December 17, 2010