MySQL, Help me search tags!
September 19, 2006 1:39 PM   Subscribe

MySQL Help! I am trying to use tags on my web site. I am trying to allow users to search for tags. I want the users to do a search like "castle+moat-ducks" and return all pages that have the tags castle and moat but DO NOT have the tag duck. How can I do this? There is a single tag table, which contains 3 fields, tag_id, image_id and tag, which holds the actual text of the tag. Please help me set up the proper query! Note that I am using PHP.
posted by LoopyG to Computers & Internet (29 answers total) 2 users marked this as a favorite
 
how about...

SELECT * FROM tags WHERE (tag='castle' OR tag='moat') AND tag!='ducks';
posted by grumblebee at 2:01 PM on September 19, 2006


So you've been putting in a separate row for each tag, right? (ie. if a post has three tags, it gets three rows). If so, it's really simple. If not, you're in a little trouble.
posted by reklaw at 2:02 PM on September 19, 2006


Response by poster: Reklaw. That is correct. Each tag gets its own row, so a single image can have any number of rows. I tried the query given by grumblebee, didnt work. Also, I have been using SELECT DISTINCT * to prevent getting repeats of a single image.
posted by LoopyG at 2:05 PM on September 19, 2006


Try

SELECT * FROM tags WHERE (tag='castle' AND tag='moat') AND tag!='ducks';


grumblebee's sql would allow castle+ducks or moat+ducks
posted by MikeKD at 2:11 PM on September 19, 2006 [1 favorite]


What do you mean Grumblebee's solution doesn't work? (Except it should be tag='castle' AND tag='moat')
posted by bitdamaged at 2:13 PM on September 19, 2006


doh!
posted by bitdamaged at 2:13 PM on September 19, 2006


I tried the query given by grumblebee, didnt work.
aside from quibbling over where you actually wanted tag='castle' AND tag='moat', what grumblebee suggested sounds pretty straightforward. how does it not work?
argh. jinx.
posted by juv3nal at 2:13 PM on September 19, 2006


Response by poster: MikeKD. That doesnt work because I have each tag in its own row. So for image 93, it has tags "rock" and "bear", and this takes up two rows in the tag table.

So doing a search for
SELECT * FROM tags WHERE tag="rock" AND tag!="bear"
would return 1 result, that would be the row with imageid 93 and tag="rock".
posted by LoopyG at 2:17 PM on September 19, 2006


Response by poster: juv3nal, grumblebees query doesnt work for the same reason. It still returns imageids that contain one of the tags in the "good" category, even if it contains one of the tags in the "bad" category.
posted by LoopyG at 2:18 PM on September 19, 2006


I'm not sure how Grumblebee's query would work. The clause WHERE (tag='castle' AND tag='moat') can never evaluate to true.
posted by justkevin at 2:20 PM on September 19, 2006


The clause WHERE (tag='castle' AND tag='moat') can never evaluate to true.

Which is why I suggested OR
posted by grumblebee at 2:22 PM on September 19, 2006


SELECT * FROM tags WHERE (tag='castle' AND tag='moat') AND tag!='ducks'

This can't possibly work. Every row that has tag='castle' can't have tag='moat'. You'll get no rows back.

You need nested queries for each tag you're trying to match or unmatch, e.g.
select image_id from tags
where tag='castle' and image_id in (
  select image_id from tags 
  where tag='moat' and image_id in (
    select image_id from tags
    where tag != 'ducks'
  )
)
Large nested queries will present their own problems, but for small numbers of tags you should be OK. Also you could create an index of how frequently a tag occurs so that you can structure you nested query with least-frequent tag occurring first (the innermost query) for better performance. And of course your tag table should be indexed.
posted by ldenneau at 2:24 PM on September 19, 2006


Response by poster: justkevin, exactly, so I use OR. Later I implement the AND operator by only displaying results that appear X times, where X is the number of tags being ANDed.

So anyone have an idea for how to do this for the case of a single table with 1 tag per row?
posted by LoopyG at 2:24 PM on September 19, 2006


Best answer: The query grumblebee gives doesn't work because it's only returning images tagged with either 'moat' or 'castle', and ignoring the 'not duck' requirement entirely, since if a row in the tags table has tag IN ('moat', 'castle'), it can't also have tag = 'duck', so if the first part of the WHERE clause evaluates as true, the second part does, too.

You can do this with subqueries if your version of MySQL supports it; there might be a better way to accomplish it, but I'm drawing a blank at the moment. The upshot of this method is that it's really easy to build dynamically in your app, which I'm assuming you'll need to do.

SELECT image_id FROM tags a WHERE
EXISTS (SELECT 1 FROM tags b WHERE b.image_id = a.image_id AND tag = 'moat')
AND EXISTS (SELECT 1 FROM tags b WHERE b.image_id = a.image_id AND tag = 'castle')
AND NOT EXISTS (SELECT 1 FROM tags b WHERE b.image_id = a.image_id AND tag = 'duck')


(Ignoring the fact that you might want to consider having a tags table holding only tag_id and tag, and an images_tags table mapping image_id to tag_id)
posted by uncleozzy at 2:25 PM on September 19, 2006


It's not as simple as you guys are trying. If you have a version of MySQL that supports sub-selects, I think this should work (though I totally haven't tested it):

SELECT image_id FROM tags WHERE tag = 'castle' AND image_id IN (SELECT image_id FROM tags WHERE tag = 'moat') AND image_id NOT IN (SELECT image_id FROM tags WHERE tag = 'ducks')

This effectively does three passes over the tags table so it's not really much better than doing this in your application, which would probably be easier than trying to construct a query like this dynamically.
posted by Khalad at 2:28 PM on September 19, 2006


The query grumblebee gives doesn't work because it's only returning images tagged with either 'moat' or 'castle', and ignoring the 'not duck' requirement entirely...

Thanks. Makes sense.
posted by grumblebee at 2:29 PM on September 19, 2006


What if you reversed the order in my WHERE clause?

WHERE tag!='ducks' AND tag='castle' OR tag='moat');

Wouldn't that eliminate any ducks, but -- if it found a non duck -- allow it if it was a castle or a moat?
posted by grumblebee at 2:31 PM on September 19, 2006


ah. of course. *smacks self*
posted by juv3nal at 2:32 PM on September 19, 2006


To elaborate, grumblebee's one-pass approach would work if you had all of the tags for each image stored in one row (which is bad normalization) -- if there were a column containing space-separated tags, for example. Then you would be able to do something like:

SELECT image_id FROM tags WHERE tags LIKE '% castle %' AND tags LIKE '% moat %' AND tags NOT LIKE '% ducks %'

Do you see now why you can't do this without sub-queries or perhaps some complex self-joining, given the properly normalized table from the question?
posted by Khalad at 2:33 PM on September 19, 2006


Response by poster: Hey uncleozzy. I just tried that out on a few sample queries and it seems to work exactly how I want!

I figured it would have to be done with sub-queries, but they were a new idea to me (havent dealt with DBs in awhile) and couldnt figure out how to get that to work.

I am assuming I can simply change it to "OR EXISTS" if I want the OR operator to work?
posted by LoopyG at 2:35 PM on September 19, 2006


grumblebee: Sorry, apparently I was critiquing MikeKD's query. But yours suffers from the problem pointed out by uncleozzy.
posted by justkevin at 2:36 PM on September 19, 2006


Oh yeah, I see the problem: he wants a list of IMAGES. He wants images tagged with MOAT CASTLE and MOAT CASTLE KING -- but NOT images tagged with MOAT CASTLE DUCK.

And the DB looks like this

1 | moat | 3
2 | castle | 1
3 | castle | 2
4 | duck | 3
5 | king | 2
6 | moat | 2

(The third column is the image id.)

So if you did what I suggested...

SELECT image_id FROM tags WHERE tag !=duck AND (tag !='castle' OR tag='moat');

... you'd get the following results

1 | moat | 3 -- YES (not duck, yes moat)
2 | castle | 1 -- YES (not duck, yes castle)
3 | castle | 2 -- YES (not duck, yes castle)
4 | duck | 3 -- NO (duck)
5 | king | 2 -- NO (not moat or castle)
6 | moat | 2 -- YES (not duck, yes moat)

image_id 3 would be returned, even though it has DUCK as a tag.
posted by grumblebee at 2:44 PM on September 19, 2006


Response by poster: grumblebee, that is exactly right. Sorry I did not clarify that further.

Looking at this code a bit more, I see that it is based around an OR/ORformat.

Meaning, it returns any images which contain AND of the "good" (inclusive) tags, and that contains NONE of the "bad" tags.

IE:
rocks+ducks-hamsters-pigeons
would return anything that had
((rocks OR ducks) AND NOT (hamsters OR pigeons))
How would I go about making this AND/OR, in other words
rocks+ducks-hamsters-pigions
would return anything that had
((rocks AND ducks) AND NOT (hamsters OR pigeons))
?
posted by LoopyG at 2:52 PM on September 19, 2006


Are you sure? It looks like what ldenneau, uncleozzy, and I wrote all do ANDs, i.e.:

rocks AND ducks AND NOT hampsters AND NOT pigeons
posted by Khalad at 4:19 PM on September 19, 2006


Second the suggestion of tables like Tags(tag_id,tag) and ImageTags(image_id,tag_id). That will save storage, reducing IO and RAM requirements.

If we're not talking Flickr here, you could consider selecting image_id's with the tags you want, and then performing a set substraction against those image_id's with the tags you don't want (array_diff($wanted,$not_wanted[,..]) in PHP iirc). Naive, but should be fine with result sets in the thousands.
posted by Freaky at 4:25 PM on September 19, 2006


For what it's worth, I believe that:

* IN will be faster than EXISTS. The EXISTS query does four passes over the tags table, while the IN ones do three.
* Using NOT IN will be more efficient than !=, using the reasonable assumption that most images will not have any given tag (meaning a sub-query using "=" will yield a smaller result set than one using "!=").
posted by Khalad at 4:26 PM on September 19, 2006


SELECT p.pageid
FROM jt_page_tag AS pt
INNER JOIN t_page AS p USING (pageid)
INNER JOIN t_tag AS t USING (tagid)
WHERE t.tag IN('castle','moat') AND t.tag NOT IN('duck')
GROUP BY p.pageid
HAVING COUNT(*) = 2;

Untested.

Assumes three tables - t_page, t_tag, and a joining table, jt_page_tag. There is a many-to-many relationship between pages and tags, and the primary key of (and only contents of) jt_page_tag is the tuple (pageid, tagid).

The HAVING() at the end is optional, and limits the results to those that match ONLY the tags 'castle' and 'moat' (not ('castle', 'moat', 'drawbridge')). In this particular case it makes the "NOT IN('duck')" clause superfluous, but it's so useful (and I had to go to the MySQL mailing lists to get it) I thought I'd throw it in there anyway. You can always ORDER BY the COUNT(*) instead, which in theory will place the most specific matches (those that have only a few tags) at the top.

But to be honest, while this more-or-less works, this is really a case where you should be unnormalising your data (I've been bitten by trying to do exactly this in SQL, and this approach quickly gets slow as the tables grow).

Given the table structure above, create another table, u_page_tag. u_page_tag is identical to t_page, but has an additional column, tagids VARCHAR(255), which contains tagids, space delimited. Now you can do a search for

LIKE '% 123 %' AND LIKE '% 234 %' AND NOT LIKE '% 345 %'

as Khalad suggested. The reason I use tagids, not the tags themselves, is so you can have tags with spaces in, and because (I hope) tagids are shorter than tags. (Remember that the first and last chars of tagids should be a space, to catch "LIKE '% 123 %'" when 123 is the first tag in tagids). If you use the tags themselves, rather than the tagids, it may be faster to use MySQL's full text search (which comes with its own set of issues).

Once you have this table, write triggers on your original tables (MySQL >5.0.2), so that when you do an INSERT/UPDATE/DELETE on them, an analogous operation is carried out on u_page_tag to keep it in sync (hint: it's easier to rebuild tagids completely than try to remove a tag from it). This means that you can do your INSERT/UPDATEs on the nice, normalised data, but do your SELECTs on the fast, messy table.

If you don't have triggers, you'll have to simulate in PHP.

Wow. You can tell I've been doing this recently, right?
posted by Leon at 12:10 AM on September 20, 2006


Response by poster: Wow Leon. Thanks for that response. I am going to go with the single normalized table for now and see how far I can get with it. But its good to know that if it ever grows unruly I have your response to help me re-structure the tag tables.
posted by LoopyG at 10:07 AM on September 20, 2006


To be honest, I think you'd be better off getting everything that has castle or moat with one simple query, and then getting everything that has duck with another. Stick the id numbers you get back into two arrays, and remove any from the first array that appear in the second. Doing it in MySQL itself strikes me as unnecessarily complex.
posted by reklaw at 12:24 PM on September 20, 2006


« Older Green home products   |   ps command not working Newer »
This thread is closed to new comments.