What's the best way to set up a tagging system?
April 24, 2006 10:46 PM   Subscribe

Database experts: I'm setting up a tagging system for a site I run. What's the most efficient way to go about it?

As part of the next version of the mp3 hosting site I run, I want to give users the ability to tag each mp3 they upload with various keywords that people can search later on down the track, much like we can do here with threads. I'd like to know what the general consensus is on the best method of setting something like this up, and how best to implement it.

I thought of making a table, say 'tags', and in that have one new record for each tag a file has. For instance, a user uploads an mp3 and the site assigns it the id '12'. The user selects, from a predetermined list, some tags that best describe the file he's uploading: 'comedy', 'diy', and 'modern french literature'. Three new records are created in the 'tags' table which correspond to those three tags, ie:


id | tagid | fileid
-------------------
1 comedy 12
2 diy 12
3 mfl 12

As more files are added and tagged, records are created for those files too, and eventually we have a 'tag' table with a trillion records which we can interrogate at our leisure.

Is this the best / most efficient way to do this?
posted by Savvas to Computers & Internet (11 answers total)
 
Presuming fileid is a foreign key from some table with details about the file, it has possibilities.
posted by paulsc at 11:04 PM on April 24, 2006


No, because you'd be duplicating tag names in that table. Normally, you'd have a table of tags and a "join table" that associated the tags with the files:

create table files (
  file_id serial primary key,
  file_name varchar(128),
  file bytea -- or whatever your DB calls it
);

create table tags (
  tag_id serial primary key,
  tag_name varchar(128)
);

create table file_has_tag (
  file_id integer references files on delete cascade,
  tag_id integer references tags on delete cascade
);


The "tags" table becomes your "predetermined list" of tags. To get all the files with some tag, you do, for example:

SELECT files.* FROM files JOIN file_has_tag USING (file_id) JOIN tags USING (tag_id) WHERE tag_name = 'comedy';

You'll want to index the file_has_tag table on both file_id and tag_id.
posted by nicwolff at 11:24 PM on April 24, 2006


Keep in mind that depending on the database, a JOIN can be more expensive than alternate ways.

Some databases in particular have full-text search capabilities that will make their own keyword tables. If your database has this, it would be worth seeing if this was faster than the JOIN & corrisponding SQL.
posted by devilsbrigade at 11:30 PM on April 24, 2006


This similar AskMe on setting up a tagging system may be handy.
posted by MetaMonkey at 11:32 PM on April 24, 2006


Go with 2 tables as you originally outlined.

With 1 table and a Tag field containing a delimited list of tags, your queries will either be slow, or just look weird. You could make it work, but it probably isn't a great idea.

With 3 tables you're basically saying that every time a user adds a file and tags it, your process is going to lookup the tag in the Tags table to see if it already exists. If it exists, it will create a record in the association table using the key of the existing Tag record and the new tagged-item. If it doesn't exist, the process'll insert the tag into the Tags table and create a record in the association table using the key of the newly inserted tag. (I hope that makes some sense).

Anyhow, the main difference between 2 tables and 3 tables is one of performance vs. storage space (respectively). 2 tables will have slightly better performance overall and 3 tables slightly better storage. I say slightly because "Tag" data is really quite insignificant (varchar(50) perhaps?) and generally won't affect either parameter.

One thing to consider is that a Tag's meaning is the Tag itself. There isn't any other data really associated to the tag. Unless you see a point in having each Tag in the system defined as a unique entity, then there's no point for the 3rd table other than satisfying certain aspects of academic correctness.
posted by C.Batt at 12:06 AM on April 25, 2006


GAH. Okay, I skipped the part where you mentioned "pre-determined list" of tags. If you want present a pick-list to the user, go with 3 tables as it'll ease the creation of the list. If you want to let them assign tags "free form" go with 2 tables.

/ d'oh!
posted by C.Batt at 12:15 AM on April 25, 2006


Rather than having the tags in a database of their own, I have them held in an array within the script itself. When I want to extract information from the tag table for a particular file, I'm doing this:

select * from tags where fileid=$id_of_file;

and then just parsing the information that gets returned.

I don't know anything about joins or references or anything like that, my SQL knowledge is relatively basic and limited to putting stuff in and getting stuff out, so I have no idea if this is a more efficient way of doing things.

I'm intrigued by the meaning of "on delete cascade" that you mention in your post, nicwolff. Does this imply that when a particular record gets removed from the file table, its associated tags are removed automatically also? If so, that's fucking great and I need to do that.
posted by Savvas at 12:49 AM on April 25, 2006


Oops, I meant tags in a table, not database, of their own.
posted by Savvas at 12:49 AM on April 25, 2006


IIRC, it means that if a record is removed from the file table (or the tags table, because cascade deletion is set for both), the records in the file_has_tags table that reference it also get removed.

You can still have files with no tags, or tags with no files, but you won't get stale records in your file_has_tags table that point to nonexistent files or tags.
posted by flabdablet at 7:45 AM on April 25, 2006


It's worth pointing out that if you have a predetermined list, what you're doing isn't really "tags" -- it's categories, or sections. The whole point of tags is that they are freeform.
posted by reklaw at 10:35 AM on April 25, 2006


Yeah but if you don't tell them then I won't, reklaw.
posted by Savvas at 8:04 PM on April 25, 2006


« Older Who authored these sci-fi short story anthologies?   |   Logistics & College Newer »
This thread is closed to new comments.