How does flickr do it?
February 4, 2009 6:18 PM
Subscribe
My developer and I want to create a database schema that allows us to search through hundreds of tags applied to thousands of records. So far our benchmarks suck. Are we expecting too much?
Here's the deal (and I'm not the database guy, so bear with me if I mess up the nomenclature): We have a bunch of records and want to apply tags to them. We're testing a database schema with dummy data including 100,000 records and 750 tags, with an average tag-to-record count of 100. This doesn't perfectly approximate our actual data set, but it's close enough.
So we have Table 1, which lists the tag ids, Table 2, which lists the record ids, and a join table, which is a two-column table listing a tag id in one column and a record id in the other wherever that tag is associated with that record.
For example, if tag223 is associated with record12, record6994, and record90001, the join table looks like this:
tag223 | record12
tag223 | record6994
tag223 | record90001
(Given the numbers above, the actual join table has around 10,000,000 rows.)
This seemed to make sense when he explained it to me, but the benchmarks with the dummy data, even indexed, were ridiculously slow. This was just one a desktop machine, but it was several orders of magnitude slower than we want it to be.
Is it possible to do tag-based searching on this scale with decent performance? Are we missing something simple?
We're using Postgres. The final product is going to be extremely read-heavy (i.e., once the data's in it'll be only minor updates). If I've left anything out, please ask.
posted by hayvac to computers & internet (17 comments total)
8 users marked this as a favorite
1. Are any of your joins on UTF8 varchar columns? UTF8 string operations are very expensive, try to avoid them.
2. Are you using subqueries or views? If so, you are losing your indexes. Try to do outer joins instead
3. Use your database's explain tool. For MySQL try describe
posted by b1tr0t at 6:29 PM on February 4