How does flickr do it?
February 4, 2009 6:18 PM

My developer and I want to create a database schema that allows us to search through hundreds of tags applied to thousands of records. So far our benchmarks suck. Are we expecting too much?

Here's the deal (and I'm not the database guy, so bear with me if I mess up the nomenclature): We have a bunch of records and want to apply tags to them. We're testing a database schema with dummy data including 100,000 records and 750 tags, with an average of about 100 tags per record. This doesn't perfectly approximate our actual data set, but it's close enough.

So we have Table 1, which lists the tag ids, Table 2, which lists the record ids, and a join table, which is a two-column table listing a tag id in one column and a record id in the other wherever that tag is associated with that record.

For example, if tag223 is associated with record12, record6994, and record90001, the join table looks like this:

tag223 | record12
tag223 | record6994
tag223 | record90001


(Given the numbers above, the actual join table has around 10,000,000 rows.)
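
For concreteness, here's a rough sketch of what a schema like this might look like in Postgres; the table, column, and index names are illustrative guesses, not our actual definitions:

CREATE TABLE tags (
    tag_id  integer PRIMARY KEY,
    name    text NOT NULL
);

CREATE TABLE records (
    record_id  integer PRIMARY KEY,
    title      text NOT NULL
);

CREATE TABLE record_tags (
    tag_id     integer NOT NULL REFERENCES tags,
    record_id  integer NOT NULL REFERENCES records,
    PRIMARY KEY (tag_id, record_id)  -- doubles as an index for tag -> records lookups
);

-- for lookups in the other direction (record -> tags)
CREATE INDEX record_tags_record_idx ON record_tags (record_id);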

This seemed to make sense when he explained it to me, but the benchmarks with the dummy data, even indexed, were ridiculously slow. This was just on a desktop machine, but it was several orders of magnitude slower than we want it to be.

Is it possible to do tag-based searching on this scale with decent performance? Are we missing something simple?

We're using Postgres. The final product is going to be extremely read-heavy (i.e., once the data's in it'll be only minor updates). If I've left anything out, please ask.
posted by hayvac to Computers & Internet (15 answers total) 7 users marked this as a favorite
 
b1tr0t: It's EXPLAIN in Postgres as well.

UTF8 shouldn't be slow if you can avoid having to interpret it during queries (store in a normal form, don't do case-insensitive searches, don't use an index that requires treating it as anything but a sequence of bytes).

How slow is slow, anyway?
posted by hattifattener at 6:40 PM on February 4, 2009


How do websites like this one and various other forum software handle it?

Is the database install somehow not configured right?

Rant: why haven't tree-style databases and/or data structures of data structures (like in C) ever taken hold? EVERY damn time I want to structure data, I have to figure out how to unstructure it to fit into flat tables. Blarrg.
posted by gjc at 6:50 PM on February 4, 2009


Response by poster: How slow is slow, anyway?

I didn't know if this mattered since I can't tell you much about the testing environment, but a representative search was for all the records associated with a particular tag. It took 111 seconds.
posted by hayvac at 6:51 PM on February 4, 2009


How about posting your query and your actual table definitions? If your query is taking 111 seconds, my gut tells me that you're either missing an index or perhaps introducing a Cartesian join.

I'm a DB guy, by trade, and would be happy to take a closer look.
posted by neilkod at 7:01 PM on February 4, 2009


Response by poster: How about posting your query and your actual table definitions?

I don't have access to the database itself so I can't tell you this. (I'd get the DB guy on here, but he's not a member.) I'm most interested in knowing whether the idea itself is flawed before we start digging through the implementation. I was thinking that maybe Postgres is a horrible solution for this, or we've chosen a crappy schema.

Suggestions from the gut are more than welcome.
posted by hayvac at 7:05 PM on February 4, 2009


Dude. No modern search engines have database backends for exactly this reason.

Look into compiled binary indexers (like lucene or swish-e).


You are not the first person to ask this, by a lot.
posted by judge.mentok.the.mindtaker at 7:07 PM on February 4, 2009


Forgot to link
lucene
swish-e
posted by judge.mentok.the.mindtaker at 7:08 PM on February 4, 2009


Something else is wrong; Postgres is fine, and joining the join table to table 2 should be instantaneous if they're both indexed. What do you get for EXPLAIN SELECT * FROM join_table JOIN table_2 USING (record_id)?
posted by nicwolff at 7:11 PM on February 4, 2009


er, add a where clause to that: EXPLAIN SELECT * FROM join_table JOIN table_2 USING (record_id) WHERE tag_id = 123
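
For that WHERE clause to be fast, the join table needs an index whose leading column is tag_id (a composite primary key on (tag_id, record_id) would also work), and table_2 needs its primary key or an index on record_id. If one is missing, something along these lines should help; the index name is just an example:

CREATE INDEX join_table_tag_id_idx ON join_table (tag_id);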
posted by nicwolff at 7:15 PM on February 4, 2009


I work on a site that has 3,534,113 records in the join table and it has acceptable performance. I also use Postgres. My opinions:

- Make sure you're using version 8.3. Version 8 was a performance release and has many important changes.
- Make sure you've tweaked the configuration file. Give Postgres as much memory as possible. I gave mine a gig. Run ANALYZE to give the query planner some intelligent estimates.
- If you want to stick with the database, you're eventually going to want to partition. Partitioning is how sites like Flickr can scale. If users can only search for one tag at a time you have a huge advantage because you can split your join table into ten join tables and your app can just figure out which one to use based on which tag it is. But if you're allowing multiple tags, this becomes a lot harder. You will have to handle the joins in your app, and this gets messy.
- If users are hitting a small percentage of your tags, then cache. This is especially advantageous for you since you don't get many writes. You can even prime the cache in an overnight batch.
- By all means try denormalizing your tags into a text field and using Postgres's full text search module to see if you get better performance (a rough sketch follows this list). This has its own drawbacks, however.
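
A very rough sketch of that last idea, assuming single-word tag names and a tags/records/join-table layout like the one described in the question (all names here are illustrative); on 8.3 the full text search types are built in:

ALTER TABLE records ADD COLUMN tag_text tsvector;

-- fill it with a space-separated string of each record's tag names
UPDATE records r
SET tag_text = to_tsvector('simple',
    array_to_string(ARRAY(SELECT t.name
                          FROM record_tags rt JOIN tags t USING (tag_id)
                          WHERE rt.record_id = r.record_id), ' '));

CREATE INDEX records_tag_text_idx ON records USING gin (tag_text);

-- a single-tag search then looks like:
SELECT record_id FROM records WHERE tag_text @@ to_tsquery('simple', 'sometag');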
posted by rq at 7:50 PM on February 4, 2009


How many tags do you have in your dummy setup? Are there enough that no one tag is more than, say, 10% of the tags?

The reason I ask is that if there aren't many different values for a column in a particular table, Postgres will not use the index on that column. Indexes only speed things up if what we call "cardinality" is high.

You can test this by doing an EXPLAIN and seeing whether Postgres is using the index or doing a full table scan.
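
For example, something along these lines (table and column names are guesses at your schema):

EXPLAIN SELECT record_id FROM record_tags WHERE tag_id = 223;

-- "Index Scan" or "Bitmap Index Scan" in the resulting plan means the index is being used;
-- "Seq Scan" on the join table means Postgres is reading the whole thing.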
posted by i_am_joe's_spleen at 8:06 PM on February 4, 2009


Argh, I apologise for not reading your question closely. Anyway, EXPLAIN is the first step here, as others have suggested.
posted by i_am_joe's_spleen at 8:08 PM on February 4, 2009


I'm used to MySQL, but make sure that the database field types used in joins are exactly the same: same int size, same signed/unsigned values, everything. Otherwise the engine has to do work on each row before making the comparison. Also make sure that the fields you are joining on are indexed.
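
One quick way to check, in psql (table names are guesses at your schema):

\d tags
\d records
\d record_tags

-- tag_id and record_id should show the same type (e.g. integer) in every table, and
-- record_tags should list a primary key or index covering the columns you filter and join on.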
posted by xorry at 8:12 PM on February 4, 2009


Also, now that I have read your question more closely: I work with a Postgres database which has 100 times more rows in its important tables than your dummy data, with far more complex queries running all the time and as many writes as reads, and 111-second queries only happen with very complex aggregates and multi-table joins. The scenario you're talking about should always be sub-second.

It is true, as other posters have said, that a normalised relational database may not scale to Flickr levels; that may well need something more primitive, partitioned, and redundant. That's not your problem right now though.

Consider that there appear to be more than 1.4 million comments on Metafilter, judging by id number, and tens of thousands of threads, tagged up the wazoo, and yet you get a personalised page out of the relational database pretty damned quick.
posted by i_am_joe's_spleen at 1:47 AM on February 5, 2009


Maybe you need a new database guy. Postgres takes work to get right (compared to MySQL with MyISAM). Unfortunately I don't have enough experience to offer specific suggestions.
posted by PueExMachina at 5:16 PM on February 5, 2009

