Fastest way to create search engine for discussion web site?
June 17, 2007 7:41 AM

What's the simplest, easiest, fastest way to create a search engine for a discussion website (in particular, this one) which is NOT RUN by that website, but on another computer, and which recognizes as searchable/sortable fields: 1) the "board" the post is in, 2) the titles as well as the bodies of posts, and 3) the dates the posts were made? Oh, and it should update with new posts every few hours, say.
posted by shivohum to Computers & Internet (4 answers total)
Well, the simple, fast, and easy way is to gain access to the database or a copy of the database for the site to be searched. Can you do that?

If the answer is "no," then the simple and easy way is to write some software to scrape the site and parse the HTML every few hours, and then some more software to query the resulting index.
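
A bare-bones Python sketch of that second route, with the board URL as a placeholder and the parsing/indexing step left as a stub:

# Bare-bones outline of "scrape every few hours, then query the index."
# The board URL is a placeholder and index_page() is left as a stub.
import time
import urllib.request

BOARD_URLS = ["http://example.com/board/1"]   # placeholder, not the real site

def index_page(url, html):
    """Parse the HTML and add new posts to your own searchable index (stub)."""

while True:
    for url in BOARD_URLS:
        html = urllib.request.urlopen(url).read().decode("utf-8", "replace")
        index_page(url, html)
    time.sleep(3 * 60 * 60)   # re-run every few hours, per the question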
posted by majick at 8:28 AM on June 17, 2007


Yeah, assuming you don't have access to the DB, you'll need to write a robot crawler (not very hard) to visit every page, parse the appropriate HTML, and stick it in your own DB.

In not-even-pseudocode, the process would be a bit like this:

-foreach board (this isn't very hard, they're all numbered. at worst you can have a LUT for the ones you want + names)
---foreach thread on the board page, going back to the last time searched (slightly harder, since you'll have to do a bit of searching in the HTML, but you shouldn't need to fully parse anything)
------visit that thread.
------if new thread, add a thread entry in the DB with name, date, etc
------foreach post in the thread (should get by with a search here too)
----------if new post (just check dates), add a post entry to the DB, with the post table's FK pointing to the thread table's PK.

That'll be the bulk of it. It will take a while to run, and may get automatically blocked. You should be sure to set the user-agent string to identify it as a robot, and respect the robots.txt file if it exists (a rough Python sketch of the whole loop follows).
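
Here is that sketch, using only the standard library. The BASE_URL, the URL scheme, and the regular expressions are placeholders rather than the real site's markup; the shape of the crawl (boards, then threads, then posts, with robots.txt and the user-agent handled) is the part that matters.

# Rough sketch of the crawl loop outlined above. BASE_URL, the URL scheme,
# and the regular expressions are hypothetical placeholders -- adapt them to
# the real site's markup. Standard library only.
import re
import sqlite3
import urllib.error
import urllib.request
import urllib.robotparser

BASE_URL = "http://example.com"          # placeholder for the discussion site
BOARDS = {1: "General", 2: "Regional"}   # LUT of board id -> name, as suggested

db = sqlite3.connect("scrape.db")
db.execute("""CREATE TABLE IF NOT EXISTS threads
              (id INTEGER PRIMARY KEY, board INTEGER, title TEXT)""")
db.execute("""CREATE TABLE IF NOT EXISTS posts
              (id INTEGER PRIMARY KEY, thread_id INTEGER REFERENCES threads(id),
               posted TEXT, body TEXT)""")

# Respect robots.txt and identify the crawler as a robot.
robots = urllib.robotparser.RobotFileParser(BASE_URL + "/robots.txt")
robots.read()

def fetch(url):
    """Fetch a page politely, or return None if disallowed or unreachable."""
    if not robots.can_fetch("MySearchBot", url):
        return None
    req = urllib.request.Request(url, headers={"User-Agent": "MySearchBot/0.1"})
    try:
        return urllib.request.urlopen(req).read().decode("utf-8", "replace")
    except urllib.error.URLError:
        return None

for board_id in BOARDS:
    board_html = fetch(f"{BASE_URL}/board/{board_id}")   # hypothetical URL scheme
    if board_html is None:
        continue
    # Pull thread ids and titles off the board page (placeholder pattern).
    for thread_id, title in re.findall(r'href="/thread/(\d+)">([^<]+)<', board_html):
        thread_html = fetch(f"{BASE_URL}/thread/{thread_id}")
        if thread_html is None:
            continue
        db.execute("INSERT OR IGNORE INTO threads (id, board, title) VALUES (?, ?, ?)",
                   (thread_id, board_id, title))
        # Pull (date, body) pairs off the thread page (placeholder pattern).
        for posted, body in re.findall(
                r'<span class="date">([^<]+)</span>\s*<p>(.*?)</p>', thread_html, re.S):
            db.execute("""INSERT INTO posts (thread_id, posted, body)
                          SELECT ?, ?, ? WHERE NOT EXISTS
                          (SELECT 1 FROM posts WHERE thread_id = ? AND posted = ?)""",
                       (thread_id, posted, body, thread_id, posted))
db.commit()

Run from cron every few hours, something shaped like this also covers the "update with new posts every few hours" part of the question.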

In terms of searching the DB, the easiest solution would be to use the DB's full-text search, if it has one. MS SQL's is good, and I think PgSQL has a pretty good one available. No idea if MySQL has it at all.
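
For illustration, here's that idea in a self-contained form, with SQLite's FTS5 module standing in for whichever database's full-text search you end up on (the table and sample row are made up):

# The database's built-in full-text search doing the heavy lifting. SQLite's
# FTS5 module stands in here for MS SQL / PgSQL / MySQL full-text indexes;
# it needs an SQLite build with FTS5 compiled in (recent Python builds have it).
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE VIRTUAL TABLE post_index USING fts5(board, title, body, posted)")
db.execute("INSERT INTO post_index VALUES (?, ?, ?, ?)",
           ("General", "Best ramen downtown", "Try the place on 5th...", "2007-06-17"))

# Keyword search, then filter/sort on the other scraped fields.
rows = db.execute("""SELECT board, title, posted
                     FROM post_index
                     WHERE post_index MATCH ?
                     ORDER BY posted DESC""", ("ramen",)).fetchall()
print(rows)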

Failing that, if it's low volume you can go through the text of each post with a stored procedure and try to match it for keywords (even MySQL supports stored procs now, doesn't it?).

If it's high volume, you'll probably want to make a separate keywords table. This is where good DB tuning comes in handy: your DB will probably have a particular arrangement that's most efficient in terms of sorting the table, etc., and this is where it actually matters.
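
A sketch of what that separate keywords table might look like: a crude inverted index, with deliberately naive tokenizing and made-up sample posts.

# The "separate keywords table" idea: an inverted index mapping each word to
# the posts that contain it, so a keyword lookup becomes an indexed join
# instead of a scan over the post bodies.
import re
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, body TEXT)")
db.execute("""CREATE TABLE keywords (word TEXT, post_id INTEGER REFERENCES posts(id),
              PRIMARY KEY (word, post_id))""")

def index_post(post_id, body):
    db.execute("INSERT INTO posts VALUES (?, ?)", (post_id, body))
    for word in set(re.findall(r"[a-z']+", body.lower())):   # naive tokenizer
        db.execute("INSERT OR IGNORE INTO keywords VALUES (?, ?)", (word, post_id))

index_post(1, "Try the ramen place on 5th Street")
index_post(2, "Anyone know a good dim sum spot?")

hits = db.execute("""SELECT p.id, p.body
                     FROM keywords k JOIN posts p ON p.id = k.post_id
                     WHERE k.word = ?""", ("ramen",)).fetchall()
print(hits)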
posted by devilsbrigade at 8:48 AM on June 17, 2007


There is a search function on that website. You could just construct query URLs and scrape the results pages?
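
A sketch of that approach; the search path and parameter name here are made up, since the real query URL format would have to be read off the site's own search form:

# Piggyback on the site's existing search: build a query URL, fetch the
# results page, and scrape it. The /search path and "query" parameter are
# hypothetical -- copy whatever the site's search form actually submits.
import urllib.parse
import urllib.request

def site_search(terms):
    qs = urllib.parse.urlencode({"query": terms})            # hypothetical parameter
    url = f"http://example.com/search?{qs}"                   # hypothetical endpoint
    req = urllib.request.Request(url, headers={"User-Agent": "MySearchBot/0.1"})
    return urllib.request.urlopen(req).read().decode("utf-8", "replace")

results_html = site_search("ramen downtown")
# ...then parse results_html the same way you'd parse a board page.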
posted by tmcw at 9:10 AM on June 17, 2007


Aside from the technical how-tos, Chowhound is owned by CNET (it was sold by the founders a couple of years ago), so you might want to check into any issues that arise from that, specifically pulling their 'content' without proper credit and linkage and all.
posted by pupdog at 9:47 AM on June 17, 2007

