I want to start a search engine...now what?
April 20, 2007 12:41 PM

So I'd like to start a website. More specifically, a search engine. Turns out, I'm pretty clueless when it comes to, well, everything about starting it up...

I understand the basic framework of crawler, index, database query, search interface. What I'd like to be able to do is get a better idea of the overall architecture, what sort of hardware is needed, and how long it'd take to make this happen. This in turn would help me understand the kinds of costs I'm looking at, which in turn helps me develop the beginnings of a business model. I'm not opposed to checking out a consultant, but not sure what type of consultant to contact. Any and all help is much appreciated.
posted by undercoverhuwaaah to Computers & Internet (15 answers total) 1 user marked this as a favorite
 
I think pretty much any capable software developer with some experience could do this. I know that most of the people I've worked with could do it. Maybe consider a job posting to get someone to help you spec it out, and go from there?

(I'm assuming you're not asking for details on the architecture, hardware, costs, etc. in this question, because those would be nearly impossible to provide. I'm assuming you're asking where to look or who to look for.)
posted by RustyBrooks at 12:49 PM on April 20, 2007


Why do you want to start up a search engine specifically? Have you done ANY competitive analysis on this space? How will you make money with it?

Here's the issue: Google spends bajillions of dollars to hire the most talented mathematicians in the known world -- people with multiple PhDs who have the option of working somewhere like CERN or a big university ... or Google ... -- and they can't even write a search engine that works reliably. Not saying that you aren't talented or whatever, but if you don't know how to answer the question of what you need to start a web search engine, then you're never going to be able to compete in an already crowded market.

Now, you can see what Google started with, when the web was much MUCH smaller: http://backrub.tjtech.org/May1998/hardware.htm

The big things you need are: a high-performance database that'll be able to index all of that content, the closest thing to an O(1) algorithm for searching that index and returning
posted by SpecialK at 12:52 PM on April 20, 2007


sorry, premature post.

... returning results.

Yes, any talented programmer can write any old search engine, but ... well, the challenge is to provide relevant results. And that's not something that many developers can do, because the discrete math (aka 'formal logic') to do that in an efficient and productive manner is above the head of most people who don't have a PhD in computer science or a related mathematics field.
posted by SpecialK at 12:55 PM on April 20, 2007


but ... well, the challenge is to provide relevant results.
That's not really what's being asked here, though. The Q wasn't for the best search engine, just a search engine. One of the sleaziest guys I know has gotten rich off selling rankings on a no-mark search site he wrote. It's amazing what gets traffic.
posted by bonaldi at 12:58 PM on April 20, 2007


What's the scope of this? Are you thinking of a general-purpose web search like yahoo or google, or is this a domain-specific idea with a smaller scope?

What you intend to index and who you intend to serve with it will have a profound effect on your architectural needs.
posted by cortex at 1:01 PM on April 20, 2007


Writing a good search engine is hard. Unless you are searching a very small dataset, it's nothing like a typical web-application (discussion forum, blog, CMS, etc.). You need to have some clever ways of indexing, otherwise it will be slow.

I actually tried to write one once. What killed me was searching for phrases. e.g. one test case was searching Hamlet for "to be or not to be". Not easy. If you don't need it to be scalable, then it is easy. You could just use the built-in text-indexing in the database, or just scan through the whole database, like grep.
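For anyone curious why phrases are the hard part: one common approach (not necessarily what the poster tried) is a positional inverted index, which records where each word occurs so that a phrase match becomes a check for consecutive positions. Here is a minimal in-memory sketch in Python. The function names and the tokenizer are made up for illustration, and a real engine would keep the index on disk and compress it.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_positional_index(docs):
    """Map each term to {doc_id: [positions where the term occurs]}."""
    index = {}
    for doc_id, text in docs.items():
        for pos, term in enumerate(tokenize(text)):
            index.setdefault(term, {}).setdefault(doc_id, []).append(pos)
    return index


def phrase_search(index, phrase):
    """Return the sorted ids of documents that contain the exact phrase."""
    terms = tokenize(phrase)
    if not terms or any(term not in index for term in terms):
        return []

    # Only documents that contain every term can possibly match.
    candidates = set(index[terms[0]])
    for term in terms[1:]:
        candidates &= set(index[term])

    hits = []
    for doc_id in candidates:
        # The phrase starts at position p if term i occurs at p + i.
        for start in index[terms[0]][doc_id]:
            if all(start + i in index[term][doc_id]
                   for i, term in enumerate(terms)):
                hits.append(doc_id)
                break
    return sorted(hits)


if __name__ == "__main__":
    # Hypothetical two-document corpus for illustration.
    docs = {
        "hamlet": "To be, or not to be, that is the question",
        "other": "It is not to be or so they say",
    }
    index = build_positional_index(docs)
    print(phrase_search(index, "to be or not to be"))
```

The grep-style alternative is just a substring scan over every document, which is fine for a small collection but gets slower in direct proportion to the amount of text.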

Have a look at some open source projects to get a sense of the issues, e.g. htdig.

Also, post more info if you want more useful answers.
posted by kamelhoecker at 1:13 PM on April 20, 2007


Response by poster: I'm thinking vertical search. Industry-specific. I'd assume it makes things a bit easier as far as getting relevant results. I realize there are huge players like the googles and the yahoos, but there also is a growing market to help people find specific information that would otherwise get bogged down in the 20+ pages of results those big guys offer. I realize the question wasn't as clear as I had hoped: I would like any resources on the web that would expand my understanding, but also suggestions on who to talk to offline to help put a more formal spec on paper (and in turn see costs, etc.)
posted by undercoverhuwaaah at 1:13 PM on April 20, 2007


Instead of assuming the details, you may be interested to read the decent overview of the concept at Wikipedia:

http://en.wikipedia.org/wiki/Search_engine
posted by rhizome at 1:21 PM on April 20, 2007


You should really check out Nutch.

They have a crawler, a distributed filesystem for storing all the crawl data, a distributed computation framework for analyzing all that data with their indexer and ranker, and a front end for actually querying the index. I'm sure any developer could write a "search engine," but I doubt just any developer could bang out something like Nutch.
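To make those stages concrete, here is a toy, single-machine sketch of crawl, index, and query in Python. It is not Nutch's code or API. The PAGES dictionary stands in for the web, the URLs and page text are invented, and ranking is plain term frequency, which is an assumption chosen for brevity rather than anything a production engine relies on.

```python
import re
from collections import Counter, deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
PAGES = {
    "a.example/home": ("welding supplies and welding helmets",
                       ["a.example/rods", "a.example/about"]),
    "a.example/rods": ("welding rods for steel", ["a.example/home"]),
    "a.example/about": ("about our shop", []),
    "a.example/orphan": ("welding page nobody links to", []),
}


def crawl(seed, fetch):
    """Breadth-first crawl from seed; fetch(url) returns (text, links) or None."""
    seen, queue, store = {seed}, deque([seed]), {}
    while queue:
        url = queue.popleft()
        page = fetch(url)
        if page is None:
            continue  # Unreachable or missing page: skip it.
        text, links = page
        store[url] = text
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return store


def build_index(store):
    """Inverted index: term -> Counter mapping url to term frequency."""
    index = {}
    for url, text in store.items():
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            index.setdefault(term, Counter())[url] += 1
    return index


def search(index, query):
    """Rank pages that contain every query term by summed term frequency."""
    terms = re.findall(r"[a-z0-9]+", query.lower())
    if not terms:
        return []
    postings = [index.get(term, Counter()) for term in terms]
    matching = set(postings[0]).intersection(*postings[1:])
    scores = {url: sum(p[url] for p in postings) for url in matching}
    # Highest score first; ties broken alphabetically by URL.
    return sorted(scores, key=lambda url: (-scores[url], url))


if __name__ == "__main__":
    store = crawl("a.example/home", PAGES.get)
    index = build_index(store)
    print(search(index, "welding"))
```

Note that the orphan page never makes it into the index because nothing links to it. Everything hard about the real thing (politeness, duplicate pages, storage that spans machines, ranking that resists spam) is exactly what this sketch leaves out.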

What do you hope to achieve with this search engine? Google and others can sell you vertical search as a service that you can monetize on your own.
posted by Good Brain at 1:36 PM on April 20, 2007 [1 favorite]


Your question is so open-ended and basic that it's sort of like asking "I'd like to build a car. How do I do that?"

That said, here are two books you should read. Of course, they both assume that you have some programming knowledge (perl, php, ruby, whatever) and that you can get around linux. If whatever consultant you hire doesn't at least know the principles in those two books, your search engine will fall down if it gets any sort of traffic.

Like others said, search is hard...even if you are focusing on a limited vertical.
posted by rsanheim at 1:38 PM on April 20, 2007


To elaborate, I wasn't really implying that any developer could write a good search engine. I think any good developer could write a *decent* one though. But more than anything I was implying that any good developer could get you started with architecture, hardware, etc.

Search is a little like direction-finding-on-maps, or language translation. Things that are pretty easy to do 80-90% of the way there, but damn near impossible to do 100%. For some things, that's OK.

If the poster's idea is sufficiently interesting in this rather crowded search space, then an 80% solution will probably be good enough to garner attention and eventually money and talent. If it isn't, then it won't really matter if his search engine is top notch or not. I've seen way too many first-order prototypes become actual products to worry prematurely about getting a badass result up front.
posted by RustyBrooks at 1:53 PM on April 20, 2007


Also, I may have to accept the fact that most of the people I've worked with over the years have been well above average, and that's definitely possible.
posted by RustyBrooks at 1:54 PM on April 20, 2007


Here's the issue: Google spends bajillions of dollars to hire the most talented mathematicians in the known world -- people with multiple PhDs who have the option of working somewhere like CERN or a big university ... or Google ... -- and they can't even write a search engine that works reliably.

But what about the unknown world, hmm?

Like everything else, there are a bunch of open source tools to do this. Just try out a couple and stick on some front-end code.
posted by delmoi at 2:05 PM on April 20, 2007


"Why Writing Your Own Search Engine is Hard" by Anna Patterson (from Stanford/Internet Archive/Google)
posted by b. at 7:30 PM on April 20, 2007 [1 favorite]


The fastest way to get a vertical search engine going is probably Google Coop. I suspect it will be hard to get better results without a couple of years' experience with search.
posted by dhoe at 12:32 AM on April 21, 2007


This thread is closed to new comments.