seeking search engine
March 9, 2006 4:05 PM

why are MSN, Yahoo & Google the only decent search engines nowadays? Are there any search engines made by organizations that don't enable the oppression of billions or aren't conspiring to hurt the Chinese?
posted by sswiller to Computers & Internet (19 answers total) 1 user marked this as a favorite

If a British style pub opened in the US would it have a drinking age of 18? No. It would have a drinking age of 21. Google are only operating under the laws of the country.

However, I use A9 nowadays, although I believe they get some results through Google.
posted by gergtreble at 4:12 PM on March 9, 2006

Alltheweb is often considered an alternative to Google. I used them for a while when I was pissed at Google about something (removing my site from the results of a search for my site's name), but I gave up.

They're pretty similar in results, IMO.
posted by delmoi at 4:15 PM on March 9, 2006

Thank you.
posted by sswiller at 4:16 PM on March 9, 2006

Why aren't there more? Because search is not an easy thing. You have to:
  • go out and fetch every one of the billions of pages on the internet (without being a pest)
  • search through all of them in a fraction of a second
  • order the results in a way that the best links are on top, where "best" is a "know it when I see it" concept
  • without being misled by all the companies that are working to explicitly game the system
  • and do it all better than Google.
It's really not a very good way for a tech company to spend their money.
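As a rough illustration of the "search through all of them in a fraction of a second" step in the list above, here is a minimal sketch of an inverted index in Python. It assumes a toy in-memory corpus; the function names and sample pages are made up for illustration, and ranking by raw word count is only a stand-in for "best", which is exactly the kind of naive scoring that is easy to game.

```python
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def build_index(pages):
    """Map each word to {page_id: count}. `pages` is a hypothetical dict of id -> text."""
    index = defaultdict(dict)
    for page_id, text in pages.items():
        for word, count in Counter(tokenize(text)).items():
            index[word][page_id] = count
    return index


def search(index, query):
    """Return ids of pages containing every query word, highest total count first."""
    words = tokenize(query)
    if not words:
        return []
    postings = [index.get(word, {}) for word in words]
    # Only pages present in every posting list match the whole query.
    matches = set(postings[0])
    for posting in postings[1:]:
        matches &= set(posting)
    # Naive score: sum of term counts. Ties are broken by page id for determinism.
    return sorted(matches, key=lambda pid: (-sum(p[pid] for p in postings), pid))


# Made-up sample corpus for illustration only.
PAGES = {
    "a": "origami crane folding guide",
    "b": "origami project origami device review",
    "c": "pub opening hours",
}
INDEX = build_index(PAGES)
```

The lookup touches only the posting lists for the query words instead of scanning every page, which is the basic reason an index can answer quickly. Everything else on the list (crawling, spam resistance, real ranking) is outside this sketch.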
posted by smackfu at 4:18 PM on March 9, 2006

Alltheweb (Overture) is owned by Yahoo and is just a web property that queries the Yahoo index.

posted by junesix at 4:32 PM on March 9, 2006

why are MSN, Yahoo & Google the only decent search engines nowadays?

Because
a) they have spent a lot of money,
b) they have a lot of experience with tweaking algorithms, indexing etc.,
c) they have a lot of very clever people working for them, and
d) the internet is very, very, very big.

Also, people pretty much stick to what they know, or to whatever is the default in their browser. This means there are very high barriers to entry for competitors. But there are still quite a few competitors (see the link below).

Are there any search engines made by organizations that don't enable the oppression of billions or aren't conspiring to hurt the Chinese?

See this AskMe
posted by MetaMonkey at 4:39 PM on March 9, 2006

"Barrier to entry" is an understatement.

Imagine what it would take to create from scratch a search engine competitive with Google. You would need millions (if not billions) in hardware, spread out in datacenters globally. You would also need a large pile of cash to pay for all those network connections to move the terabytes of data necessary to constantly spider those tens of billions of pages.

But even if you had the cash to buy all those raw materials, you would need the software. There is nothing that you can buy off the shelf that would handle this. Google wrote everything themselves. And when you think about the requirements of this software, it is quite staggering. You must accept a query and then search these billions of pages in less than a few seconds, and return the results to the user. And you must do this hundreds or thousands of times per second. The programmer resources required to write something that can pull this off would be non-trivial.

On top of that, you can't just do a simple keyword search à la AltaVista circa 1996. Those results would be junk, easily gamed by the SEO blackhats. No, you have to invent reams of elaborate home-grown algorithms and heuristics to try to pull meaningful rankings out of those terabytes of data.

To pay for these massive resources you have to either sell advertising or sell higher rankings in the results. In order to make any money selling these things you have to have a significant amount of traffic. In order to have a significant amount of traffic you have to have all of the above -- a full and fresh index of many pages, a fast and comprehensive search, refined and tweaked algorithms to eliminate junk, and so on. Thus a circular dependency that makes it hard to bootstrap any kind of large search engine.
posted by Rhomboid at 4:54 PM on March 9, 2006

why are MSN, Yahoo & Google the only decent search engines nowadays?

Cuz no one has an algorithm that yields better machine-generated search results than Google's PageRank (everyone else has an engine that works in roughly the same fashion). All the money in the world wouldn't make a difference if my search engine was based on the same principles as PageRank.
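For readers who have not seen it, the core PageRank idea can be sketched in a few lines of Python. This is a textbook-style power iteration over a made-up three-page link graph, not Google's production algorithm; the damping factor of 0.85 is the commonly cited value, and the graph, page names, and iteration count are assumptions for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over `links`, a dict of page -> list of outgoing links."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page gets the "random jump" share first.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its current rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical three-page web: "a" and "b" both link to "hub", which links back to "a".
RANKS = pagerank({"a": ["hub"], "b": ["hub"], "hub": ["a"]})
```

In this toy graph "hub" ends up ranked highest because two pages link to it, "a" comes next because the highly ranked "hub" links to it, and "b" comes last because nothing links to it. A page's score depends on who links to it, not on the words it contains.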

However in parts of Asia, user-created "seeded" knowledge systems rule the search world. A Korean firm called Naver pioneered Knowledge Search which has completely swept over Asia in popularity. Think AskMefi meets Wikipedia built around a competitive gaming community where questions are seeded and people compete to supply the best answers for points. With Google, a search for "origami project" would turn up origami craft pages, Microsoft's UMPC website, and some tech blog articles about the device. A search on Naver would yield an extensively researched article carefully written and edited by multiple users about the UMPC device. So while Google's results point you to links with info, Naver directly delivers the info. I've tried it and it's an amazing search tool. If you see Google and Yahoo pouring massive amounts into their "Answers" properties, it's because they've witnessed the power of human-generated content in search engines. Look for Naver+Knowledge Search and clones coming to Europe/North Am shores soon.
posted by junesix at 5:02 PM on March 9, 2006

BW had a short article on Naver (NHN).

The database now has some 37 million questions and answers that can get returned with search results.

Imagine AskMefi with 37M unique questions and well-composed answers - that could be a serious contender to Google.
posted by junesix at 5:06 PM on March 9, 2006

My thinking is that if Google refused to implement the Chinese government's demands, the government would block Chinese people's access to Google completely. It's better to have a crippled Google than no Google at all, right?
posted by lemur at 5:56 PM on March 9, 2006

See also.
posted by hindmost at 5:59 PM on March 9, 2006

Thanks for all the answers.

The programmer resources required to write something that can pull this off would be non-trivial.

That astounds me. It means that the amount of energy required to create a novel startup has dropped since the nineties. I asked because it seems like Yahoo, Goog, Ebay, and Amazon are still the only dotcom stocks worth mentioning on the news.
posted by sswiller at 6:02 PM on March 9, 2006

sorry forgot to hit preview
posted by sswiller at 6:03 PM on March 9, 2006

Search Engine Relationship Chart.
posted by mlis at 6:38 PM on March 9, 2006

I'd like to expand on MetaMonkey's point about the big three search engines having lots of smart people. The big search engines have spent a lot of time and money trying to hire/buy out just about anyone talented in anything related to search engines. Remember digits of e, the Google Labs Aptitude Test, and the fuss over defections?

A lot of clever people have disappeared into those three companies, and they're not saying much (compare papers written by Googlers to papers written by people while at Google) about what they do there. There was even a spate of articles at one point about techies unwilling to work for startups anymore, due to hiring pressure from search companies (can't find a better link than this ATM).
That's an awfully large technology gap for startups to bridge.
posted by nml at 7:23 PM on March 9, 2006

It's the evolution of the market. There were many search engines, but only the fittest survived.
posted by blue_beetle at 7:59 PM on March 9, 2006

Rhomboid: Imagine what it would take to create from scratch a search engine competitive with Google. You would need millions (if not billions) in hardware, spread out in datacenters globally. You would also need a large pile of cash to pay for all those network connections to move the terabytes of data necessary to constantly spider those tens of billions of pages.
I can definitely speak to this, although this thread has been well answered already. It is incredibly expensive to build a functional search engine from scratch, although the cost is more in the high tens of millions, approaching $100m or so, than in the billions... and while search engines can be and are fantastically profitable, that's only if you get the volume of traffic to support the immense costs. Google and Yahoo did it by starting early and growing through word of mouth until they became household names, and MSN has a built-in audience from the many MSN properties (Messenger, Hotmail, the MSN home page, IE browser defaults, etc). Incidentally, mlis's relationship chart is somewhat out of date.

In addition, as noted, getting the people to do the actual coding and building of a search engine is not easy. You have to have top-notch coders who can write extremely efficient code (crawling and serving algorithms that are milliseconds faster add up quickly when doing millions of queries on thousands of machines), top-notch math and linguistics gurus to make better page ranking analysis, brilliant mathematicians to devise incredibly complex algorithms to sell and sort the paid ads that go in the sidebar/top of the page a la Google/Yahoo/MSN (they all use some form of keyword bidding to earn their revenue), and top-notch development architects as well as operations architects to tie it all together (managing the coding and operating of multi-thousand server farms is not easy, and is not something Joe Programmer-Off-The-Street can do!). It takes hundreds of people to actually do this; it's a fantastically complex problem from all perspectives.

random thoughts:
  • It's not terabytes, it's petabytes of data. 5-8 billion web pages crawled and cached, for example, at an average size of 11-15K per page... plus you not only have to have the cache of documents, you also have to have indexes and reverse indexes of words to document IDs, which are themselves measured in many terabytes in size.
  • This requires a server count in the high three or low four digits to hold a single instance of this cache and index, even with lots of hard drive space per machine. Even cheap, commodity machines like the ones Google uses will run several hundred dollars per machine, to the tune of thousands and thousands for any one instance of the index.
  • The automation for these systems is really an impressive thing, as any assumption that human beings can manage these servers like most web services is not functional. This also means the service has to be built with a certain over-capacity structure, to allow for machine failures and hardware failures to accumulate over a few weeks' time without impacting the website
  • In addition to that block of servers times N number of redundant instances, you need farms of servers to actually crawl the web, as well as servers to index and analyze the pages that are crawled. The vast majority of your servers will always be the index-holding servers, but hundreds or thousands of additional servers are needed to make it all work.
  • Crawling the web to keep it relatively fresh is quite a proposition unto itself: imagine several hundred machines downloading constantly, 24/7, with sustained ingress of 680Mb/s or more, downloading 4,000-5,000 documents a second, every single second of every single day, to build a 5-8B doc crawled index that's regularly refreshed and recrawled. The cost of a near-gigabit ingress and the hundreds of machines to crawl these results is... not cheap.
  • Each instance of the index is going to have a limit on how many queries it can handle per second, just from disk latencies. While the massively parallel nature of this storage allows blocks of the index to be scanned on different machines at the same time, at a Google, Yahoo, or MSN level of traffic you will be dealing with orders of magnitude more queries per second than one instance of the index could ever handle. Dozens of instances of these indexes are needed. Ergo, we are literally talking about tens of thousands of servers, or in Google's case, where they use even weaker and cheaper individual servers, potentially hundreds of thousands of servers in total. Where the heck do you put them all???
  • Geolocation and proxying. Companies like Akamai and Savvis and others specialize in cutting down pageload times by doing TCP aggregation local to the user, and caching the common page elements. The costs of these services for a very high volume site are astronomical, potentially into the 7 figures per month. And if you're a company like Google et al, you'll want to have multiple datacenters so that you are not geographically vulnerable; each one will have to have thousands of servers.
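The figures in the list above can be sanity-checked with some back-of-envelope arithmetic. The Python sketch below uses only the numbers quoted in the list (5-8 billion pages, 11-15K per page, 4,000-5,000 documents a second); the function names, the decimal units, and the choice to ignore protocol overhead and compression are assumptions for illustration.

```python
KB = 1_000  # bytes; decimal units keep the arithmetic easy to follow
TB = 1_000_000_000_000


def cache_size_tb(pages, avg_page_kb):
    """Raw size of one cached copy of the crawl, in terabytes."""
    return pages * avg_page_kb * KB / TB


def crawl_bandwidth_mbps(docs_per_second, avg_page_kb):
    """Sustained download rate in megabits per second, ignoring protocol overhead."""
    return docs_per_second * avg_page_kb * KB * 8 / 1_000_000


def full_crawl_days(pages, docs_per_second):
    """Days needed to fetch every page once at a constant rate."""
    return pages / docs_per_second / 86_400


# Low and high ends of the ranges quoted in the list above.
LOW = {"pages": 5e9, "avg_page_kb": 11, "docs_per_second": 4000}
HIGH = {"pages": 8e9, "avg_page_kb": 15, "docs_per_second": 5000}

for name, case in (("low", LOW), ("high", HIGH)):
    print(
        name,
        round(cache_size_tb(case["pages"], case["avg_page_kb"])), "TB cached,",
        round(crawl_bandwidth_mbps(case["docs_per_second"], case["avg_page_kb"])), "Mb/s,",
        round(full_crawl_days(case["pages"], case["docs_per_second"]), 1), "days per full crawl",
    )
```

On these inputs a single raw copy of the page cache comes to roughly 55 to 120 TB, so the petabyte scale only appears once the indexes and the dozens of redundant instances are multiplied in. The payload bandwidth works out to roughly 350 to 600 Mb/s, somewhat under the 680Mb/s quoted, and the gap may be headers, retries, and other overhead that this sketch does not model. A full pass over the index takes on the order of two to three weeks at that rate.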
That's all just a little more detail into the general "Building a functional search engine is so un-freakin'-believably expensive and involved you'd have to be an already huge company to do it". Just to have a search engine that isn't a laughing stock compared to the giants, you'll need thousands of servers and networking equipment, gigabits of traffic, the collective datacenter space and power of a small city, etc. Google and Yahoo started when they had the luxury of indexing a much smaller internet, or when human-managed indexes seemed cutting edge, and grew their complexity and footprint over time. Microsoft obviously has very deep pockets and could throw hundreds of millions at building a self-contained search engine (until a little more than a year ago, MSN was mostly just fronting search results from Inktomi, which was bought by Yahoo along with Overture).

If you can pull it off, though, the profits are immense: where I worked, I heard through the grapevine that the net profit per employee was in the low millions per year.

I actually helped design, build, automate, and eventually run the Operations for one of these major search engines. Yet I am now finding myself unable to get work, suffering through clueless recruiters who read my resume and then ask "So, do you have production site web experience?" God, I hate my life. I don't mean to whine, but... well, yes, yes I do mean to whine. Sorry. :)
posted by hincandenza at 2:36 AM on March 10, 2006 [2 favorites]

Not to derail, but hincandenza, really? Sounds like it's a self-promotion problem, which is all too common. Email me, since there's no contact info in your profile.
posted by anildash at 3:38 AM on March 10, 2006 [1 favorite]

Wow, there you go... SixApart veep to the rescue.
posted by junesix at 9:35 AM on March 10, 2006

This thread is closed to new comments.