I Want To Build My Own Google
April 2, 2015 6:21 AM

What I'd like to do is crawl a list of 5,000 to maybe 10,000 URLs max on a weekly basis, searching for a couple of specific terms. I don't see a way to set that up with Google Custom Search. I have access to server space and am comfortable enough with PHP or Python to configure a DB and a script if needed. There are countless options out there. Anybody have specific experience to point me in the right direction?
posted by COD to Computers & Internet (9 answers total) 10 users marked this as a favorite
 
Personally, I think Python is better suited to the problem you're describing. Here are a few technologies I know of that might help you solve it:

Apache Lucene & Solr: A library and a server, respectively, built specifically for indexing and searching documents.
Scrapy: A Python framework designed for writing web scrapers (quick sketch after this list).
Selenium: A Python (and other languages) library for driving a web browser. Could make scraping easier, since the browser will evaluate the JavaScript on the page more reliably.
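
For the term-matching part, a Scrapy spider can be pretty short. A rough sketch (the start URL and search terms are placeholders you'd swap for your own list):

```python
import scrapy

class TermSpider(scrapy.Spider):
    name = "term_search"
    # Your 5,000-10,000 URLs would go here, or be loaded from a file/DB.
    start_urls = ["http://example.com/page1"]
    search_terms = ["first term", "second term"]  # placeholders

    def parse(self, response):
        # Flatten the page's visible text and check it for each term.
        text = " ".join(response.xpath("//body//text()").extract()).lower()
        for term in self.search_terms:
            if term in text:
                yield {"url": response.url, "term": term}
```

Run it with "scrapy runspider spider.py -o hits.json" and you get a file of matches you can regenerate weekly from cron.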
posted by Axle at 6:31 AM on April 2, 2015


I had a very similar kind and scale of searching task, and I did it with handwritten Python code and a SQLite database. It turned out to be harder than I anticipated -- there were lots of weird situations I needed heuristics for, like making sure the crawler wasn't following links that *did* anything, and dealing with automatically generated URLs that were really just different pointers to the same content.
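
For the duplicate-URL problem, canonicalizing every URL before queueing it catches a lot of cases. A minimal sketch of the idea (which query params count as junk is an assumption you'd tune for your sites):

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumed-junk parameters; adjust for the sites you crawl.
JUNK_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "sessionid"}

def canonicalize(url):
    """Normalize a URL so trivially different forms compare equal."""
    scheme, netloc, path, query, _fragment = urlsplit(url)
    # Lowercase scheme/host, drop the fragment, strip junk params, sort the rest.
    params = sorted((k, v) for k, v in parse_qsl(query) if k not in JUNK_PARAMS)
    return urlunsplit((scheme.lower(), netloc.lower(),
                       path.rstrip("/") or "/", urlencode(params), ""))

seen = set()

def should_fetch(url):
    """Only fetch a URL whose canonical form we haven't seen yet."""
    key = canonicalize(url)
    if key in seen:
        return False
    seen.add(key)
    return True
```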

If I had it to do over again I'd try Nutch and Solr.
posted by xris at 6:33 AM on April 2, 2015


Seconding the combination of what's already been said here: Nutch for the web scraping, Lucene/Solr for searching your generated index. Don't write it yourself -- this is a solved problem.
posted by rachelpapers at 6:40 AM on April 2, 2015


And in fact Nutch has a tutorial to do pretty much exactly what you want to do here: https://wiki.apache.org/nutch/NutchTutorial
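
Once Nutch has pushed the crawl into Solr, checking for your terms from Python is just an HTTP query against Solr's select handler. A rough sketch (the host, core name, and field names are assumptions based on a default Nutch/Solr setup):

```python
import requests

# Assumes a local Solr with a core named "nutch"; adjust to your install.
SOLR_SELECT = "http://localhost:8983/solr/nutch/select"

def search(term, rows=20):
    """Return the URLs of indexed pages whose content matches the term."""
    resp = requests.get(SOLR_SELECT, params={
        "q": 'content:"%s"' % term,
        "wt": "json",
        "rows": rows,
    })
    resp.raise_for_status()
    return [doc.get("url") for doc in resp.json()["response"]["docs"]]

for url in search("your specific term"):
    print(url)
```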
posted by rachelpapers at 6:42 AM on April 2, 2015


Be very careful. The first "web crawler" software like this wreaked havoc on websites. Don't reinvent the wheel. Read some of the history before you write your own software.

http://en.wikipedia.org/wiki/Web_crawler
posted by intermod at 9:47 AM on April 2, 2015


You might also look into Elasticsearch for the search backend. It's really easy to set up and use.
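
A minimal sketch with the official Python client (the index and field names here are made up, and it assumes a node running locally on the default port):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()  # defaults to localhost:9200

# Index a crawled page; "pages"/"page" are placeholder names.
es.index(index="pages", doc_type="page", id="http://example.com/",
         body={"url": "http://example.com/", "content": "the page text"})

# Full-text search for one of your terms.
results = es.search(index="pages",
                    body={"query": {"match": {"content": "specific term"}}})
for hit in results["hits"]["hits"]:
    print(hit["_source"]["url"])
```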
posted by willF at 9:52 AM on April 2, 2015


I don't see a way to set that up with Google Custom Search.

Google Alerts does a fair job of this. It's somewhat customizable: you can choose the schedule, geo region, content type (blog, news, forum, video, etc.), and other criteria. The only thing you cannot do is limit it to specific URLs, but perhaps your search term in combination with those other criteria will narrow it down enough for you?

If that doesn't work for you, IFTTT.
posted by rada at 1:29 PM on April 2, 2015


Also beware: the first web crawler I made (see my previous questions) used up our whole monthly download allowance in a couple of days, because I was running it over and over while testing it.
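
If you're fetching with the requests library, requests-cache is a cheap way to avoid that while testing; a sketch:

```python
import requests
import requests_cache

# Cache every response in a local SQLite file, so repeated test runs
# only hit the network once per URL.
requests_cache.install_cache("crawl_test_cache")

resp = requests.get("http://example.com/")
print(resp.from_cache)  # False on the first run, True on later runs
```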
posted by lollusc at 4:29 PM on April 2, 2015


Hadoop + HBase + Solr + Nutch... Beware: it's a very version-sensitive setup; there are often compatibility issues between versions of these components.
posted by Stu-Pendous at 8:24 PM on April 4, 2015


This thread is closed to new comments.