I Want To Build My Own Google
April 2, 2015 6:21 AM
What I'd like to do is crawl a list of 5,000 to maybe 10,000 max URLs on a weekly basis, searching for a couple of specific terms. I don't see a way to set that up with Google Custom Search. I have access to server space and am comfortable enough with PHP or Python to configure a DB and script if needed. There are countless options out there. Anybody have specific experience to point me in the right direction?
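For a sense of scale, the roll-your-own version of this task is short; here is a minimal Python + SQLite sketch, assuming a plain text file of URLs (one per line) and a weekly cron run -- the file name, table, and search terms are placeholders:

```python
import sqlite3
import urllib.request

TERMS = ["term one", "term two"]   # placeholder search terms
URL_LIST = "urls.txt"              # placeholder: one URL per line, 5,000-10,000 entries

conn = sqlite3.connect("hits.db")
conn.execute("""CREATE TABLE IF NOT EXISTS hits
                (url TEXT, term TEXT, checked TEXT DEFAULT CURRENT_TIMESTAMP)""")

with open(URL_LIST) as f:
    urls = [line.strip() for line in f if line.strip()]

for url in urls:
    try:
        with urllib.request.urlopen(url, timeout=30) as resp:
            html = resp.read().decode("utf-8", errors="replace")
    except Exception as exc:
        print("skipped", url, exc)
        continue
    for term in TERMS:
        if term.lower() in html.lower():
            conn.execute("INSERT INTO hits (url, term) VALUES (?, ?)", (url, term))

conn.commit()
```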
I had a very similar kind and scale of searching task, and I did it with handwritten Python code and a sqlite database. It turned out to be harder than I anticipated -- there were lots of weird situations I needed heuristics for, like making sure the crawler wasn't following links that *did* anything, and dealing with automatically generated URLs that were non-unique pointers to the same content.
If I had it to do over again I'd try nutch and solr.
posted by xris at 6:33 AM on April 2, 2015
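One of the heuristics xris mentions -- collapsing automatically generated URLs that are really the same page -- usually comes down to canonicalizing URLs before checking whether you've seen them. A rough sketch; the list of parameters to ignore is an assumption you'd tune per site:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Query parameters that usually don't change the content (an assumption; tune per site).
IGNORED_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "sessionid", "ref"}

def canonicalize(url):
    scheme, netloc, path, query, _fragment = urlsplit(url)
    # Lower-case the scheme and host, drop the fragment, strip noise parameters,
    # and sort what's left so equivalent URLs reduce to the same key.
    params = sorted((k, v) for k, v in parse_qsl(query) if k not in IGNORED_PARAMS)
    return urlunsplit((scheme.lower(), netloc.lower(), path or "/", urlencode(params), ""))

seen = set()

def is_new(url):
    key = canonicalize(url)
    if key in seen:
        return False
    seen.add(key)
    return True
```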
Seconding a combination of what's already been said here. Nutch for the web scraping, Lucene/Solr for searching your generated index. Don't write it yourself -- this is a solved problem.
posted by rachelpapers at 6:40 AM on April 2, 2015
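Once Nutch has indexed into Solr, querying from Python is a single HTTP call to Solr's standard select handler. A sketch, assuming a local Solr with a core named "nutch" and the field names Nutch's example schema uses (content, url, title) -- all of which depend on your setup:

```python
import requests

SOLR = "http://localhost:8983/solr/nutch"   # assumption: core named "nutch" on the default port

def search(term, rows=20):
    # Standard Solr select handler; "content", "url" and "title" are the field
    # names Nutch's example schema uses, so adjust to match your own schema.
    resp = requests.get(SOLR + "/select",
                        params={"q": 'content:"%s"' % term, "rows": rows, "wt": "json"})
    resp.raise_for_status()
    docs = resp.json()["response"]["docs"]
    return [(d.get("url"), d.get("title")) for d in docs]

for url, title in search("your term"):
    print(url, title)
```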
And in fact Nutch has a tutorial that walks through pretty much exactly what you want to do: https://wiki.apache.org/nutch/NutchTutorial
posted by rachelpapers at 6:42 AM on April 2, 2015
Be very careful. The first "web crawler" software like this wreaked havoc on websites. Don't reinvent the wheel. Read some of the history before you write your own software.
http://en.wikipedia.org/wiki/Web_crawler
posted by intermod at 9:47 AM on April 2, 2015
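At minimum that means honoring robots.txt and pausing between requests. Python's standard library handles the robots.txt part; a sketch, where the user-agent string and the delay are arbitrary choices:

```python
import time
import urllib.request
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

USER_AGENT = "my-weekly-checker"   # assumption: pick something identifiable
DELAY = 5                          # assumption: seconds to wait before each request

_robots = {}                       # cache one parser per host

def allowed(url):
    parts = urlsplit(url)
    if parts.netloc not in _robots:
        rp = RobotFileParser()
        rp.set_url(parts.scheme + "://" + parts.netloc + "/robots.txt")
        try:
            rp.read()
        except OSError:
            pass   # robots.txt unreachable: the unread parser answers "no", the cautious default
        _robots[parts.netloc] = rp
    return _robots[parts.netloc].can_fetch(USER_AGENT, url)

def polite_fetch(url):
    if not allowed(url):
        return None
    time.sleep(DELAY)              # crude politeness: pause before every request
    return urllib.request.urlopen(url, timeout=30).read()
```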
You might also look into using ElasticSearch for the search backend. It's really easy to set up and use.
posted by willF at 9:52 AM on April 2, 2015
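If you go that route, indexing and searching are plain HTTP calls; a sketch against a local node -- the index and field names are made up, and the exact URL paths shift a little between Elasticsearch versions:

```python
import requests

ES = "http://localhost:9200"
INDEX = "pages"                     # hypothetical index name

def index_page(url, text):
    # One document per page; Elasticsearch analyzes "content" for full-text search.
    resp = requests.post(ES + "/" + INDEX + "/_doc", json={"url": url, "content": text})
    resp.raise_for_status()

def search(term):
    body = {"query": {"match": {"content": term}}}
    resp = requests.get(ES + "/" + INDEX + "/_search", json=body)
    resp.raise_for_status()
    return [hit["_source"]["url"] for hit in resp.json()["hits"]["hits"]]
```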
I don't see a way to set that up with Google Custom Search.
Google Alerts does a fair job of this. It's somewhat customizable: you can choose schedule, geo region, content type (blog, news, forum, video, etc.) and other criteria. The only thing you cannot do is limit it to specific URLs, but perhaps your search term in combination with those other criteria will narrow it down enough for you?
If that doesn't work for you, IFTTT.
posted by rada at 1:29 PM on April 2, 2015
Also beware: the first web crawler I made (see my previous questions) used up our whole monthly download allowance in a couple of days. (Because I was running it over and over while testing it.)
posted by lollusc at 4:29 PM on April 2, 2015
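One way to avoid that while debugging is to cache every fetched page on disk, so repeated test runs read from the cache instead of re-downloading. A small sketch; the cache directory name is arbitrary:

```python
import hashlib
import pathlib
import urllib.request

CACHE = pathlib.Path("page_cache")   # arbitrary local directory
CACHE.mkdir(exist_ok=True)

def fetch_cached(url):
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    path = CACHE / key
    if path.exists():                # repeat runs hit the disk, not the network
        return path.read_bytes()
    data = urllib.request.urlopen(url, timeout=30).read()
    path.write_bytes(data)
    return data
```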
Hadoop+HBase+Solr+Nutch... Beware: it's a very version-sensitive setup; there are often compatibility issues between versions of these components.
posted by Stu-Pendous at 8:24 PM on April 4, 2015
Apache Lucene & Solr: A library and a server built specifically for indexing and searching documents.
Scrapy: A Python library designed for writing web scrapers.
Selenium: A library for Python (and other languages) that drives a real web browser. It can make scraping easier, since the browser will evaluate the JavaScript on the page more reliably.
posted by Axle at 6:31 AM on April 2, 2015
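For the Scrapy option, a spider that only visits a fixed URL list (no link-following) and checks for terms stays very short. A sketch, with the URL file and terms as placeholders:

```python
import scrapy

TERMS = ["term one", "term two"]     # placeholder search terms

class TermSpider(scrapy.Spider):
    name = "terms"

    def start_requests(self):
        # One request per URL from the list; no link-following, so the crawl stays bounded.
        with open("urls.txt") as f:  # placeholder: one URL per line
            for line in f:
                url = line.strip()
                if url:
                    yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        text = response.text.lower()
        for term in TERMS:
            if term.lower() in text:
                yield {"url": response.url, "term": term}
```

Run it with scrapy runspider term_spider.py -o hits.csv to get one row per URL/term match.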