<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
     xmlns:admin="http://webns.net/mvcb/"
     xmlns:content="http://purl.org/rss/1.0/modules/content/"
     xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
	<channel>
	  <title>Ask MetaFilter questions tagged with spidering</title>
      <link>http://ask.metafilter.com/tags/spidering</link>
      <description>Questions tagged with 'spidering' at Ask MetaFilter.</description>
	  <pubDate>Thu, 05 Jun 2008 09:42:41 -0800</pubDate>
	  <lastBuildDate>Thu, 05 Jun 2008 09:42:41 -0800</lastBuildDate>

      <language>en-us</language>
	  <docs>http://blogs.law.harvard.edu/tech/rss</docs>
	  <ttl>60</ttl>	  
	<item>
	<title>How to comprehensively spider a site</title>
	<link>http://ask.metafilter.com/93305/How%2Dto%2Dcomprehensively%2Dspider%2Da%2Dsite</link>	
	<description>I have been charged with doing a full audit of my company&apos;s &quot;web portal solution&quot;. This involves going through hundreds of pages and developing an incredibly detailed sitemap showing how all the pages link to one another. Please help me do this efficiently and accurately; I want to impress. I will add that this &quot;web portal solution&quot; is online but password protected, so I have not been able to find a web service that can automate this task. The ideal solution would create a document with a tree-type structure, or maybe a flowchart layout, detailing which child URLs branch off which parent URLs.&lt;br&gt;
&lt;br&gt;
It gets tricky because there are several external links that do not need to be followed, and several links are just ASP pages that differ only in a query parameter (e.g. .../menupage.asp?pageid=21, .../menupage.asp?pageid=22, etc.). Does this complicate things?&lt;br&gt;
&lt;br&gt;
Is there a Firefox add-on that can track where I click and then create a logical, visual output of the pages I visited? Basically, I need something that looks at all the links on a page, follows those links to the subpages, and repeats this process until every link in the domain has been followed.&lt;br&gt;
&lt;br&gt;
Any ideas?</description>
	<guid isPermaLink="false">tag:ask.metafilter.com,2008:site.93305</guid>
	<pubDate>Thu, 05 Jun 2008 09:42:41 -0800</pubDate>
	<category>spidering</category>
	<category>webcrawling</category>
	<dc:creator>yoyoceramic</dc:creator>
	</item>
	<item>
	<title>Datamining the public web</title>
	<link>http://ask.metafilter.com/68120/Datamining%2Dthe%2Dpublic%2Dweb</link>	
	<description>How do I build a data warehouse that scrapes data from public websites for my own use? Tools? Tips? Hi. I would like to track apartments on a classifieds site and use the data to analyze the impact of different factors on price. What I need is a tool or scripting language that would make it easy for me to spider the website and put the data in a database. Preferably, this would be an open source solution.&lt;br&gt;
&lt;br&gt;
I am also looking for good tools for extracting information from longer pieces of text. For example, on the site I want to mine, users can leave comments on every item. I would like to be able to decide whether a comment is positive, negative, or neither. I have seen this done on an online art site whose name I can&apos;t remember right now. The artist used blog posts and inferred the writer&apos;s mood from the words used.</description>
	<guid isPermaLink="false">tag:ask.metafilter.com,2007:site.68120</guid>
	<pubDate>Mon, 30 Jul 2007 01:52:51 -0800</pubDate>
	<category>datamining</category>
	<category>datawarehouse</category>
	<category>scraping</category>
	<category>spidering</category>
	<dc:creator>ilike</dc:creator>
	</item>
	<item>
	<title>&quot;He&apos;s making it up as he goes along!&quot;</title>
	<link>http://ask.metafilter.com/23542/Hes%2Dmaking%2Dit%2Dup%2Das%2Dhe%2Dgoes%2Dalong</link>	
	<description>Why is Google spidering specific but non-existent pages on my blog? Over the last couple of days I&apos;ve seen Google&apos;s bot scanning my website, trying to access specific URLs:&lt;br&gt;
&lt;br&gt;
&lt;code&gt;www.benzo8.org crawl-66-249-66-3.googlebot.com - - [03/Sep/2005:03:03:32 +0100] &quot;GET /summit/contact.html HTTP/1.1&quot; 200 38112 &quot;-&quot; &quot;Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)&quot;&lt;br&gt;
www.benzo8.org crawl-66-249-66-3.googlebot.com - - [03/Sep/2005:03:03:46 +0100] &quot;GET /pages/devonshire.html HTTP/1.1&quot; 200 38114 &quot;-&quot; &quot;Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)&quot;&lt;br&gt;
www.benzo8.org crawl-66-249-66-3.googlebot.com - - [03/Sep/2005:03:03:57 +0100] &quot;GET /pages/hampton.html HTTP/1.1&quot; 200 38111 &quot;-&quot; &quot;Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)&quot;&lt;br&gt;
www.benzo8.org crawl-66-249-66-3.googlebot.com - - [03/Sep/2005:03:04:09 +0100] &quot;GET /chatsford/contact.html HTTP/1.1&quot; 200 38115 &quot;-&quot; &quot;Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)&quot;&lt;br&gt;
&lt;/code&gt;&lt;br&gt;
Unfortunately, those URLs don&apos;t exist and they&apos;ve never existed. So why is Google requesting them? Is it just guessing, or what? Also, my site is set up so that requests which don&apos;t go to an existing page are served the index page. So if Google requests these non-existent pages and gets the same content each time (i.e. my index page), will it think (incorrectly) that I&apos;ve set up some SEO link farm and lower my PageRank as a punishment?</description>
	<guid isPermaLink="false">tag:ask.metafilter.com,2005:site.23542</guid>
	<pubDate>Fri, 02 Sep 2005 20:05:52 -0800</pubDate>
	<category>blog</category>
	<category>google</category>
	<category>non-existent</category>
	<category>spidering</category>
	<dc:creator>benzo8</dc:creator>
	</item>
	
	</channel>
</rss>

