<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
     xmlns:admin="http://webns.net/mvcb/"
     xmlns:content="http://purl.org/rss/1.0/modules/content/"
     xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
	<channel>
	  <title>Ask MetaFilter posts tagged with scraping</title>
      <link>http://ask.metafilter.com/tags/scraping</link>
      <description>tag posts with scraping</description>
	  	  <pubDate>Wed, 06 Aug 2008 15:06:05 -0800</pubDate>
      <lastBuildDate>Wed, 06 Aug 2008 15:06:05 -0800</lastBuildDate>

      <language>en-us</language>
	  <docs>http://blogs.law.harvard.edu/tech/rss</docs>
	  <ttl>60</ttl>	  
	<item>
	<title>Web scraping for dummies</title>
	<link>http://ask.metafilter.com/98518/Web-scraping-for-dummies</link>	
	<description>How does web scraping work with PHP/MySQL? What best practices are there? I&apos;m curious about how price comparison services do and manage web scraping, i.e. finding information in unstructured HTML files across many different sites and presenting the information on their own sites. Ultimately, I would like to learn enough about web scraping so that I can create a functional site that, for example, displays a list of dishes that are linked to various recipe sites.&lt;br&gt;
&lt;br&gt;
Stuff that I wonder about:&lt;br&gt;
1. In general terms, how would you code the project using PHP/MySQL? Are there any code libraries that can be used for scraping?&lt;br&gt;
&lt;br&gt;
2. I understand that you can regexp data from the scraped HTML files, but aren&apos;t there more intelligent ways of extracting the data? I&apos;m thinking about XSLT and such.&lt;br&gt;
&lt;br&gt;
3. How do you handle form-generated pages? For example, recipe sites that allow you to search for recipes by using check boxes, pull-down menus, etc.? Again, are there any smart code libraries out there that simplify this?&lt;br&gt;
&lt;br&gt;
4. Are there any best practices regarding managing scraping, storage, data manipulation, performance, ethics, etc, that I should be aware of?</description>
	<guid isPermaLink="false">tag:ask.metafilter.com,2008:site.98518</guid>
	<pubDate>Wed, 06 Aug 2008 15:06:05 -0800</pubDate>

<category>scraping</category>

<category>php</category>

<category>mysql</category>

<category>programming</category>

<category>coding</category>

	<dc:creator>Foci for Analysis</dc:creator>
	</item>
	<item>
	<title>Gathering eBay info.</title>
	<link>http://ask.metafilter.com/97355/Gathering-ebay-infos</link>	
	<description>cp /mnt/com/ebay/completed_auctions/ps3 ~/ps3.txt  ?? I have the following pieces of information:&lt;br&gt;
&lt;br&gt;
a. Search criteria&lt;br&gt;
b. Date range.&lt;br&gt;
&lt;br&gt;
And from this, I would like an Excel spreadsheet that has the following columns:&lt;br&gt;
&lt;br&gt;
end price, end date, description, item #, seller, etc&lt;br&gt;
&lt;br&gt;
Is there any way to do such a thing in an automated fashion without screen scraping myself?</description>
	<guid isPermaLink="false">tag:ask.metafilter.com,2008:site.97355</guid>
	<pubDate>Wed, 23 Jul 2008 15:48:29 -0800</pubDate>

<category>ebay</category>

<category>completed</category>

<category>auctions</category>

<category>database</category>

<category>screen</category>

<category>scraping</category>

<category>excel</category>

	<dc:creator>SirStan</dc:creator>
	</item>
	<item>
	<title>Datamining the public web</title>
	<link>http://ask.metafilter.com/68120/Datamining-the-public-web</link>	
	<description>How do I build a data warehouse that scrapes data from public websites for my own use? Tools? Tips? Hi. I would like to track apartments on a classifieds site and use the data to analyze the impact of different things on price. What I need is a tool or scripting language that would make it easy for me to spider the website and put the data in a database. Preferably this would be an open source solution.&lt;br&gt;
&lt;br&gt;
I am also looking for good tools for extracting information out of longer pieces of text. For example, on the site I want to mine, users can put in comments on every object. I would like to be able to decide if a comment is positive, negative or neither. I have seen this done on an online art site whose name I can&apos;t remember right now. The artist used blog posts and decided the mood of the writer by what words were used.</description>
	<guid isPermaLink="false">tag:ask.metafilter.com,2008:site.68120</guid>
	<pubDate>Mon, 30 Jul 2007 01:52:51 -0800</pubDate>

<category>datamining</category>

<category>datawarehouse</category>

<category>spidering</category>

<category>scraping</category>

	<dc:creator>ilike</dc:creator>
	</item>
	<item>
	<title>What is that noise?</title>
	<link>http://ask.metafilter.com/64352/What-is-that-noise</link>	
	<description>Bicyclefilter: A pronounced grinding/scraping from my rear wheel. Is this potentially unsafe, or simply embarrassing, to ride? When I say pronounced, I mean passersby chuckle. The wheel sounds really old, really sad and really grumpy. When I spin the wheel freely, the gears wobble a bit, leading me to think the threads on the gears or the axle are close to stripped (?? I can still pedal just fine, so I don&apos;t know).&lt;br&gt;
&lt;br&gt;
What&apos;s the worst that can happen, realistically? The wheel falls apart? Or I just suddenly find I&apos;m pedalling but not moving?</description>
	<guid isPermaLink="false">tag:ask.metafilter.com,2008:site.64352</guid>
	<pubDate>Fri, 08 Jun 2007 09:52:00 -0800</pubDate>

<category>bike</category>

<category>bicycle</category>

<category>rear</category>

<category>wheel</category>

<category>scraping</category>

<category>grinding</category>

<category>noise</category>

	<dc:creator>poweredbybeard</dc:creator>
	</item>
	<item>
	<title>Web scraping online banking sites?</title>
	<link>http://ask.metafilter.com/46126/Web-scraping-onling-banking-sites</link>	
	<description>Is there a tool/tutorial to web scrape secure sites like online banking or other sites that make sure that it&apos;s you that is logging in? I&apos;ve been teaching myself web scraping and found it extremely useful for building proxies for my Nokia tablet and for automation, through Perl &amp;amp; LWP.&lt;br&gt;
&lt;br&gt;
I want to write something that will grab my bank balance, but those online banking sites seem to jump through lots of hoops to make sure you are the one sitting at your computer trying to log in.&lt;br&gt;
&lt;br&gt;
I&apos;ve tried deciphering all the tactics they use, JavaScript, cookies, but it seems like there are more tricks they are using that I don&apos;t know about.&lt;br&gt;
&lt;br&gt;
Is there some utility that lets you analyze the actual behind-the-scenes events, headers, cookies, etc., so I can just repeat them?&lt;br&gt;
&lt;br&gt;
A low-level tool would be great, but I&apos;d be happy with a web recorder/automation program that would log me in (based on keystrokes, etc.) and save the HTML so I could get my balance.</description>
	<guid isPermaLink="false">tag:ask.metafilter.com,2008:site.46126</guid>
	<pubDate>Thu, 07 Sep 2006 16:28:36 -0800</pubDate>

<category>web</category>

<category>scraping</category>

<category>data</category>

<category>mining</category>

<category>perl</category>

	<dc:creator>mphuie</dc:creator>
	</item>
	<item>
	<title>Image search engine scraper/downloader</title>
	<link>http://ask.metafilter.com/38328/Image-search-engine-scraperdownloader</link>	
	<description>I need a web scraper script/proggie to download and thumbnailize images for a long list of search terms.&lt;br&gt;
&lt;br&gt;
Input: Long list of search terms.&lt;br&gt;
&lt;br&gt;
Output: corresponding images and thumbnails on my hard drive, arranged/accessed/organized by search term.&lt;br&gt;
&lt;br&gt;
My already feeble google-fough (&amp;lt;-- note ignorance of even the spelling) is failing me.</description>
	<guid isPermaLink="false">tag:ask.metafilter.com,2008:site.38328</guid>
	<pubDate>Tue, 16 May 2006 22:08:34 -0800</pubDate>

<category>web</category>

<category>robots</category>

<category>scripts</category>

<category>scraping</category>

<category>image</category>

<category>thumbnails</category>

<category>searchengines</category>

	<dc:creator>Moistener</dc:creator>
	</item>
	<item>
	<title>Web Scraping for dummies?</title>
	<link>http://ask.metafilter.com/21574/Web-Scraping-for-dummies</link>	
	<description>A project at work has come up, and I would save a lot of time and hassle if I could somehow get my hands on a free (cheap is acceptable, as long as I can try it first), easy to use web-scraping program. The URL from which I will be scraping is static, unencrypted, and otherwise extremely vanilla. Suggestions?</description>
	<guid isPermaLink="false">tag:ask.metafilter.com,2008:site.21574</guid>
	<pubDate>Fri, 22 Jul 2005 11:37:48 -0800</pubDate>

<category>web-scraping</category>

<category>web</category>

<category>scraping</category>

<category>scraper</category>

	<dc:creator>Kwantsar</dc:creator>
	</item>
	<item>
	<title>real estate hacks: scraping together a house</title>
	<link>http://ask.metafilter.com/21147/real-estate-hacks-scraping-together-a-house</link>	
	<description>I&apos;m looking for a home closer to my work, and I&apos;m trying to be systematic about it. My plan is to scrape online real estate databases and put them into a spreadsheet, then start ranking and comparing. So, I guess I have two questions:&lt;br&gt;
&lt;br&gt;
1. What would be the best tool (text wrangler?) to extract fields from an archived html page? (I am using OS X Tiger and am loosely familiar with regex)&lt;br&gt;
&lt;br&gt;
2. Would I probably be better off just driving around key neighborhoods and jotting stuff down by hand?&lt;br&gt;
&lt;br&gt;
(Sample database: &lt;a href=&quot;http://www.utahrealestate.com&quot;&gt;Utah real estate&lt;/a&gt;)</description>
	<guid isPermaLink="false">tag:ask.metafilter.com,2008:site.21147</guid>
	<pubDate>Wed, 13 Jul 2005 08:11:38 -0800</pubDate>

<category>regex</category>

<category>realestate</category>

<category>scraping</category>

<category>datamunging</category>

<category>archive</category>

	<dc:creator>craniac</dc:creator>
	</item>
	<item>
	<title>Question number 4865</title>
	<link>http://ask.metafilter.com/mefi/4865</link>	
	<description>Is there a way, without constant human intervention, to (1) mine either Google News, Yahoo News, or the AP for new obituaries and (2) drop the name, age, blurb, and URL into a database?</description>
	<guid isPermaLink="false">tag:ask.metafilter.com,2008:site.4865</guid>
	<pubDate>Fri, 23 Jan 2004 17:02:47 -0800</pubDate>

<category>computers</category>

<category>internet</category>

<category>obituary</category>

<category>mining</category>

<category>scraping</category>

<category>database</category>

	<dc:creator>PrinceValium</dc:creator>
	</item>
	
	</channel>
</rss>

